Classification of proteins: available structural space for molecular modeling.
Andreeva, Antonina
2012-01-01
The wealth of available protein structural data provides an unprecedented opportunity to study and better understand the underlying principles of protein folding and protein structure evolution. A key to achieving this lies in the ability to analyse these data and to organize them in a coherent classification scheme. Over the past years several protein classifications have been developed that aim to group proteins based on their structural relationships. Some of these classification schemes explore the concept of structural neighbourhood (structural continuum), whereas others utilize the notion of protein evolution and thus provide a discrete rather than a continuum view of protein structure space. This chapter presents a strategy for classification of proteins with known three-dimensional structure. Steps in the classification process along with basic definitions are introduced. Examples illustrating some fundamental concepts of protein folding and evolution with a special focus on the exceptions to them are presented.
ECOD: An Evolutionary Classification of Protein Domains
Kinch, Lisa N.; Pei, Jimin; Shi, Shuoyong; Kim, Bong-Hyun; Grishin, Nick V.
2014-01-01
Understanding the evolution of a protein, including both close and distant relationships, often reveals insight into its structure and function. Fast and easy access to such up-to-date information facilitates research. We have developed a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, and presented it as an interactive and updatable online database. ECOD (Evolutionary Classification of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationships (homology), rather than topology (or “fold”). This distinction highlights cases of homology between domains of differing topology to aid in understanding of protein structure evolution. ECOD uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary links among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as conserved functional sites. ECOD also recognizes closer sequence-based relationships between protein domains. Currently, approximately 100,000 protein structures are classified in ECOD into 9,000 sequence families clustered into close to 2,000 evolutionary groups. The classification is assisted by an automated pipeline that quickly and consistently classifies weekly releases of PDB structures and allows for continual updates. This synchronization with PDB uniquely distinguishes ECOD among all protein classifications. Finally, we present several case studies of homologous proteins not recorded in other classifications, illustrating the potential of how ECOD can be used to further biological and evolutionary studies. PMID:25474468
ECOD: an evolutionary classification of protein domains.
Cheng, Hua; Schaeffer, R Dustin; Liao, Yuxing; Kinch, Lisa N; Pei, Jimin; Shi, Shuoyong; Kim, Bong-Hyun; Grishin, Nick V
2014-12-01
Understanding the evolution of a protein, including both close and distant relationships, often reveals insight into its structure and function. Fast and easy access to such up-to-date information facilitates research. We have developed a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, and presented it as an interactive and updatable online database. ECOD (Evolutionary Classification of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationships (homology), rather than topology (or "fold"). This distinction highlights cases of homology between domains of differing topology to aid in understanding of protein structure evolution. ECOD uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary links among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as conserved functional sites. ECOD also recognizes closer sequence-based relationships between protein domains. Currently, approximately 100,000 protein structures are classified in ECOD into 9,000 sequence families clustered into close to 2,000 evolutionary groups. The classification is assisted by an automated pipeline that quickly and consistently classifies weekly releases of PDB structures and allows for continual updates. This synchronization with PDB uniquely distinguishes ECOD among all protein classifications. Finally, we present several case studies of homologous proteins not recorded in other classifications, illustrating the potential of how ECOD can be used to further biological and evolutionary studies.
The value of protein structure classification information—Surveying the scientific literature
Fox, Naomi K.; Brenner, Steven E.
2015-01-01
ABSTRACT The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP–extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012–2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non‐SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings. Proteins 2015; 83:2025–2038. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc. PMID:26313554
Visualizing and Clustering Protein Similarity Networks: Sequences, Structures, and Functions.
Mai, Te-Lun; Hu, Geng-Ming; Chen, Chi-Ming
2016-07-01
Research in the recent decade has demonstrated the usefulness of protein network knowledge in furthering the study of molecular evolution of proteins, understanding the robustness of cells to perturbation, and annotating new protein functions. In this study, we aimed to provide a general clustering approach to visualize the sequence-structure-function relationship of protein networks, and investigate possible causes for inconsistency in the protein classifications based on sequences, structures, and functions. Such visualization of protein networks could facilitate our understanding of the overall relationship among proteins and help researchers comprehend various protein databases. As a demonstration, we clustered 1437 enzymes by their sequences and structures using the minimum span clustering (MSC) method. The general structure of this protein network was delineated at two clustering resolutions, and the second level MSC clustering was found to be highly similar to existing enzyme classifications. The clustering of these enzymes based on sequence, structure, and function information is consistent with each other. For proteases, the Jaccard's similarity coefficient is 0.86 between sequence and function classifications, 0.82 between sequence and structure classifications, and 0.78 between structure and function classifications. From our clustering results, we discussed possible examples of divergent evolution and convergent evolution of enzymes. Our clustering approach provides a panoramic view of the sequence-structure-function network of proteins, helps visualize the relation between related proteins intuitively, and is useful in predicting the structure and function of newly determined protein sequences.
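The agreement figures quoted above are Jaccard similarity coefficients between two classifications of the same proteins. The abstract does not say how the coefficient was computed, so the following is only a minimal sketch under the assumption of a pair-counting definition: the set of protein pairs grouped together by one classification is compared with the set grouped together by the other. The function names and the toy labels are hypothetical and are not data from the study.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of item pairs that share a cluster label."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def jaccard_between_classifications(labels_a, labels_b):
    """Pair-counting Jaccard coefficient between two labelings of the same items."""
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    union = pairs_a | pairs_b
    if not union:
        return 1.0  # neither labeling groups any pair: treat as identical
    return len(pairs_a & pairs_b) / len(union)


# Hypothetical toy labels for five enzymes (illustration only).
by_sequence = {"e1": "S1", "e2": "S1", "e3": "S1", "e4": "S2", "e5": "S2"}
by_function = {"e1": "F1", "e2": "F1", "e3": "F2", "e4": "F2", "e5": "F2"}
score = jaccard_between_classifications(by_sequence, by_function)
```

In this toy case each labeling groups four pairs, two of which are shared, so the coefficient is 2/6. Both dictionaries must cover the same set of proteins for the comparison to be meaningful.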
The value of protein structure classification information-Surveying the scientific literature
Fox, Naomi K.; Brenner, Steven E.; Chandonia, John -Marc
2015-08-27
The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.
The value of protein structure classification information-Surveying the scientific literature
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fox, Naomi K.; Brenner, Steven E.; Chandonia, John -Marc
The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.
Classification of proteins with shared motifs and internal repeats in the ECOD database
Kinch, Lisa N.; Liao, Yuxing
2016-01-01
Abstract Proteins and their domains evolve by a set of events commonly including the duplication and divergence of small motifs. The presence of short repetitive regions in domains has generally constituted a difficult case for structural domain classifications and their hierarchies. We developed the Evolutionary Classification Of protein Domains (ECOD) in part to implement a new schema for the classification of these types of proteins. Here we document the ways in which ECOD classifies proteins with small internal repeats, widespread functional motifs, and assemblies of small domain‐like fragments in its evolutionary schema. We illustrate the ways in which the structural genomics project impacted the classification and characterization of new structural domains and sequence families over the decade. PMID:26833690
TIM Barrel Protein Structure Classification Using Alignment Approach and Best Hit Strategy
NASA Astrophysics Data System (ADS)
Chu, Jia-Han; Lin, Chun Yuan; Chang, Cheng-Wen; Lee, Chihan; Yang, Yuh-Shyong; Tang, Chuan Yi
2007-11-01
The classification of protein structures is essential for determining their function in bioinformatics. It has been estimated from the Structural Classification of Proteins (SCOP) database that around 10% of all known enzymes have TIM barrel domains. With its high sequence variation and diverse functionalities, the TIM barrel protein has become an attractive target for protein engineering and for evolutionary studies. Hence, in this paper, an alignment approach with the best hit strategy is proposed to classify TIM barrel protein structures at the superfamily and family levels in SCOP. The approach is also used for classification at the class level in the Enzyme nomenclature (ENZYME) database. Two test data sets, TIM40D and TIM95D, are used to evaluate this approach. The resulting classification has an overall prediction accuracy of 90.3% for the superfamily level in SCOP, 89.5% for the family level in SCOP and 70.1% for the class level in ENZYME. These results demonstrate that the alignment approach with the best hit strategy is a simple and viable method for TIM barrel protein structure classification, even when only amino acid sequence information is available.
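The best hit strategy described above can be illustrated with a small sketch: a query is aligned against every labelled sequence and inherits the label of the highest-scoring hit. The abstract does not name the alignment tool or scoring scheme, so the global alignment with unit match, mismatch and gap scores used here is an assumption, and the sequences and labels are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not those of the paper.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def classify_by_best_hit(query, labelled_sequences):
    """Assign the label of the highest-scoring aligned sequence (the best hit)."""
    best_label, best_score = None, float("-inf")
    for sequence, label in labelled_sequences:
        score = global_alignment_score(query, sequence)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical reference set and query (illustration only).
reference = [
    ("MKVLAAGIV", "superfamily_A"),
    ("GSHMTTPLE", "superfamily_B"),
]
label, score = classify_by_best_hit("MKVLAGIV", reference)
```

The same function could be applied at the family level or to ENZYME classes by changing the labels attached to the reference sequences; a production setup would normally replace the toy scorer with an established alignment program.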
An updated version of NPIDB includes new classifications of DNA–protein complexes and their families
Zanegina, Olga; Kirsanov, Dmitriy; Baulin, Eugene; Karyagina, Anna; Alexeevski, Andrei; Spirin, Sergey
2016-01-01
The recent upgrade of nucleic acid–protein interaction database (NPIDB, http://npidb.belozersky.msu.ru/) includes a newly elaborated classification of complexes of protein domains with double-stranded DNA and a classification of families of related complexes. Our classifications are based on contacting structural elements of both DNA: the major groove, the minor groove and the backbone; and protein: helices, beta-strands and unstructured segments. We took into account both hydrogen bonds and hydrophobic interaction. The analyzed material contains 1942 structures of protein domains from 748 PDB entries. We have identified 97 interaction modes of individual protein domain–DNA complexes and 17 DNA–protein interaction classes of protein domain families. We analyzed the sources of diversity of DNA–protein interaction modes in different complexes of one protein domain family. The observed interaction mode is sometimes influenced by artifacts of crystallization or diversity in secondary structure assignment. The interaction classes of domain families are more stable and thus possess more biological sense than a classification of single complexes. Integration of the classification into NPIDB allows the user to browse the database according to the interacting structural elements of DNA and protein molecules. For each family, we present average DNA shape parameters in contact zones with domains of the family. PMID:26656949
Chandonia, John-Marc; Fox, Naomi K; Brenner, Steven E
2017-02-03
SCOPe (Structural Classification of Proteins-extended, http://scop.berkeley.edu) is a database of relationships between protein structures that extends the Structural Classification of Proteins (SCOP) database. SCOP is an expert-curated ordering of domains from the majority of proteins of known structure in a hierarchy according to structural and evolutionary relationships. SCOPe classifies the majority of protein structures released since SCOP development concluded in 2009, using a combination of manual curation and highly precise automated tools, aiming to have the same accuracy as fully hand-curated SCOP releases. SCOPe also incorporates and updates the ASTRAL compendium, which provides several databases and tools to aid in the analysis of the sequences and structures of proteins classified in SCOPe. SCOPe continues high-quality manual classification of new superfamilies, a key feature of SCOP. Artifacts such as expression tags are now separated into their own class, in order to distinguish them from the homology-based annotations in the remainder of the SCOPe hierarchy. SCOPe 2.06 contains 77,439 Protein Data Bank entries, double the 38,221 structures classified in SCOP. Copyright © 2016 The Author(s). Published by Elsevier Ltd.. All rights reserved.
Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification.
Sinclair, Robert M; Ravantti, Janne J; Bamford, Dennis H
2017-04-15
Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allow them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. 
Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. Copyright © 2017 Sinclair et al.
Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification
Sinclair, Robert M.; Ravantti, Janne J.
2017-01-01
ABSTRACT Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allow them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. 
Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. PMID:28122979
Protein classification using sequential pattern mining.
Exarchos, Themis P; Papaloukas, Costas; Lampros, Christos; Fotiadis, Dimitrios I
2006-01-01
Protein classification in terms of fold recognition can be employed to determine the structural and functional properties of a newly discovered protein. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. One of the most efficient SPM algorithms, cSPADE, is employed for protein primary structure analysis. Then a classifier uses the extracted sequential patterns for classifying proteins of unknown structure in the appropriate fold category. The proposed methodology exhibited an overall accuracy of 36% in a multi-class problem of 17 candidate categories. The classification performance reaches up to 65% when the three most probable protein folds are considered.
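The two accuracy figures above differ only in how many of the classifier's ranked fold predictions are counted as a hit. A minimal sketch of that top-k accuracy measure follows; the function name and the toy predictions are hypothetical and do not reproduce the study's data.

```python
def top_k_accuracy(true_folds, ranked_predictions, k):
    """Fraction of proteins whose true fold is among the k highest-ranked predictions."""
    if not true_folds:
        raise ValueError("need at least one protein")
    hits = sum(
        1
        for truth, ranked in zip(true_folds, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_folds)


# Hypothetical ranked outputs for four proteins, most probable fold first.
true_folds = ["fold_a", "fold_b", "fold_c", "fold_a"]
ranked_predictions = [
    ["fold_a", "fold_b", "fold_c"],
    ["fold_c", "fold_b", "fold_a"],
    ["fold_a", "fold_b", "fold_c"],
    ["fold_b", "fold_c", "fold_d"],
]
top1 = top_k_accuracy(true_folds, ranked_predictions, 1)
top3 = top_k_accuracy(true_folds, ranked_predictions, 3)
```

With k = 1 only the first protein counts as correct, while with k = 3 three of the four do, which mirrors how the reported accuracy rises when the three most probable folds are considered.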
The Classification of Protein Domains.
Dawson, Natalie; Sillitoe, Ian; Marsden, Russell L; Orengo, Christine A
2017-01-01
The significant expansion in protein sequence and structure data that we are now witnessing brings with it a pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution of proteins, their function and the extent to which the functional repertoire can vary across the three kingdoms of life. This has led to the creation of a wide range of protein family classifications that aim to group proteins based upon their evolutionary relationships. In this chapter we discuss the approaches and methods that are frequently used in the classification of proteins, with a specific emphasis on the classification of protein domains. The construction of both domain sequence and domain structure databases is considered and we show how the use of domain family annotations to assign structural and functional information is enhancing our understanding of genomes.
A topological approach for protein classification
Cang, Zixuan; Mu, Lin; Wu, Kedi; ...
2015-11-04
Here, protein function and dynamics are closely related to its sequence and structure. However, prediction of protein function and dynamics from its sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done through measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.
Kavianpour, Hamidreza; Vasighi, Mahdi
2017-02-01
Nowadays, knowledge about the cellular attributes of proteins plays an important role in pharmacy, medical science and molecular biology. These attributes are closely correlated with the function and three-dimensional structure of proteins. Knowledge of protein structural class is used by various methods to better understand protein functionality and folding patterns. Computational methods and intelligent systems can play an important role in performing structural classification of proteins. Most protein sequences are stored in databanks as character strings, and a numerical representation is essential for applying machine learning methods. In this work, a binary representation of protein sequences is introduced based on reduced amino acid alphabets according to the surrounding hydrophobicity index. Many important features which are hidden in these long binary sequences can be clearly displayed through their cellular automata images. The features extracted from these images are used to build a classification model with a support vector machine. Compared to previous studies on several benchmark datasets, the promising classification rates obtained by tenfold cross-validation imply that the current approach can help in revealing some inherent features deeply hidden in protein sequences and improve the quality of predicting protein structural class.
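To make the pipeline above more concrete, the sketch below turns a sequence into a binary string and evolves it with a one-dimensional cellular automaton, producing the rows of a small image. Several details are assumptions made only for illustration: the two-letter alphabet is a generic hydrophobic versus non-hydrophobic split rather than the surrounding hydrophobicity index used in the paper, and the update rule (elementary rule 90 with wrap-around boundaries) and step count are arbitrary choices.

```python
# Assumption: a simple hydrophobic residue set, not the paper's reduced alphabet.
HYDROPHOBIC = set("AVLIMFWC")


def encode_binary(sequence):
    """Map each residue to 1 (hydrophobic) or 0 (anything else)."""
    return [1 if residue in HYDROPHOBIC else 0 for residue in sequence.upper()]


def evolve(cells, rule=90, steps=4):
    """Evolve a binary row with an elementary cellular automaton.

    Returns a list of rows (the initial row plus one row per step), which
    can be viewed as a binary image. Boundaries wrap around.
    """
    rule_bits = [(rule >> i) & 1 for i in range(8)]
    rows = [list(cells)]
    n = len(cells)
    for _ in range(steps):
        prev = rows[-1]
        rows.append(
            [
                rule_bits[
                    (prev[(i - 1) % n] << 2) | (prev[i] << 1) | prev[(i + 1) % n]
                ]
                for i in range(n)
            ]
        )
    return rows


# Hypothetical short peptide (illustration only).
image = evolve(encode_binary("MKVLSTAGW"), rule=90, steps=4)
```

Features such as row densities or texture statistics could then be computed from `image` and passed to a classifier; that step is left out here because the abstract does not specify which image features were extracted.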
Automatic classification of protein structures using physicochemical parameters.
Mohan, Abhilash; Rao, M Divya; Sunderrajan, Shruthi; Pennathur, Gautam
2014-09-01
Protein classification is the first step to functional annotation; SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the disproportion in the number of three dimensional (3D) protein structures generated versus their classification into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting function of novel proteins based on sequence information alone has proven to be a major challenge. The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family, using sequence derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure was used as a benchmark to compare the performance of the physicochemical parameters. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was studied. Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a classification accuracy above 90%, while spectrophores performed with a CA of around 85%. Feature selection improved classification accuracy for both physicochemical parameters and spectrophores based machine learning algorithms. Combining both attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with classification accuracy ranging from 90-96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.
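Feature selection by information gain, as mentioned above, ranks each feature by how much knowing its value reduces the entropy of the class labels. The sketch below computes that quantity for discrete feature values; the study's own discretisation and per-family selection procedure are not described in the abstract, so the binned values and labels here are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total) for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - conditional


# Hypothetical binned physicochemical features for four proteins.
families = ["fam_1", "fam_1", "fam_2", "fam_2"]
informative = ["high", "high", "low", "low"]
uninformative = ["high", "low", "high", "low"]
gain_informative = information_gain(informative, families)
gain_uninformative = information_gain(uninformative, families)
```

A feature that separates the two families perfectly yields a gain of one bit, while a feature that is independent of the family yields zero, so ranking by this value keeps the most discriminative parameters.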
Andreeva, Antonina
2016-06-15
The Structural Classification of Proteins (SCOP) database has facilitated the development of many tools and algorithms and it has been successfully used in protein structure prediction and large-scale genome annotations. During the development of SCOP, numerous exceptions were found to topological rules, along with complex evolutionary scenarios and peculiarities in proteins including the ability to fold into alternative structures. This article reviews cases of structural variations observed for individual proteins and among groups of homologues, knowledge of which is essential for protein structure modelling. © 2016 The Author(s). published by Portland Press Limited on behalf of the Biochemical Society.
Automatic Classification of Protein Structure Using the Maximum Contact Map Overlap Metric
DOE Office of Scientific and Technical Information (OSTI.GOV)
Andonov, Rumen; Djidjev, Hristo Nikolov; Klau, Gunnar W.
In this paper, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to no-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact k-nearest neighbor (k-NN) scheme classifies up to 224 out of 236 queries correctly and on a larger, extended version of the benchmark with 60,850 additional structures, up to 1361 out of 1369 queries. Finally, our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on flexible contact map overlap alignments.
Automatic Classification of Protein Structure Using the Maximum Contact Map Overlap Metric
Andonov, Rumen; Djidjev, Hristo Nikolov; Klau, Gunnar W.; ...
2015-10-09
In this paper, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to no-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact k-nearest neighbor (k-NN) scheme classifies up to 224 out of 236 queries correctly and on a larger, extended version of the benchmark with 60,850 additional structures, up to 1361 out of 1369 queries. Finally, our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on flexible contact map overlap alignments.
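The remark that a true metric lets one skip pairwise comparisons rests on the triangle inequality: if distances to a fixed pivot are precomputed, |d(query, pivot) - d(x, pivot)| is a lower bound on d(query, x), and candidates whose bound already exceeds the best distance found can be discarded. The sketch below shows this pruning for an exact nearest-neighbour search. It does not implement max-CMO, which requires optimising over alignments; as a stand-in it uses the Jaccard distance between sets of contacts, which is also a metric. All names and the toy contact sets are hypothetical.

```python
def jaccard_distance(contacts_a, contacts_b):
    """Jaccard distance between two sets of residue contacts (a true metric)."""
    union = contacts_a | contacts_b
    if not union:
        return 0.0
    return 1.0 - len(contacts_a & contacts_b) / len(union)


def build_index(database, pivot):
    """Precompute each database entry's distance to the pivot."""
    return [
        (name, contacts, jaccard_distance(contacts, pivot))
        for name, contacts in database
    ]


def nearest_neighbor(query, index, pivot):
    """Exact 1-NN search that skips entries ruled out by the triangle inequality."""
    d_query_pivot = jaccard_distance(query, pivot)
    best_name, best_dist, evaluated = None, float("inf"), 0
    for name, contacts, d_pivot in index:
        # Lower bound: d(query, x) >= |d(query, pivot) - d(x, pivot)|
        if abs(d_query_pivot - d_pivot) >= best_dist:
            continue
        evaluated += 1
        dist = jaccard_distance(query, contacts)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist, evaluated


# Hypothetical contact sets: pairs of residue indices in contact.
pivot = frozenset({(1, 5), (2, 6), (3, 7)})
database = [
    ("A", frozenset({(1, 5), (2, 6), (3, 7), (4, 8)})),
    ("B", frozenset({(10, 20), (11, 21)})),
    ("C", frozenset({(1, 5), (2, 6)})),
]
query = frozenset({(1, 5), (2, 6), (3, 7), (4, 9)})
name, dist, evaluated = nearest_neighbor(query, build_index(database, pivot), pivot)
```

In this toy run entry B is never compared with the query because its lower bound already exceeds the distance to A, so only two of the three distances are computed. With an expensive distance such as an alignment-based overlap, each skipped comparison is where the saving comes from.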
An information-based network approach for protein classification
Wan, Xiaogeng; Zhao, Xin; Yau, Stephen S. T.
2017-01-01
Protein classification is one of the critical problems in bioinformatics. Early studies used geometric distances and phylogenetic trees to classify proteins. These methods use binary trees to represent protein classification. In this paper, we propose a new protein classification method, whereby theories of information and networks are used to classify the multivariate relationships of proteins. In this study, the protein universe is modeled as an undirected network, where proteins are classified according to their connections. Our method is unsupervised, multivariate, and alignment-free. It can be applied to the classification of both protein sequences and structures. Nine examples are used to demonstrate the efficiency of our new method. PMID:28350835
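The idea of classifying proteins by their connections in an undirected network can be sketched as follows. The abstract does not describe how edges are derived from the information-theoretic measure, so this example simply assumes that an edge list is already available and treats each connected component as one class; the protein names and edges are hypothetical.

```python
def connected_components(nodes, edges):
    """Group nodes of an undirected network into connected components."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Hypothetical proteins and edges (assumed to come from some pairwise measure).
proteins = ["p1", "p2", "p3", "p4", "p5"]
links = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
classes = connected_components(proteins, links)
```

Unlike a binary tree, a network of this kind can record that one protein is related to several others at once, which is the multivariate aspect the abstract emphasises.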
SCOWLP classification: Structural comparison and analysis of protein binding regions
Teyra, Joan; Paszkowski-Rogacz, Maciej; Anders, Gerd; Pisabarro, M Teresa
2008-01-01
Background: Detailed information about protein interactions is critical for our understanding of the principles governing protein recognition mechanisms. The structures of many proteins have been experimentally determined in complex with different ligands bound either in the same or different binding regions. Thus, the structural interactome requires the development of tools to classify protein binding regions. A proper classification may provide a general view of the regions that a protein uses to bind others and also facilitate a detailed comparative analysis of the interacting information for specific protein binding regions at atomic level. Such classification might be of potential use for deciphering protein interaction networks, understanding protein function, rational engineering and design. Description: Protein binding regions (PBRs) might be ideally described as well-defined separated regions that share no interacting residues with one another. However, PBRs are often irregular, discontinuous and can share a wide range of interacting residues among them. The criteria to define an individual binding region can often be arbitrary and may differ from other binding regions within a protein family. Therefore, the rationale behind protein interface classification should aim to fulfil the requirements of the analysis to be performed. We extract detailed interaction information of protein domains, peptides and interfacial solvent from the SCOWLP database and we classify the PBRs of each domain family. For this purpose, we define a similarity index based on the overlapping of interacting residues mapped in pair-wise structural alignments. We perform our classification with agglomerative hierarchical clustering using the complete-linkage method. Our classification is calculated at different similarity cut-offs to allow flexibility in the analysis of PBRs, a feature especially interesting for those protein families with conflictive binding regions.
The hierarchical classification of PBRs is implemented into the SCOWLP database and extends the SCOP classification with three additional family sub-levels: Binding Region, Interface and Contacting Domains. SCOWLP contains 9,334 binding regions distributed within 2,561 families. In 65% of the cases we observe families containing more than one binding region. Besides, 22% of the regions form complexes with more than one different protein family. Conclusion: The current SCOWLP classification and its web application represent a framework for the study of protein interfaces and comparative analysis of protein family binding regions. This comparison can be performed at atomic level and allows the user to study interactome conservation and variability. The new SCOWLP classification may be of great utility for reconstruction of protein complexes, understanding protein networks and ligand design. SCOWLP will be updated with every SCOP release. The web application is available at . PMID:18182098
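The clustering step described in this abstract can be sketched as follows. This is a minimal illustration, not the SCOWLP code: the abstract does not give the formula for the similarity index, so a Jaccard overlap of interacting-residue sets is assumed here, and all names are hypothetical. Complete linkage means two clusters are only as similar as their least similar pair of members.

```python
def jaccard(residues_a, residues_b):
    """Overlap of two sets of interacting residues (assumed similarity index)."""
    union = residues_a | residues_b
    return len(residues_a & residues_b) / len(union) if union else 1.0


def complete_linkage(regions, cutoff):
    """Agglomerative clustering of binding regions at a similarity cut-off.

    `regions` maps a region name to its set of interacting residue numbers
    (already mapped onto a common alignment). Clusters are merged while the
    best complete-linkage similarity is at least `cutoff`.
    """
    clusters = [[name] for name in regions]

    def linkage(c1, c2):
        # Complete linkage: similarity of the least similar member pair.
        return min(jaccard(regions[x], regions[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        sim, i, j = max(
            ((linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if sim < cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Running the same function at several cut-offs gives the kind of flexibility the abstract mentions: a strict cut-off keeps partially overlapping regions apart, while a looser one merges them.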
3D Complex: A Structural Classification of Protein Complexes
Levy, Emmanuel D; Pereira-Leal, Jose B; Chothia, Cyrus; Teichmann, Sarah A
2006-01-01
Most of the proteins in a cell assemble into complexes to carry out their function. It is therefore crucial to understand the physicochemical properties as well as the evolution of interactions between proteins. The Protein Data Bank represents an important source of information for such studies, because more than half of the structures are homo- or heteromeric protein complexes. Here we propose the first hierarchical classification of whole protein complexes of known 3-D structure, based on representing their fundamental structural features as a graph. This classification provides the first overview of all the complexes in the Protein Data Bank and allows nonredundant sets to be derived at different levels of detail. This reveals that between one-half and two-thirds of known structures are multimeric, depending on the level of redundancy accepted. We also analyse the structures in terms of the topological arrangement of their subunits and find that they form a small number of arrangements compared with all theoretically possible ones. This is because most complexes contain four subunits or less, and the large majority are homomeric. In addition, there is a strong tendency for symmetry in complexes, even for heteromeric complexes. Finally, through comparison of Biological Units in the Protein Data Bank with the Protein Quaternary Structure database, we identified many possible errors in quaternary structure assignments. Our classification, available as a database and Web server at http://www.3Dcomplex.org, will be a starting point for future work aimed at understanding the structure and evolution of protein complexes. PMID:17112313
Fourier-based classification of protein secondary structures.
Shu, Jian-Jun; Yong, Kian Yan
2017-04-15
The correct prediction of protein secondary structures is one of the key issues in predicting the correct protein folded shape, which is used for determining gene function. Existing methods make use of amino acid properties as indices to classify protein secondary structures, but are faced with a significant number of misclassifications. The paper presents a technique for the classification of protein secondary structures based on protein "signal-plotting" and the use of the Fourier technique for digital signal processing. New indices are proposed to classify protein secondary structures by analyzing hydrophobicity profiles. The approach is simple and straightforward. Results show that more types of protein secondary structures can be classified by means of these newly-proposed indices. Copyright © 2017 Elsevier Inc. All rights reserved.
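The general idea of treating a hydrophobicity profile as a signal and examining it in the Fourier domain can be sketched as below. This does not reproduce the paper's indices, which the abstract does not define. The Kyte-Doolittle scale and the function names are assumptions; the sketch only measures how strongly a profile oscillates at a chosen period (roughly 3.6 residues for an amphipathic α-helix, roughly 2 for an alternating β-strand).

```python
import cmath

# Kyte-Doolittle hydropathy values (assumed scale; the paper may use another).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence):
    """Map a one-letter amino acid sequence to a numeric signal."""
    return [KYTE_DOOLITTLE[residue] for residue in sequence]


def periodicity_power(sequence, period):
    """Power of the mean-centred profile at the given period (in residues)."""
    profile = hydrophobicity_profile(sequence)
    mean = sum(profile) / len(profile)
    total = sum((h - mean) * cmath.exp(-2j * cmath.pi * k / period)
                for k, h in enumerate(profile))
    return abs(total) ** 2 / len(profile)
```

A simple usage pattern is to compare `periodicity_power(seq, 3.6)` with `periodicity_power(seq, 2.0)` for a sequence window; which of the two dominates hints at helix-like or strand-like periodicity, though a real classifier would need calibrated thresholds.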
Fan, Ming; Zheng, Bin; Li, Lihua
2015-10-01
Knowledge of the structural class of a given protein is important for understanding its folding patterns. Although considerable effort has been made, predicting protein structural class solely from protein sequences remains a challenging problem. The feature extraction and classification of proteins are the main problems in prediction. In this research, we extended our earlier work regarding these two aspects. In protein feature extraction, we proposed a scheme by calculating the word frequency and word position from sequences of amino acid, reduced amino acid, and secondary structure. For an accurate classification of the structural class of protein, we developed a novel Multi-Agent Ada-Boost (MA-Ada) method by integrating the features of Multi-Agent system into Ada-Boost algorithm. Extensive experiments were conducted to test and compare the proposed method using four low-homology benchmark datasets. The results showed classification accuracies of 88.5%, 96.0%, 88.4%, and 85.5%, respectively, which are much better than those of existing methods. The source code and dataset are available on request.
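The feature extraction step, word frequency and word position computed from a sequence, can be illustrated with a short sketch. The exact definitions used by the authors are not given in the abstract, so this assumes overlapping words of length k, relative frequency, and mean start index as the position feature; the function name is hypothetical.

```python
from collections import defaultdict


def word_features(sequence, k=2):
    """Return {word: (relative frequency, mean start index)} for k-letter words.

    Works on any string alphabet, so the same function could be applied to
    amino acid, reduced amino acid, or secondary structure strings.
    """
    positions = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    total = sum(len(starts) for starts in positions.values())
    return {word: (len(starts) / total, sum(starts) / len(starts))
            for word, starts in positions.items()}
```

The resulting dictionary can be flattened into a fixed-length vector by iterating over all possible words in a fixed order and filling in zeros for absent ones, which is the form a boosting classifier would typically expect.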
An alternative view of protein fold space.
Shindyalov, I N; Bourne, P E
2000-02-15
Comparing and subsequently classifying protein structure information has received significant attention concurrent with the increase in the number of experimentally derived 3-dimensional structures. Classification schemes have focused on biological function found within protein domains and on structure classification based on topology. Here an alternative view is presented that groups substructures. Substructures are long (50-150 residue) highly repetitive near-contiguous pieces of polypeptide chain that occur frequently in a set of proteins from the PDB defined as structurally non-redundant over the complete polypeptide chain. The substructure classification is based on a previously reported Combinatorial Extension (CE) algorithm that provides a significantly different set of structure alignments than those previously described, having, for example, only a 40% overlap with FSSP. Qualitatively the algorithm provides longer contiguous aligned segments at the price of a slightly higher root-mean-square deviation (rmsd). Clustering these alignments gives a discrete and highly repetitive set of substructures not detectable by sequence similarity alone. In some cases different substructures represent all or different parts of well known folds indicative of the Russian doll effect--the continuity of protein fold space. In other cases they fall into different structure and functional classifications. It is too early to determine whether these newly classified substructures represent new insights into the evolution of a structural framework important to many proteins. What is apparent from on-going work is that these substructures have the potential to be useful probes in finding remote sequence homology and in structure prediction studies. The characteristics of the complete all-by-all comparison of the polypeptide chains present in the PDB and details of the filtering procedure by pair-wise structure alignment that led to the emergent substructure gallery are discussed.
Substructure classification, alignments, and tools to analyze them are available at http://cl.sdsc.edu/ce.html.
Esque, Jérémy; Urbain, Aurélie; Etchebest, Catherine; de Brevern, Alexandre G
2015-11-01
Transmembrane proteins (TMPs) are major drug targets, but the knowledge of their precise topology structure remains highly limited compared with globular proteins. In spite of the difficulties in obtaining their structures, an important effort has been made these last years to increase their number from an experimental and computational point of view. In view of this emerging challenge, the development of computational methods to extract knowledge from these data is crucial for the better understanding of their functions and in improving the quality of structural models. Here, we revisit an efficient unsupervised learning procedure, called Hybrid Protein Model (HPM), which is applied to the analysis of transmembrane proteins belonging to the all-α structural class. HPM method is an original classification procedure that efficiently combines sequence and structure learning. The procedure was initially applied to the analysis of globular proteins. In the present case, HPM classifies a set of overlapping protein fragments, extracted from a non-redundant databank of TMP 3D structure. After fine-tuning of the learning parameters, the optimal classification results in 65 clusters. They represent at best similar relationships between sequence and local structure properties of TMPs. Interestingly, HPM distinguishes among the resulting clusters two helical regions with distinct hydrophobic patterns. This underlines the complexity of the topology of these proteins. The HPM classification enlightens unusual relationship between amino acids in TMP fragments, which can be useful to elaborate new amino acids substitution matrices. Finally, two challenging applications are described: the first one aims at annotating protein functions (channel or not), the second one intends to assess the quality of the structures (X-ray or models) via a new scoring function deduced from the HPM classification.
Arana-Daniel, Nancy; Gallegos, Alberto A; López-Franco, Carlos; Alanís, Alma Y; Morales, Jacob; López-Franco, Adriana
2016-01-01
With the increasing power of computers, the amount of data that can be processed in small periods of time has grown exponentially, as has the importance of classifying large-scale data efficiently. Support vector machines have shown good results classifying large amounts of high-dimensional data, such as data generated by protein structure prediction, spam recognition, medical diagnosis, optical character recognition and text classification, etc. Most state of the art approaches for large-scale learning use traditional optimization methods, such as quadratic programming or gradient descent, which makes the use of evolutionary algorithms for training support vector machines an area to be explored. The present paper proposes an approach that is simple to implement based on evolutionary algorithms and Kernel-Adatron for solving large-scale classification problems, focusing on protein structure prediction. The functional properties of proteins depend upon their three-dimensional structures. Knowing the structures of proteins is crucial for biology and can lead to improvements in areas such as medicine, agriculture and biofuels.
Classification of Dynamical Diffusion States in Single Molecule Tracking Microscopy
Bosch, Peter J.; Kanger, Johannes S.; Subramaniam, Vinod
2014-01-01
Single molecule tracking of membrane proteins by fluorescence microscopy is a promising method to investigate dynamic processes in live cells. Translating the trajectories of proteins to biological implications, such as protein interactions, requires the classification of protein motion within the trajectories. Spatial information of protein motion may reveal where the protein interacts with cellular structures, because binding of proteins to such structures often alters their diffusion speed. For dynamic diffusion systems, we provide an analytical framework to determine in which diffusion state a molecule is residing during the course of its trajectory. We compare different methods for the quantification of motion to utilize this framework for the classification of two diffusion states (two populations with different diffusion speed). We found that a gyration quantification method and a Bayesian statistics-based method are the most accurate in diffusion-state classification for realistic experimentally obtained datasets, of which the gyration method is much less computationally demanding. After classification of the diffusion, the lifetime of the states can be determined, and images of the diffusion states can be reconstructed at high resolution. Simulations validate these applications. We apply the classification and its applications to experimental data to demonstrate the potential of this approach to obtain further insights into the dynamics of cell membrane proteins. PMID:25099798
Protein Structure Classification and Loop Modeling Using Multiple Ramachandran Distributions.
Najibi, Seyed Morteza; Maadooliat, Mehdi; Zhou, Lan; Huang, Jianhua Z; Gao, Xin
2017-01-01
Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of the protein structures based on the differences and similarities between different Ramachandran plots. Despite the presence of statistical methods for modeling angular data of proteins, there is still a substantial need for more sophisticated and faster statistical tools to model the large-scale circular datasets. To address this need, we have developed a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method takes into account the circular nature of the angular data using trigonometric spline which is more efficient compared to existing methods. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. Moreover, the coefficients of adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method provides a novel and unique perspective to two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.
Protein classification using probabilistic chain graphs and the Gene Ontology structure.
Carroll, Steven; Pavlovic, Vladimir
2006-08-01
Probabilistic graphical models have been developed in the past for the task of protein classification. In many cases, classifications obtained from the Gene Ontology have been used to validate these models. In this work we directly incorporate the structure of the Gene Ontology into the graphical representation for protein classification. We present a method in which each protein is represented by a replicate of the Gene Ontology structure, effectively modeling each protein in its own 'annotation space'. Proteins are also connected to one another according to different measures of functional similarity, after which belief propagation is run to make predictions at all ontology terms. The proposed method was evaluated on a set of 4879 proteins from the Saccharomyces Genome Database whose interactions were also recorded in the GRID project. Results indicate that direct utilization of the Gene Ontology improves predictive ability, outperforming traditional models that do not take advantage of dependencies among functional terms. Average increase in accuracy (precision) of positive and negative term predictions of 27.8% (2.0%) over three different similarity measures and three subontologies was observed. C/C++/Perl implementation is available from authors upon request.
A consensus view of fold space: Combining SCOP, CATH, and the Dali Domain Dictionary
Day, Ryan; Beck, David A.C.; Armen, Roger S.; Daggett, Valerie
2003-01-01
We have determined consensus protein-fold classifications on the basis of three classification methods, SCOP, CATH, and Dali. These classifications make use of different methods of defining and categorizing protein folds that lead to different views of protein-fold space. Pairwise comparisons of domains on the basis of their fold classifications show that much of the disagreement between the classification systems is due to differing domain definitions rather than assigning the same domain to different folds. However, there are significant differences in the fold assignments between the three systems. These remaining differences can be explained primarily in terms of the breadth of the fold classifications. Many structures may be defined as having one fold in one system, whereas far fewer are defined as having the analogous fold in another system. By comparing these folds for a nonredundant set of proteins, the consensus method breaks up broad fold classifications and combines restrictive fold classifications into metafolds, creating, in effect, an averaged view of fold space. This averaged view requires that the structural similarities between proteins having the same metafold be recognized by multiple classification systems. Thus, the consensus map is useful for researchers looking for fold similarities that are relatively independent of the method used to compare proteins. The 30 most populated metafolds, representing the folds of about half of a nonredundant subset of the PDB, are presented here. The full list of metafolds is presented on the Web. PMID:14500873
A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary.
Day, Ryan; Beck, David A C; Armen, Roger S; Daggett, Valerie
2003-10-01
We have determined consensus protein-fold classifications on the basis of three classification methods, SCOP, CATH, and Dali. These classifications make use of different methods of defining and categorizing protein folds that lead to different views of protein-fold space. Pairwise comparisons of domains on the basis of their fold classifications show that much of the disagreement between the classification systems is due to differing domain definitions rather than assigning the same domain to different folds. However, there are significant differences in the fold assignments between the three systems. These remaining differences can be explained primarily in terms of the breadth of the fold classifications. Many structures may be defined as having one fold in one system, whereas far fewer are defined as having the analogous fold in another system. By comparing these folds for a nonredundant set of proteins, the consensus method breaks up broad fold classifications and combines restrictive fold classifications into metafolds, creating, in effect, an averaged view of fold space. This averaged view requires that the structural similarities between proteins having the same metafold be recognized by multiple classification systems. Thus, the consensus map is useful for researchers looking for fold similarities that are relatively independent of the method used to compare proteins. The 30 most populated metafolds, representing the folds of about half of a nonredundant subset of the PDB, are presented here. The full list of metafolds is presented on the Web.
The history of the CATH structural classification of protein domains.
Sillitoe, Ian; Dawson, Natalie; Thornton, Janet; Orengo, Christine
2015-12-01
This article presents a historical review of the protein structure classification database CATH. Together with the SCOP database, CATH remains comprehensive and reasonably up-to-date with the now more than 100,000 protein structures in the PDB. We review the expansion of the CATH and SCOP resources to capture predicted domain structures in the genome sequence data and to provide information on the likely functions of proteins mediated by their constituent domains. The establishment of comprehensive function annotation resources has also meant that domain families can be functionally annotated allowing insights into functional divergence and evolution within protein families. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.
Shamim, Mohammad Tabrez Anwar; Anwaruddin, Mohammad; Nagarajaram, H A
2007-12-15
Fold recognition is a key step in the protein structure discovery process, especially when traditional sequence comparison methods fail to yield convincing structural homologies. Although many methods have been developed for protein fold recognition, their accuracies remain low. This can be attributed to insufficient exploitation of fold discriminatory features. We have developed a new method for protein fold recognition using structural information of amino acid residues and amino acid residue pairs. Since protein fold recognition can be treated as a protein fold classification problem, we have developed a Support Vector Machine (SVM) based classifier approach that uses secondary structural state and solvent accessibility state frequencies of amino acids and amino acid pairs as feature vectors. Among the individual properties examined secondary structural state frequencies of amino acids gave an overall accuracy of 65.2% for fold discrimination, which is better than the accuracy by any method reported so far in the literature. Combination of secondary structural state frequencies with solvent accessibility state frequencies of amino acids and amino acid pairs further improved the fold discrimination accuracy to more than 70%, which is approximately 8% higher than the best available method. In this study we have also tested, for the first time, an all-together multi-class method known as Crammer and Singer method for protein fold classification. Our studies reveal that the three multi-class classification methods, namely one versus all, one versus one and Crammer and Singer method, yield similar predictions. Dataset and stand-alone program are available upon request.
Elman RNN based classification of proteins sequences on account of their mutual information.
Mishra, Pooja; Nath Pandey, Paras
2012-10-21
In the present work we have employed the method of estimating residue correlation within the protein sequences, by using the mutual information (MI) of adjacent residues, based on structural and solvent accessibility properties of amino acids. The long range correlation between nonadjacent residues is improved by constructing a mutual information vector (MIV) for a single protein sequence; in this way each protein sequence is associated with its corresponding MIVs. These MIVs are given to an Elman RNN to obtain the classification of protein sequences. The modeling power of MIV was shown to be significantly better, giving a new approach towards alignment-free classification of protein sequences. We also conclude that sequence structural and solvent accessible property based MIVs are better predictors. Copyright © 2012 Elsevier Ltd. All rights reserved.
Re-visiting protein-centric two-tier classification of existing DNA-protein complexes
2012-01-01
Background: Precise DNA-protein interactions play a vital role in maintaining the normal physiological functioning of the cell, as they control many high fidelity cellular processes. Detailed study of the nature of these interactions has paved the way for understanding the mechanisms behind the biological processes in which they are involved. Earlier in 2000, a systematic classification of DNA-protein complexes based on the structural analysis of the proteins was proposed at two tiers, namely groups and families. With the advancement in the number and resolution of structures of DNA-protein complexes deposited in the Protein Data Bank, it is important to revisit the existing classification. Results: On the basis of the sequence analysis of DNA binding proteins, we have built upon the protein centric, two-tier classification of DNA-protein complexes by adding new members to existing families and making new families and groups. While classifying the new complexes, we also realised the emergence of new groups and families. The new group observed was where β-propeller was seen to interact with DNA. There were 34 SCOP folds which were observed to be present in the complexes of both old and new classifications, whereas 28 folds are present exclusively in the new complexes. Some new families noticed were NarL transcription factor, Z-α DNA binding proteins, Forkhead transcription factor, AP2 protein, Methyl CpG binding protein etc. Conclusions: Our results suggest that with the increasing availability of DNA-protein complexes in the Protein Data Bank, the number of families in the classification increased approximately threefold. The folds present exclusively in newly classified complexes are suggestive of the inclusion of proteins with new functions in the new classification, the most populated of which are the folds responsible for DNA damage repair.
The proposed re-visited classification can be used to perform genome-wide surveys in the genomes of interest for the presence of DNA-binding proteins. Further analysis of these complexes can aid in developing algorithms for identifying DNA-binding proteins and their family members from mere sequence information. PMID:22800292
Re-visiting protein-centric two-tier classification of existing DNA-protein complexes.
Malhotra, Sony; Sowdhamini, Ramanathan
2012-07-16
Precise DNA-protein interactions play a vital role in maintaining the normal physiological functioning of the cell, as they control many high fidelity cellular processes. Detailed study of the nature of these interactions has paved the way for understanding the mechanisms behind the biological processes in which they are involved. Earlier in 2000, a systematic classification of DNA-protein complexes based on the structural analysis of the proteins was proposed at two tiers, namely groups and families. With the advancement in the number and resolution of structures of DNA-protein complexes deposited in the Protein Data Bank, it is important to revisit the existing classification. On the basis of the sequence analysis of DNA binding proteins, we have built upon the protein centric, two-tier classification of DNA-protein complexes by adding new members to existing families and making new families and groups. While classifying the new complexes, we also realised the emergence of new groups and families. The new group observed was where β-propeller was seen to interact with DNA. There were 34 SCOP folds which were observed to be present in the complexes of both old and new classifications, whereas 28 folds are present exclusively in the new complexes. Some new families noticed were NarL transcription factor, Z-α DNA binding proteins, Forkhead transcription factor, AP2 protein, Methyl CpG binding protein etc. Our results suggest that with the increasing availability of DNA-protein complexes in the Protein Data Bank, the number of families in the classification increased approximately threefold. The folds present exclusively in newly classified complexes are suggestive of the inclusion of proteins with new functions in the new classification, the most populated of which are the folds responsible for DNA damage repair.
The proposed re-visited classification can be used to perform genome-wide surveys in the genomes of interest for the presence of DNA-binding proteins. Further analysis of these complexes can aid in developing algorithms for identifying DNA-binding proteins and their family members from mere sequence information.
An overview of the structures of protein-DNA complexes
Luscombe, Nicholas M; Austin, Susan E; Berman, Helen M; Thornton, Janet M
2000-01-01
On the basis of a structural analysis of 240 protein-DNA complexes contained in the Protein Data Bank (PDB), we have classified the DNA-binding proteins involved into eight different structural/functional groups, which are further classified into 54 structural families. Here we present this classification and review the functions, structures and binding interactions of these protein-DNA complexes. PMID:11104519
Support vector machine based classification of fast Fourier transform spectroscopy of proteins
NASA Astrophysics Data System (ADS)
Lazarevic, Aleksandar; Pokrajac, Dragoljub; Marcano, Aristides; Melikechi, Noureddine
2009-02-01
Fast Fourier transform spectroscopy has proved to be a powerful method for study of the secondary structure of proteins since peak positions and their relative amplitude are affected by the number of hydrogen bridges that sustain this secondary structure. However, to our best knowledge, the method has not been used yet for identification of proteins within a complex matrix like a blood sample. The principal reason is the apparent similarity of protein infrared spectra with actual differences usually masked by the solvent contribution and other interactions. In this paper, we propose a novel machine learning based method that uses protein spectra for classification and identification of such proteins within a given sample. The proposed method uses principal component analysis (PCA) to identify most important linear combinations of original spectral components and then employs support vector machine (SVM) classification model applied on such identified combinations to categorize proteins into one of given groups. Our experiments have been performed on the set of four different proteins, namely: Bovine Serum Albumin, Leptin, Insulin-like Growth Factor 2 and Osteopontin. Our proposed method of applying principal component analysis along with support vector machines exhibits excellent classification accuracy when identifying proteins using their infrared spectra.
Applying graph theory to protein structures: an atlas of coiled coils.
Heal, Jack W; Bartlett, Gail J; Wood, Christopher W; Thomson, Andrew R; Woolfson, Derek N
2018-05-02
To understand protein structure, folding and function fully and to design proteins de novo reliably, we must learn from natural protein structures that have been characterised experimentally. The number of protein structures available is large and growing exponentially, which makes this task challenging. Indeed, computational resources are becoming increasingly important for classifying and analysing this resource. Here, we use tools from graph theory to define an atlas classification scheme for automatically categorising certain protein substructures. Focusing on the α-helical coiled coils, which are ubiquitous protein-structure and protein-protein interaction motifs, we present a suite of computational resources designed for analysing these assemblies. iSOCKET enables interactive analysis of side-chain packing within proteins to identify coiled coils automatically and with considerable user control. Applying a graph theory-based atlas classification scheme to structures identified by iSOCKET gives the Atlas of Coiled Coils, a fully automated, updated overview of extant coiled coils. The utility of this approach is illustrated with the first formal classification of an emerging subclass of coiled coils called α-helical barrels. Furthermore, in the Atlas, the known coiled-coil universe is presented alongside a partial enumeration of the 'dark matter' of coiled-coil structures; i.e., those coiled-coil architectures that are theoretically possible but have not been observed to date, and thus present defined targets for protein design. iSOCKET is available as part of the open-source GitHub repository associated with this work (https://github.com/woolfson-group/isocket). This repository also contains all the data generated when classifying the protein graphs. The Atlas of Coiled Coils is available at: http://coiledcoils.chm.bris.ac.uk/atlas/app.
Acyl carrier protein structural classification and normal mode analysis
Cantu, David C; Forrester, Michael J; Charov, Katherine; Reilly, Peter J
2012-01-01
All acyl carrier protein primary and tertiary structures were gathered into the ThYme database. They are classified into 16 families by amino acid sequence similarity, with members of the different families having sequences with statistically highly significant differences. These classifications are supported by tertiary structure superposition analysis. Tertiary structures from a number of families are very similar, suggesting that these families may come from a single distant ancestor. Normal vibrational mode analysis was conducted on experimentally determined freestanding structures, showing greater fluctuations at chain termini and loops than in most helices. Their modes overlap more so within families than between different families. The tertiary structures of three acyl carrier protein families that lacked any known structures were predicted as well. PMID:22374859
T-RMSD: a web server for automated fine-grained protein structural classification.
Magis, Cedrik; Di Tommaso, Paolo; Notredame, Cedric
2013-07-01
This article introduces the T-RMSD web server (tree-based on root-mean-square deviation), a service allowing the online computation of structure-based protein classification. It has been developed to address the relation between structural and functional similarity in proteins, and it allows a fine-grained structural clustering of a given protein family or group of structurally related proteins using distance RMSD (dRMSD) variations. These distances are computed between all pairs of equivalent residues, as defined by the ungapped columns within a given multiple sequence alignment. Using these generated distance matrices (one per equivalent position), T-RMSD produces a structural tree with support values for each cluster node, reminiscent of bootstrap values. These values, associated with the tree topology, allow a quantitative estimate of structural distances between proteins or group of proteins defined by the tree topology. The clusters thus defined have been shown to be structurally and functionally informative. The T-RMSD web server is a free website open to all users and available at http://tcoffee.crg.cat/apps/tcoffee/do:trmsd.
T-RMSD: a web server for automated fine-grained protein structural classification
Magis, Cedrik; Di Tommaso, Paolo; Notredame, Cedric
2013-01-01
This article introduces the T-RMSD web server (tree-based on root-mean-square deviation), a service allowing the online computation of structure-based protein classification. It has been developed to address the relation between structural and functional similarity in proteins, and it allows a fine-grained structural clustering of a given protein family or group of structurally related proteins using distance RMSD (dRMSD) variations. These distances are computed between all pairs of equivalent residues, as defined by the ungapped columns within a given multiple sequence alignment. Using these generated distance matrices (one per equivalent position), T-RMSD produces a structural tree with support values for each cluster node, reminiscent of bootstrap values. These values, associated with the tree topology, allow a quantitative estimate of structural distances between proteins or group of proteins defined by the tree topology. The clusters thus defined have been shown to be structurally and functionally informative. The T-RMSD web server is a free website open to all users and available at http://tcoffee.crg.cat/apps/tcoffee/do:trmsd. PMID:23716642
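As a rough illustration of the distance-based comparison mentioned above, the sketch below computes a plain distance RMSD (dRMSD) between two structures over the same set of equivalent residues. This is the generic textbook quantity, not the server's per-position tree-building procedure; the function names and toy coordinates are assumptions made for the example.

```python
import math

def pairwise_distances(coords):
    """Intramolecular distance matrix for a list of (x, y, z) points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def drmsd(coords_a, coords_b):
    """dRMSD over equivalent residues; no superposition is needed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must list the same equivalent residues")
    n = len(coords_a)
    if n < 2:
        raise ValueError("at least two residues are required")
    dist_a = pairwise_distances(coords_a)
    dist_b = pairwise_distances(coords_b)
    squared = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            squared += (dist_a[i][j] - dist_b[i][j]) ** 2
            pairs += 1
    return math.sqrt(squared / pairs)

# Toy coordinates (hypothetical): b is a translated copy of a, c is a scaled by 2.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
b = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0)]
c = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(drmsd(a, b), drmsd(a, c))
```

Because only internal distances are compared, a rigid translation or rotation of one structure leaves the value unchanged, which is why no structural superposition step appears in the sketch.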
Pascual-García, Alberto; Abia, David; Ortiz, Angel R; Bastolla, Ugo
2009-03-01
Structural classifications of proteins assume the existence of the fold, which is an intrinsic equivalence class of protein domains. Here, we test in which conditions such an equivalence class is compatible with objective similarity measures. We base our analysis on the transitive property of the equivalence relationship, requiring that similarity of A with B and B with C implies that A and C are also similar. Divergent gene evolution leads us to expect that the transitive property should approximately hold. However, if protein domains are a combination of recurrent short polypeptide fragments, as proposed by several authors, then similarity of partial fragments may violate the transitive property, favouring the continuous view of the protein structure space. We propose a measure to quantify the violations of the transitive property when a clustering algorithm joins elements into clusters, and we find out that such violations present a well defined and detectable cross-over point, from an approximately transitive regime at high structure similarity to a regime with large transitivity violations and large differences in length at low similarity. We argue that protein structure space is discrete and hierarchic classification is justified up to this cross-over point, whereas at lower similarities the structure space is continuous and it should be represented as a network. We have tested the qualitative behaviour of this measure, varying all the choices involved in the automatic classification procedure, i.e., domain decomposition, alignment algorithm, similarity score, and clustering algorithm, and we have found out that this behaviour is quite robust. The final classification depends on the chosen algorithms. We used the values of the clustering coefficient and the transitivity violations to select the optimal choices among those that we tested. Interestingly, this criterion also favours the agreement between automatic and expert classifications. 
As a domain set, we have selected a consensus set of 2,890 domains decomposed very similarly in SCOP and CATH. As an alignment algorithm, we used a global version of MAMMOTH developed in our group, which is both rapid and accurate. As a similarity measure, we used the size-normalized contact overlap, and as a clustering algorithm, we used average linkage. The resulting automatic classification at the cross-over point was more consistent than expert ones with respect to the structure similarity measure, with 86% of the clusters corresponding to subsets of either SCOP or CATH superfamilies and fewer than 5% containing domains in distinct folds according to both SCOP and CATH. Almost 15% of SCOP superfamilies and 10% of CATH superfamilies were split, consistent with the notion of fold change in protein evolution. These results were qualitatively robust for all choices that we tested, although we did not try to use alignment algorithms developed by other groups. Folds defined in SCOP and CATH would be completely joined in the regime of large transitivity violations where clustering is more arbitrary. Consistently, the agreement between SCOP and CATH at fold level was lower than their agreement with the automatic classification obtained using as a clustering algorithm, respectively, average linkage (for SCOP) or single linkage (for CATH). The networks representing significant evolutionary and structural relationships between clusters beyond the cross-over point may allow us to perform evolutionary, structural, or functional analyses beyond the limits of classification schemes. These networks and the underlying clusters are available at http://ub.cbm.uam.es/research/ProtNet.php.
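The transitivity requirement discussed above (similarity of A with B and of B with C should imply similarity of A with C) can be made concrete with a simple triple count. The abstract does not spell out the authors' actual measure, so the proxy below, the threshold, and the toy similarity matrix are all assumptions for illustration only.

```python
from itertools import combinations

def transitivity_violations(sim, threshold):
    """Count triples with exactly two of three pairs above the threshold.

    Returns (violations, connected), where `connected` is the number of
    triples with at least two similar pairs, i.e. triples that could be
    tested for transitivity at all.
    """
    violations = 0
    connected = 0
    for a, b, c in combinations(range(len(sim)), 3):
        similar_pairs = sum(
            score >= threshold
            for score in (sim[a][b], sim[b][c], sim[a][c])
        )
        if similar_pairs >= 2:
            connected += 1
            if similar_pairs == 2:
                violations += 1
    return violations, connected

# Toy symmetric similarity matrix (hypothetical values in [0, 1]).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.1],
    [0.2, 0.8, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
print(transitivity_violations(sim, 0.5))   # strict threshold
print(transitivity_violations(sim, 0.15))  # permissive threshold
```

Scanning the threshold from high to low and watching the ratio of violations to connected triples is one simple way to look for the kind of cross-over behaviour the abstract describes.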
A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3
Dietmann, Sabine; Park, Jong; Notredame, Cedric; Heger, Andreas; Lappe, Michael; Holm, Liisa
2001-01-01
The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank (PDB). The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities. Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to: (i) supersecondary structural motifs (attractors in fold space), (ii) the topology of globular domains (fold types), (iii) remote homologues (functional families) and (iv) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new. In September 2000, the Dali classification contained 10 531 PDB entries comprising 17 101 chains, which were partitioned into five attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99 582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures several-fold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and a comprehensive library of explicit multiple alignments of distantly related protein families. PMID:11125048
Dong, Yadong; Sun, Yongqi; Qin, Chao
2018-01-01
The existing protein complex detection methods can be broadly divided into two categories: unsupervised and supervised learning methods. Most of the unsupervised learning methods assume that protein complexes are in dense regions of protein-protein interaction (PPI) networks even though many true complexes are not dense subgraphs. Supervised learning methods utilize the informative properties of known complexes; they often extract features from existing complexes and then use the features to train a classification model. The trained model is used to guide the search process for new complexes. However, insufficient extracted features, noise in the PPI data and the incompleteness of complex data make the classification model imprecise. Consequently, the classification model is not sufficient for guiding the detection of complexes. Therefore, we propose a new robust score function that combines the classification model with local structural information. Based on the score function, we provide a search method that works both forwards and backwards. The results from experiments on six benchmark PPI datasets and three protein complex datasets show that our approach can achieve better performance compared with the state-of-the-art supervised, semi-supervised and unsupervised methods for protein complex detection, occasionally significantly outperforming such methods.
Improved protein surface comparison and application to low-resolution protein structure data.
Sael, Lee; Kihara, Daisuke
2010-12-14
Recent advancements of experimental techniques for determining protein tertiary structures raise significant challenges for protein bioinformatics. With the number of known structures of unknown function expanding at a rapid pace, an urgent task is to provide reliable clues to their biological function on a large scale. Conventional approaches for structure comparison are not suitable for a real-time database search due to their slow speed. Moreover, a new challenge has arisen from recent techniques such as electron microscopy (EM), which provide low-resolution structure data. Previously, we have introduced a method for protein surface shape representation using the 3D Zernike descriptors (3DZDs). The 3DZD enables fast structure database searches, taking advantage of its rotation invariance and compact representation. The search results of protein surfaces represented with the 3DZD have shown good agreement with the existing structure classifications, but some discrepancies were also observed. Three surface representations are examined: that of backbone atoms, the originally devised all-atom surface representation, and the combination of the all-atom surface with the backbone representation. All representations are encoded with the 3DZD. Also, we have investigated the applicability of the 3DZD for searching protein EM density maps of varying resolutions. The surface representations are evaluated on structure retrieval using two existing classifications, SCOP and the CE-based classification. Overall, the 3DZDs representing backbone atoms show better retrieval performance than the original all-atom surface representation. Performance improved further when the two representations were combined. Moreover, we observed that the 3DZD is also powerful in comparing low-resolution structures obtained by electron microscopy.
Multi-label literature classification based on the Gene Ontology graph.
Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua
2008-12-08
The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
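One way to see how graph structure can credit "closely related" predictions is to expand each set of terms with its ancestors before comparing. The sketch below is not one of the three algorithms studied in the paper; it is a generic ancestor-closure comparison, and the term names and parent map are toy placeholders rather than real Gene Ontology identifiers.

```python
def with_ancestors(terms, parents):
    """Return the given terms plus every ancestor reachable via `parents`."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed

def hierarchical_overlap(predicted, truth, parents):
    """Jaccard overlap of the ancestor-closed predicted and true term sets."""
    pred_closed = with_ancestors(predicted, parents)
    true_closed = with_ancestors(truth, parents)
    union = pred_closed | true_closed
    if not union:
        return 0.0
    return len(pred_closed & true_closed) / len(union)

# Toy ontology (hypothetical): two sibling leaves under one mid-level term.
parents = {"leaf1": ["mid"], "leaf2": ["mid"], "mid": ["root"]}
print(sorted(with_ancestors({"leaf1"}, parents)))
print(hierarchical_overlap({"leaf2"}, {"leaf1"}, parents))
```

A flat comparison would score the sibling prediction as a complete miss, whereas the ancestor-closed comparison gives partial credit for the shared parent and root.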
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cang, Zixuan; Mu, Lin; Wu, Kedi
A protein's function and dynamics are closely related to its sequence and structure. However, prediction of protein function and dynamics from sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done by measuring the similarity between proteins based on sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.
Improved protein surface comparison and application to low-resolution protein structure data
2010-01-01
Background Recent advancements of experimental techniques for determining protein tertiary structures raise significant challenges for protein bioinformatics. With the number of known structures of unknown function expanding at a rapid pace, an urgent task is to provide reliable clues to their biological function on a large scale. Conventional approaches for structure comparison are not suitable for a real-time database search due to their slow speed. Moreover, a new challenge has arisen from recent techniques such as electron microscopy (EM), which provide low-resolution structure data. Previously, we have introduced a method for protein surface shape representation using the 3D Zernike descriptors (3DZDs). The 3DZD enables fast structure database searches, taking advantage of its rotation invariance and compact representation. The search results of protein surfaces represented with the 3DZD have shown good agreement with the existing structure classifications, but some discrepancies were also observed. Results Three surface representations are examined: that of backbone atoms, the originally devised all-atom surface representation, and the combination of the all-atom surface with the backbone representation. All representations are encoded with the 3DZD. Also, we have investigated the applicability of the 3DZD for searching protein EM density maps of varying resolutions. The surface representations are evaluated on structure retrieval using two existing classifications, SCOP and the CE-based classification. Conclusions Overall, the 3DZDs representing backbone atoms show better retrieval performance than the original all-atom surface representation. Performance improved further when the two representations were combined. Moreover, we observed that the 3DZD is also powerful in comparing low-resolution structures obtained by electron microscopy. PMID:21172052
Automatic classification of protein structures relying on similarities between alignments
2012-01-01
Background Identification of protein structural cores requires isolating sets of proteins that all share the same subset of structural motifs. In the context of an ever-growing number of available 3D protein structures, standard automatic clustering algorithms require adaptations to allow for efficient identification of such sets of proteins. Results A pair of 3D structures is declared similar or not according to the local similarities of their matching substructures in a structural alignment. This binary relation can be represented in a graph of similarities where a node represents a 3D protein structure and an edge states that two 3D protein structures are similar. Therefore, classifying proteins into structural families can be viewed as a graph clustering task. Unfortunately, because such a graph encodes only pairwise similarity information, clustering algorithms may include in the same cluster a subset of 3D structures that do not share a common substructure. To overcome this drawback, we first define a ternary similarity on a triple of 3D structures as a constraint to be satisfied by the graph of similarities. Such a ternary constraint takes into account similarities between pairwise alignments, so as to ensure that the three protein structures involved do have some common substructure. We propose a modification algorithm that eliminates edges from the original graph of similarities and gives a reduced graph in which no ternary constraints are violated. Our approach is first to build a graph of similarities, then to reduce the graph according to the modification algorithm, and finally to apply a standard graph clustering algorithm to the reduced graph. This method was used for classifying ASTRAL-40 non-redundant protein domains, identifying significant pairwise similarities with Yakusa, a program devised for rapid 3D structure alignments.
Conclusions We show that filtering similarities prior to standard graph based clustering process by applying ternary similarity constraints i) improves the separation of proteins of different classes and consequently ii) improves the classification quality of standard graph based clustering algorithms according to the reference classification SCOP. PMID:22974051
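The idea of a ternary constraint can be sketched as a consistency check across three pairwise alignments: a residue of A counts toward the common substructure only if its partners in B and C are also aligned to each other. The abstract does not give the authors' exact definition, so the representation (alignments as residue-index dictionaries), the function names, and the size cutoff below are assumptions for illustration.

```python
def common_core(aln_ab, aln_ac, aln_bc):
    """Residue triples (a, b, c) that are mutually aligned in all three alignments.

    Each alignment is a dict mapping a residue index in the first structure
    to its aligned residue index in the second structure.
    """
    core = []
    for a, b in aln_ab.items():
        c = aln_ac.get(a)
        # Keep the triple only if B's residue is aligned to the same C residue.
        if c is not None and aln_bc.get(b) == c:
            core.append((a, b, c))
    return core

def satisfies_ternary(aln_ab, aln_ac, aln_bc, min_size=3):
    """True if the three structures share at least `min_size` consistent positions."""
    return len(common_core(aln_ab, aln_ac, aln_bc)) >= min_size

# Toy alignments (hypothetical residue indices).
aln_ab = {1: 11, 2: 12, 3: 13, 4: 14}
aln_ac = {2: 22, 3: 23, 4: 24, 5: 25}
aln_bc = {12: 22, 13: 23, 14: 30}
print(common_core(aln_ab, aln_ac, aln_bc))
print(satisfies_ternary(aln_ab, aln_ac, aln_bc, min_size=2))
```

In a filtering step built on such a check, a triangle of pairwise-similar structures that fails the test would lose one of its edges before clustering; which edge to drop is a separate design decision not shown here.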
Kinjo, Akira R.; Nakamura, Haruki
2012-01-01
Comparison and classification of protein structures are fundamental means to understand protein functions. Due to the computational difficulty and the ever-increasing amount of structural data, however, it is in general not feasible to perform exhaustive all-against-all structure comparisons necessary for comprehensive classifications. To efficiently handle such situations, we have previously proposed a method, now called GIRAF. We herein describe further improvements in the GIRAF protein structure search and alignment method. The GIRAF method achieves extremely efficient search of similar structures of ligand binding sites of proteins by exploiting database indexing of structural features of local coordinate frames. In addition, it produces refined atom-wise alignments by iterative applications of the Hungarian method to the bipartite graph defined for a pair of superimposed structures. By combining the refined alignments based on different local coordinate frames, it is made possible to align structures involving domain movements. We provide detailed accounts for the database design, the search and alignment algorithms as well as some benchmark results. PMID:27493524
NOXclass: prediction of protein-protein interaction types.
Zhu, Hongbo; Domingues, Francisco S; Sommer, Ingolf; Lengauer, Thomas
2006-01-19
Structural models determined by X-ray crystallography play a central role in understanding protein-protein interactions at the molecular level. Interpretation of these models requires the distinction between non-specific crystal packing contacts and biologically relevant interactions. This has been investigated previously and classification approaches have been proposed. However, less attention has been devoted to distinguishing different types of biological interactions. These interactions are classified as obligate and non-obligate according to the effect of the complex formation on the stability of the protomers. So far no automatic classification methods for distinguishing obligate, non-obligate and crystal packing interactions have been made available. Six interface properties have been investigated on a dataset of 243 protein interactions. The six properties have been combined using a support vector machine algorithm, resulting in NOXclass, a classifier for distinguishing obligate, non-obligate and crystal packing interactions. We achieve an accuracy of 91.8% for the classification of these three types of interactions using a leave-one-out cross-validation procedure. NOXclass allows the interpretation and analysis of protein quaternary structures. In particular, it generates testable hypotheses regarding the nature of protein-protein interactions, when experimental results are not available. We expect this server will benefit the users of protein structural models, as well as protein crystallographers and NMR spectroscopists. A web server based on the method and the datasets used in this study are available at http://noxclass.bioinf.mpi-inf.mpg.de/.
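The leave-one-out cross-validation procedure mentioned above can be sketched in a few lines. To keep the example dependency-free, a nearest-centroid classifier stands in for the support vector machine; it is not the NOXclass model, and the two-dimensional feature vectors and labels are toy values rather than real interface properties.

```python
import math
from collections import defaultdict

def nearest_centroid_predict(train_x, train_y, query):
    """Predict the label whose class centroid is closest to `query`."""
    sums = {}
    counts = defaultdict(int)
    for features, label in zip(train_x, train_y):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + v for s, v in zip(sums[label], features)]
        counts[label] += 1
    best_label, best_distance = None, math.inf
    for label, total in sums.items():
        centroid = [s / counts[label] for s in total]
        distance = math.dist(centroid, query)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label

def leave_one_out_accuracy(xs, ys, predict):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Toy data (hypothetical): two well-separated classes of interface features.
xs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ys = ["crystal"] * 3 + ["obligate"] * 3
print(leave_one_out_accuracy(xs, ys, nearest_centroid_predict))
```

Leave-one-out suits small datasets such as the 243 interactions described above, since every sample is used for testing exactly once while the model is always trained on all remaining samples.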
Protein structure database search and evolutionary classification.
Yang, Jinn-Moon; Tung, Chi-Hua
2006-01-01
As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].
Neumann, Sindy; Hartmann, Holger; Martin-Galiano, Antonio J; Fuchs, Angelika; Frishman, Dmitrij
2012-03-01
Structural bioinformatics of membrane proteins is still in its infancy, and the picture of their fold space is only beginning to emerge. Because only a handful of three-dimensional structures are available, sequence comparison and structure prediction remain the main tools for investigating sequence-structure relationships in membrane protein families. Here we present a comprehensive analysis of the structural families corresponding to α-helical membrane proteins with at least three transmembrane helices. The new version of our CAMPS database (CAMPS 2.0) covers nearly 1300 eukaryotic, prokaryotic, and viral genomes. Using an advanced classification procedure, which is based on high-order hidden Markov models and considers both sequence similarity as well as the number of transmembrane helices and loop lengths, we identified 1353 structurally homogeneous clusters roughly corresponding to membrane protein folds. Only 53 clusters are associated with experimentally determined three-dimensional structures, and for these clusters CAMPS is in reasonable agreement with structure-based classification approaches such as SCOP and CATH. We therefore estimate that ∼1300 structures would need to be determined to provide a sufficient structural coverage of polytopic membrane proteins. CAMPS 2.0 is available at http://webclu.bio.wzw.tum.de/CAMPS2.0/. Copyright © 2011 Wiley Periodicals, Inc.
Mining sequential patterns for protein fold recognition.
Exarchos, Themis P; Papaloukas, Costas; Lampros, Christos; Fotiadis, Dimitrios I
2008-02-01
Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. Specifically, one of the most efficient SPM algorithms, cSPADE, is employed for the analysis of protein sequence. A classifier uses the extracted sequential patterns to classify proteins in the appropriate fold category. For training and evaluating the proposed method we used the protein sequences from the Protein Data Bank and the annotation of the SCOP database. The method exhibited an overall accuracy of 25% in a classification problem with 36 candidate categories. The classification performance reaches up to 56% when the five most probable protein folds are considered.
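The two figures quoted above differ only in how many ranked candidates are allowed to count as a hit. A minimal sketch of that top-k accuracy calculation follows; the score dictionaries and fold labels are toy values, not results from the paper.

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scoring labels."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Rank candidate labels from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Toy classifier scores (hypothetical) for three proteins over three folds.
rows = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.1, "b": 0.6, "c": 0.3},
    {"a": 0.4, "b": 0.35, "c": 0.25},
]
truth = ["a", "c", "c"]
for k in (1, 2, 3):
    print(k, top_k_accuracy(rows, truth, k))
```

With k equal to the number of categories the value is always 1, so top-k figures are only informative when k is small relative to the number of candidate folds (five out of 36 in the study above).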
Structural classification of small, disulfide-rich protein domains.
Cheek, Sara; Krishna, S Sri; Grishin, Nick V
2006-05-26
Disulfide-rich domains are small protein domains whose global folds are stabilized primarily by the formation of disulfide bonds and, to a much lesser extent, by secondary structure and hydrophobic interactions. Disulfide-rich domains perform a wide variety of roles functioning as growth factors, toxins, enzyme inhibitors, hormones, pheromones, allergens, etc. These domains are commonly found both as independent (single-domain) proteins and as domains within larger polypeptides. Here, we present a comprehensive structural classification of approximately 3000 small, disulfide-rich protein domains. We find that these domains can be arranged into 41 fold groups on the basis of structural similarity. Our fold groups, which describe broader structural relationships than existing groupings of these domains, bring together representatives with previously unacknowledged similarities; 18 of the 41 fold groups include domains from several SCOP folds. Within the fold groups, the domains are assembled into families of homologs. We define 98 families of disulfide-rich domains, some of which include newly detected homologs, particularly among knottin-like domains. On the basis of this classification, we have examined cases of convergent and divergent evolution of functions performed by disulfide-rich proteins. Disulfide bonding patterns in these domains are also evaluated. Reducible disulfide bonding patterns are much less frequent, while symmetric disulfide bonding patterns are more common than expected from random considerations. Examples of variations in disulfide bonding patterns found within families and fold groups are discussed.
Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics
Faye, Ibrahima; Samir, Brahim Belhaouari; Md Said, Abas
2014-01-01
Bioinformatics has been an emerging area of research for the last three decades. The ultimate aims of bioinformatics are to store and manage biological data, and to develop and analyze computational tools that enhance our understanding of them. The size of data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for experimental methods. To reduce the gap between newly sequenced proteins and proteins with known functions, many computational techniques involving classification and clustering algorithms have been proposed. The classification of protein sequences into existing superfamilies is helpful in predicting the structure and function of the large number of newly discovered proteins. The existing classification results are unsatisfactory due to the huge number of features obtained through various feature encoding methods. In this work, a statistical metric-based feature selection technique is proposed in order to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance measure metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth. PMID:25045727
Predicting β-Turns in Protein Using Kernel Logistic Regression
Elbashir, Murtada Khalafallah; Sheng, Yu; Wang, Jianxin; Wu, FangXiang; Li, Min
2013-01-01
A β-turn is a secondary protein structure type that plays a significant role in protein configuration and function. On average 25% of amino acids in protein structures are located in β-turns. It is very important to develop an accurate and efficient method for β-turn prediction. Most of the current successful β-turn prediction methods use support vector machines (SVMs) or neural networks (NNs). Kernel logistic regression (KLR) is a powerful classification technique that has been applied successfully in many classification problems. However, it is rarely used for β-turn classification, mainly because it is computationally expensive. In this paper, we used KLR to obtain sparse β-turns prediction in short evolution time. Secondary structure information and position-specific scoring matrices (PSSMs) are utilized as input features. We achieved a Q total of 80.7% and an MCC of 50% on the BT426 dataset. These results show that the KLR method with the right algorithm can yield performance equivalent to or even better than NNs and SVMs in β-turn prediction. In addition, KLR yields a probabilistic outcome and has a well-defined extension to the multiclass case. PMID:23509793
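The two scores reported above follow from a binary confusion matrix. The sketch below uses the standard definitions: Q total as the percentage of correctly predicted residues, and the Matthews correlation coefficient (MCC) on its usual scale from -1 to 1. The labels are toy values, not BT426 data.

```python
import math

def confusion_counts(true_labels, predicted_labels):
    """Return (tp, tn, fp, fn) for binary labels where 1 marks a β-turn residue."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def q_total(tp, tn, fp, fn):
    """Percentage of residues predicted correctly."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator vanishes."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Toy labels (hypothetical): 8 turn residues and 4 non-turn residues.
true_labels = [1] * 8 + [0] * 4
predicted = [1] * 6 + [0] * 2 + [1] * 1 + [0] * 3
counts = confusion_counts(true_labels, predicted)
print(counts, q_total(*counts), round(mcc(*counts), 4))
```

Reporting MCC alongside Q total matters for β-turns because the classes are imbalanced: a predictor that labels every residue as non-turn can still reach a high Q total while its MCC stays at zero.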
SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition
Melvin, Iain; Ie, Eugene; Kuang, Rui; Weston, Jason; Stafford, William Noble; Leslie, Christina
2007-01-01
Background Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community. Results We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at . Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. 
In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider. Conclusion By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition. PMID:17570145
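The core idea of weighting one-vs-the-rest outputs before picking a class can be sketched as follows. This is a deliberately simplified illustration, not the SVM-Fold algorithm itself: the class names, scores, and weights are hypothetical, and the actual method learns its weights and additionally encodes the structural hierarchy.

```python
def predict_class(scores, weights=None):
    """Return the class with the largest (optionally weighted) score.

    scores  -- dict mapping class name to a one-vs-rest classifier output
    weights -- dict mapping class name to a relative weight (default 1.0)
    """
    weights = weights or {}
    return max(scores, key=lambda c: weights.get(c, 1.0) * scores[c])

# Hypothetical raw classifier outputs that are not directly comparable.
scores = {"fold_a": 0.9, "fold_b": 0.6}
weights = {"fold_a": 0.5, "fold_b": 1.0}

print(predict_class(scores))           # plain one-vs-all argmax
print(predict_class(scores, weights))  # argmax after rescaling
```

With these illustrative numbers the plain argmax and the weighted argmax disagree, which is the situation the abstract describes when raw scores from independently trained classifiers are not on a common scale.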
SeqRate: sequence-based protein folding type classification and rates prediction
2010-01-01
Background Protein folding rate is an important property of a protein. Predicting protein folding rate is useful for understanding protein folding process and guiding protein design. Most previous methods of predicting protein folding rate require the tertiary structure of a protein as an input. And most methods do not distinguish the different kinetic nature (two-state folding or multi-state folding) of the proteins. Here we developed a method, SeqRate, to predict both protein folding kinetic type (two-state versus multi-state) and real-value folding rate using sequence length, amino acid composition, contact order, contact number, and secondary structure information predicted from only protein sequence with support vector machines. Results We systematically studied the contributions of individual features to folding rate prediction. On a standard benchmark dataset, the accuracy of folding kinetic type classification is 80%. The Pearson correlation coefficient and the mean absolute difference between predicted and experimental folding rates (sec-1) in the base-10 logarithmic scale are 0.81 and 0.79 for two-state protein folders, and 0.80 and 0.68 for three-state protein folders. SeqRate is the first sequence-based method for protein folding type classification and its accuracy of fold rate prediction is improved over previous sequence-based methods. Its performance can be further enhanced with additional information, such as structure-based geometric contacts, as inputs. Conclusions Both the web server and software of predicting folding rate are publicly available at http://casp.rnet.missouri.edu/fold_rate/index.html. PMID:20438647
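The evaluation described above compares predicted and experimental folding rates on a base-10 logarithmic scale using the Pearson correlation coefficient and the mean absolute difference. A minimal sketch of those two measures follows; the rate values are hypothetical and serve only to show the calculation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_abs_diff(xs, ys):
    """Mean absolute difference between paired values."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical folding rates in sec^-1 (not from the benchmark).
experimental = [10.0, 1000.0, 100000.0]
predicted = [100.0, 1000.0, 10000.0]

log_exp = [math.log10(r) for r in experimental]
log_pred = [math.log10(r) for r in predicted]
print(pearson(log_exp, log_pred), mean_abs_diff(log_exp, log_pred))
```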
Benchmarking protein classification algorithms via supervised cross-validation.
Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor
2008-04-24
Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
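Supervised cross-validation, as described here, holds out whole known subtypes rather than random members, so the test set contains only proteins from a subtype the classifier has never seen. The sketch below illustrates that splitting rule under simple assumptions: each protein carries a single subtype label, and the identifiers and family names are hypothetical.

```python
from collections import defaultdict

def supervised_splits(items):
    """Yield (held_out_subtype, train_ids, test_ids) for each subtype.

    items -- list of (protein_id, subtype) pairs
    """
    by_subtype = defaultdict(list)
    for protein_id, subtype in items:
        by_subtype[subtype].append(protein_id)
    for held_out in sorted(by_subtype):
        test = list(by_subtype[held_out])
        train = [pid
                 for subtype, ids in sorted(by_subtype.items())
                 if subtype != held_out
                 for pid in ids]
        yield held_out, train, test

# Hypothetical members of one class, labelled by family.
items = [("p1", "famA"), ("p2", "famA"), ("p3", "famB"), ("p4", "famC")]
for held_out, train, test in supervised_splits(items):
    print(held_out, train, test)
```

Random k-fold splitting would typically place members of the same family in both train and test sets, which is why the abstract reports lower, more conservative performance estimates for the supervised scheme.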
Jahandideh, Samad; Srinivasasainagendra, Vinodh; Zhi, Degui
2012-11-07
RNA-protein interaction plays an important role in various cellular processes, such as protein synthesis, gene regulation, post-transcriptional gene regulation, alternative splicing, and infections by RNA viruses. In this study, using Gene Ontology Annotated (GOA) and Structural Classification of Proteins (SCOP) databases an automatic procedure was designed to capture structurally solved RNA-binding protein domains in different subclasses. Subsequently, we applied tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class ℓ1/ℓq-regularized logistic regression (MCRLR) for analysis and classifying RNA-binding protein domains based on a comprehensive set of sequence and structural features. In this study, we compared prediction accuracy of three different state-of-the-art predictor methods. From our results, TMCSVM outperforms the other methods and suggests the potential of TMCSVM as a useful tool for facilitating the multi-class prediction of RNA-binding protein domains. On the other hand, MCRLR by elucidating importance of features for their contribution in predictive accuracy of RNA-binding protein domains subclasses, helps us to provide some biological insights into the roles of sequences and structures in protein-RNA interactions.
Towards quantitative classification of folded proteins in terms of elementary functions.
Hu, Shuangwei; Krokhotin, Andrei; Niemi, Antti J; Peng, Xubiao
2011-04-01
A comparative classification scheme provides a good basis for several approaches to understand proteins, including prediction of relations between their structure and biological function. But it remains a challenge to combine a classification scheme that describes a protein starting from its well-organized secondary structures and often involves direct human involvement, with an atomary-level physics-based approach where a protein is fundamentally nothing more than an ensemble of mutually interacting carbon, hydrogen, oxygen, and nitrogen atoms. In order to bridge these two complementary approaches to proteins, conceptually novel tools need to be introduced. Here we explain how an approach toward geometric characterization of entire folded proteins can be based on a single explicit elementary function that is familiar from nonlinear physical systems where it is known as the kink soliton. Our approach enables the conversion of hierarchical structural information into a quantitative form that allows for a folded protein to be characterized in terms of a small number of global parameters that are in principle computable from atomary-level considerations. As an example we describe in detail how the native fold of the myoglobin 1M6C emerges from a combination of kink solitons with a very high atomary-level accuracy. We also verify that our approach describes longer loops and loops connecting α helices with β strands, with the same overall accuracy. ©2011 American Physical Society
Protein Sequence Classification with Improved Extreme Learning Machine Algorithms
2014-01-01
Precisely classifying a protein sequence from a large biological protein sequence database plays an important role in developing competitive pharmacological products. Conventional methods, which compare the unseen sequence with all identified protein sequences and return the category index of the protein with the highest similarity score, are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single hidden layer feedforward neural networks (SLFNs). The recent efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimal pruned ELM (OP-ELM) is employed for protein sequence classification for the first time in this paper. To further enhance the performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensemble members. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms. PMID:24795876
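The final step described in this abstract, deriving one category index from several ensemble members by majority voting, can be sketched in a few lines. The tie-breaking rule (first label seen wins) and the category labels are assumptions made for illustration; the abstract does not specify how ties are resolved.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest label."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

# Hypothetical outputs of three ensemble members for one sequence.
print(majority_vote(["kinase", "protease", "kinase"]))
```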
Functional classification of protein structures by local structure matching in graph representation.
Mills, Caitlyn L; Garg, Rohan; Lee, Joslynn S; Tian, Liang; Suciu, Alexandru; Cooperman, Gene; Beuning, Penny J; Ondrechen, Mary Jo
2018-03-31
As a result of high-throughput protein structure initiatives, over 14,400 protein structures have been solved by structural genomics (SG) centers and participating research groups. While the totality of SG data represents a tremendous contribution to genomics and structural biology, reliable functional information for these proteins is generally lacking. Better functional predictions for SG proteins will add substantial value to the structural information already obtained. Our method described herein, Graph Representation of Active Sites for Prediction of Function (GRASP-Func), predicts quickly and accurately the biochemical function of proteins by representing residues at the predicted local active site as graphs rather than in Cartesian coordinates. We compare the GRASP-Func method to our previously reported method, structurally aligned local sites of activity (SALSA), using the ribulose phosphate binding barrel (RPBB), 6-hairpin glycosidase (6-HG), and Concanavalin A-like Lectins/Glucanase (CAL/G) superfamilies as test cases. In each of the superfamilies, SALSA and the much faster method GRASP-Func yield similar correct classification of previously characterized proteins, providing a validated benchmark for the new method. In addition, we analyzed SG proteins using our SALSA and GRASP-Func methods to predict function. Forty-one SG proteins in the RPBB superfamily, nine SG proteins in the 6-HG superfamily, and one SG protein in the CAL/G superfamily were successfully classified into one of the functional families in their respective superfamily by both methods. This improved, faster, validated computational method can yield more reliable predictions of function that can be used for a wide variety of applications by the community. © 2018 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
The COG database: new developments in phylogenetic classification of proteins from complete genomes
Tatusov, Roman L.; Natale, Darren A.; Garkavtsev, Igor V.; Tatusova, Tatiana A.; Shankavaram, Uma T.; Rao, Bachoti S.; Kiryutin, Boris; Galperin, Michael Y.; Fedorova, Natalie D.; Koonin, Eugene V.
2001-01-01
The database of Clusters of Orthologous Groups of proteins (COGs), which represents an attempt on a phylogenetic classification of the proteins encoded in complete genomes, currently consists of 2791 COGs including 45 350 proteins from 30 genomes of bacteria, archaea and the yeast Saccharomyces cerevisiae (http://www.ncbi.nlm.nih.gov/COG). In addition, a supplement to the COGs is available, in which proteins encoded in the genomes of two multicellular eukaryotes, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, and shared with bacteria and/or archaea were included. The new features added to the COG database include information pages with structural and functional details on each COG and literature references, improvements of the COGNITOR program that is used to fit new proteins into the COGs, and classification of genomes and COGs constructed by using principal component analysis. PMID:11125040
Hu, Jing; Zhang, Xiaolong; Liu, Xiaoming; Tang, Jinshan
2015-06-01
Discovering hot regions in protein-protein interaction is important for drug and protein design, while experimental identification of hot regions is a time-consuming and labor-intensive effort; thus, the development of predictive models can be very helpful. In hot region prediction research, some models are based on structure information, and others are based on a protein interaction network. However, the prediction accuracy of these methods can still be improved. In this paper, a new method is proposed for hot region prediction, which combines density-based incremental clustering with feature-based classification. The method uses density-based incremental clustering to obtain rough hot regions, and uses feature-based classification to remove the non-hot spot residues from the rough hot regions. Experimental results show that the proposed method significantly improves the prediction performance of hot regions. Copyright © 2015 Elsevier Ltd. All rights reserved.
ECOD: new developments in the evolutionary classification of domains
Schaeffer, R. Dustin; Liao, Yuxing; Cheng, Hua; Grishin, Nick V.
2017-01-01
Evolutionary Classification Of protein Domains (ECOD) (http://prodata.swmed.edu/ecod) comprehensively classifies protein with known spatial structures maintained by the Protein Data Bank (PDB) into evolutionary groups of protein domains. ECOD relies on a combination of automatic and manual weekly updates to achieve its high accuracy and coverage with a short update cycle. ECOD classifies the approximately 120 000 depositions of the PDB into more than 500 000 domains in ∼3400 homologous groups. We show the performance of the weekly update pipeline since the release of ECOD, describe improvements to the ECOD website and available search options, and discuss novel structures and homologous groups that have been classified in the recent updates. Finally, we discuss the future directions of ECOD and further improvements planned for the hierarchy and update process. PMID:27899594
Huang, Chuen-Der; Lin, Chin-Teng; Pal, Nikhil Ranjan
2003-12-01
The structure classification of proteins plays a very important role in bioinformatics, since the relationships and characteristics among known proteins can be exploited to predict the structure of new proteins. The success of a classification system depends heavily on two things: the tools being used and the features considered. For bioinformatics applications, the role of appropriate features has not been given adequate attention. In this investigation we use three novel ideas for multiclass protein fold classification. First, we use the gating neural network, where each input node is associated with a gate. This network can select important features in an online manner as learning proceeds. At the beginning of training, all gates are almost closed, i.e., no feature is allowed to enter the network. During training, gates corresponding to good features are completely opened while gates corresponding to bad features are closed more tightly, and some gates may be partially open. The second novel idea is to use a hierarchical learning architecture (HLA). The classifier in the first level of HLA classifies the protein features into four major classes: all alpha, all beta, alpha + beta, and alpha/beta. In the next level, another set of classifiers further classifies the protein features into 27 folds. The third novel idea is to induce the indirect coding features from the amino-acid composition sequence of proteins based on the N-gram concept. This provides us with more representative and discriminative new local features of protein sequences for multiclass protein fold classification. The proposed HLA with new indirect coding features increases the protein fold classification accuracy by about 12%. Moreover, the gating neural network is found to reduce the number of features drastically. 
Using only half of the original features selected by the gating neural network can reach comparable test accuracy as that using all the original features. The gating mechanism also helps us to get a better insight into the folding process of proteins. For example, tracking the evolution of different gates we can find which characteristics (features) of the data are more important for the folding process. And, of course, it also reduces the computation time.
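The N-gram concept mentioned in this abstract can be illustrated with plain n-gram frequency features computed over an amino-acid sequence. This is only a generic sketch: the paper's "indirect coding" features are derived in their own way, and the function name and example sequence here are hypothetical.

```python
from collections import Counter

def ngram_frequencies(sequence, n=2):
    """Map each length-n substring to its relative frequency."""
    total = len(sequence) - n + 1
    if total <= 0:
        return {}
    counts = Counter(sequence[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical short peptide, one-letter amino-acid codes.
print(ngram_frequencies("AAGA", n=2))
```

Such frequency vectors capture local sequence order that plain amino-acid composition discards, which is the motivation the abstract gives for the new local features.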
Topological properties of complex networks in protein structures
NASA Astrophysics Data System (ADS)
Kim, Kyungsik; Jung, Jae-Won; Min, Seungsik
2014-03-01
We study topological properties of networks in structural classification of proteins. We model the native-state protein structure as a network made of its constituent amino-acids and their interactions. We treat four structural classes of proteins composed predominantly of α helices and β sheets and consider several proteins from each of these classes whose sizes range from amino acids of the Protein Data Bank. Particularly, we simulate and analyze the network metrics such as the mean degree, the probability distribution of degree, the clustering coefficient, the characteristic path length, the local efficiency, and the cost. This work was supported by the KMAR and DP under Grant WISE project (153-3100-3133-302-350).
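Two of the network metrics listed above, the mean degree and the clustering coefficient, can be computed directly from an adjacency structure in which nodes are residues and edges are contacts. The sketch below uses the standard definitions (average local clustering coefficient); the four-node graph is a hypothetical toy example, not a real protein contact network.

```python
def mean_degree(adj):
    """Average number of neighbours per node in an undirected graph."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def clustering_coefficient(adj):
    """Average local clustering coefficient over all nodes.

    Nodes with fewer than two neighbours contribute zero.
    """
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        nb = list(nbrs)
        # Count edges that exist between pairs of neighbours.
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nb[j] in adj[nb[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Toy graph: a triangle (0, 1, 2) with one extra node attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(mean_degree(adj), clustering_coefficient(adj))
```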
Protein Secondary Structure Prediction Using AutoEncoder Network and Bayes Classifier
NASA Astrophysics Data System (ADS)
Wang, Leilei; Cheng, Jinyong
2018-03-01
Protein secondary structure prediction belongs to bioinformatics and is an important research area. In this paper, we propose a new method for protein secondary structure prediction using a Bayes classifier and an autoencoder network. Our experiments cover the construction of the model, the classification of parameters, and so on. The data set is the widely used CB513 protein data set. Accuracy is assessed by 3-fold cross-validation, from which we obtain the Q3 accuracy. The results illustrate that the autoencoder network improves the prediction accuracy of protein secondary structure.
Classification of ligand molecules in PDB with graph match-based structural superposition.
Shionyu-Mitsuyama, Clara; Hijikata, Atsushi; Tsuji, Toshiyuki; Shirai, Tsuyoshi
2016-12-01
The fast heuristic graph match algorithm for small molecules, COMPLIG, was improved by adding a structural superposition process to verify the atom-atom matching. The modified method was used to classify the small molecule ligands in the Protein Data Bank (PDB) by their three-dimensional structures, and 16,660 types of ligands in the PDB were classified into 7561 clusters. In contrast, a classification by a previous method (without structure superposition) generated 3371 clusters from the same ligand set. The characteristic feature in the current classification system is the increased number of singleton clusters, which contained only one ligand molecule in a cluster. Inspections of the singletons in the current classification system but not in the previous one implied that the major factors for the isolation were differences in chirality, cyclic conformations, separation of substructures, and bond length. Comparisons between current and previous classification systems revealed that the superposition-based classification was effective in clustering functionally related ligands, such as drugs targeted to specific biological processes, owing to the strictness of the atom-atom matching.
Merckel, Michael C; Huiskonen, Juha T; Bamford, Dennis H; Goldman, Adrian; Tuma, Roman
2005-04-15
Comparisons of bacteriophage PRD1 and adenovirus protein structures and virion architectures have been instrumental in unraveling an evolutionary relationship and have led to a proposal of a phylogeny-based virus classification. The structure of the PRD1 spike protein P5 provides further insight into the evolution of viral proteins. The crystallized P5 fragment comprises two structural domains: a globular knob and a fibrous shaft. The head folds into a ten-stranded jelly roll beta barrel, which is structurally related to the tumor necrosis factor (TNF) and the PRD1 coat protein domains. The shaft domain is a structural counterpart to the adenovirus spike shaft. The structural relationships between PRD1, TNF, and adenovirus proteins suggest that the vertex proteins may have originated from an ancestral TNF-like jelly roll coat protein via a combination of gene duplication and deletion.
Goonesekere, Nalin Cw
2009-01-01
The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between protein sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the Structural Classification of Proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-BLAST, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
CAB-Align: A Flexible Protein Structure Alignment Method Based on the Residue-Residue Contact Area.
Terashi, Genki; Takeda-Shitaka, Mayuko
2015-01-01
Proteins are flexible, and this flexibility has an essential functional role. Flexibility can be observed in loop regions, rearrangements between secondary structure elements, and conformational changes between entire domains. However, most protein structure alignment methods treat protein structures as rigid bodies. Thus, these methods fail to identify the equivalences of residue pairs in regions with flexibility. In this study, we considered that the evolutionary relationship between proteins corresponds directly to the residue-residue physical contacts rather than the three-dimensional (3D) coordinates of proteins. Thus, we developed a new protein structure alignment method, contact area-based alignment (CAB-align), which uses the residue-residue contact area to identify regions of similarity. The main purpose of CAB-align is to identify homologous relationships at the residue level between related protein structures. The CAB-align procedure comprises two main steps: First, a rigid-body alignment method based on local and global 3D structure superposition is employed to generate a sufficient number of initial alignments. Then, iterative dynamic programming is executed to find the optimal alignment. We evaluated the performance and advantages of CAB-align based on four main points: (1) agreement with the gold standard alignment, (2) alignment quality based on an evolutionary relationship without 3D coordinate superposition, (3) consistency of the multiple alignments, and (4) classification agreement with the gold standard classification. 
Comparisons of CAB-align with other state-of-the-art protein structure alignment methods (TM-align, FATCAT, and DaliLite) using our benchmark dataset showed that CAB-align performed robustly in obtaining high-quality alignments and generating consistent multiple alignments with high coverage and accuracy rates, and it performed extremely well when discriminating between homologous and nonhomologous pairs of proteins in both single and multi-domain comparisons. The CAB-align software is freely available to academic users as stand-alone software at http://www.pharm.kitasato-u.ac.jp/bmd/bmd/Publications.html.
2015-01-01
Identifying determinant(s) of protein thermostability is key for rational and data-driven protein engineering. By analyzing more than 130 pairs of mesophilic/(hyper)thermophilic proteins, we identified the quality (residue-wise energy) of hydrophobic interactions as a key factor for protein thermostability. This distinguishes our study from previous ones that investigated predominantly structural determinants. Considering this key factor, we successfully discriminated between pairs of mesophilic/(hyper)thermophilic proteins (discrimination accuracy: ∼80%) and searched for structural weak spots in E. coli dihydrofolate reductase (classification accuracy: 70%). PMID:24437522
Amaranth, quinoa and chia protein isolates: Physicochemical and structural properties.
López, Débora N; Galante, Micaela; Robson, María; Boeris, Valeria; Spelzini, Darío
2018-04-01
An increasing use of vegetable protein is required to support the production of protein-rich foods which can replace animal proteins in the human diet. Amaranth, chia and quinoa seeds contain proteins which have biological and functional properties that provide nutritional benefits due to their reasonably well-balanced aminoacid content. This review analyses these vegetable proteins and focuses on recent research on protein classification and isolation as well as structural characterization by means of fluorescence spectroscopy, surface hydrophobicity and differential scanning calorimetry. Isolation procedures have a profound influence on the structural properties of the proteins and, therefore, on their in vitro digestibility. The present article provides a comprehensive overview of the properties and characterization of these proteins. Copyright © 2017 Elsevier B.V. All rights reserved.
Using linear algebra for protein structural comparison and classification
2009-01-01
In this article, we describe a novel methodology to extract semantic characteristics from protein structures using linear algebra in order to compose structural signature vectors which may be used efficiently to compare and classify protein structures into fold families. These signatures are built from the pattern of hydrophobic intrachain interactions using Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) techniques. Considering proteins as documents and contacts as terms, we have built a retrieval system which is able to find conserved contacts in samples of myoglobin fold family and to retrieve these proteins among proteins of varied folds with precision of up to 80%. The classifier is a web tool available at our laboratory website. Users can search for similar chains from a specific PDB, view and compare their contact maps and browse their structures using a JMol plug-in. PMID:21637532
Using linear algebra for protein structural comparison and classification.
Gomide, Janaína; Melo-Minardi, Raquel; Dos Santos, Marcos Augusto; Neshich, Goran; Meira, Wagner; Lopes, Júlio César; Santoro, Marcelo
2009-07-01
In this article, we describe a novel methodology to extract semantic characteristics from protein structures using linear algebra in order to compose structural signature vectors which may be used efficiently to compare and classify protein structures into fold families. These signatures are built from the pattern of hydrophobic intrachain interactions using Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) techniques. Considering proteins as documents and contacts as terms, we have built a retrieval system which is able to find conserved contacts in samples of myoglobin fold family and to retrieve these proteins among proteins of varied folds with precision of up to 80%. The classifier is a web tool available at our laboratory website. Users can search for similar chains from a specific PDB, view and compare their contact maps and browse their structures using a JMol plug-in.
Rebelling for a Reason: Protein Structural “Outliers”
Arumugam, Gandhimathi; Nair, Anu G.; Hariharaputran, Sridhar; Ramanathan, Sowdhamini
2013-01-01
Analysis of structural variation in domain superfamilies can reveal constraints in protein evolution which aids protein structure prediction and classification. Structure-based sequence alignment of distantly related proteins, organized in PASS2 database, provides clues about structurally conserved regions among different functional families. Some superfamily members show large structural differences which are functionally relevant. This paper analyses the impact of structural divergence on function for multi-member superfamilies, selected from the PASS2 superfamily alignment database. Functional annotations within superfamilies, with structural outliers or ‘rebels’, are discussed in the context of structural variations. Overall, these data reinforce the idea that functional similarities cannot be extrapolated from mere structural conservation. The implication for fold-function prediction is that the functional annotations can only be inherited with very careful consideration, especially at low sequence identities. PMID:24073209
A new definition and properties of the similarity value between two protein structures.
Saberi Fathi, S M
2016-10-01
Knowledge regarding the 3D structure of a protein provides useful information about the protein's functional properties. Particularly, structural similarity between proteins can be used as a good predictor of functional similarity. One method that uses the 3D geometrical structure of proteins in order to compare them is the similarity value (SV). In this paper, we introduce a new definition of the SV measure for comparing two proteins. To this end, we consider the mass of the protein's atoms and concentrate on the number of protein's atoms to be compared. This defines a new measure, called the weighted similarity value (WSV), adding physical properties to geometrical properties. We also show that our results are in good agreement with the results obtained by TM-SCORE and DALILITE. WSV can be of use in protein classification and in drug discovery.
Pelay-Gimeno, Marta; Glas, Adrian; Koch, Oliver; Grossmann, Tom N
2015-01-01
Protein–protein interactions (PPIs) are involved at all levels of cellular organization, thus making the development of PPI inhibitors extremely valuable. The identification of selective inhibitors is challenging because of the shallow and extended nature of PPI interfaces. Inhibitors can be obtained by mimicking peptide binding epitopes in their bioactive conformation. For this purpose, several strategies have been evolved to enable a projection of side chain functionalities in analogy to peptide secondary structures, thereby yielding molecules that are generally referred to as peptidomimetics. Herein, we introduce a new classification of peptidomimetics (classes A–D) that enables a clear assignment of available approaches. Based on this classification, the Review summarizes strategies that have been applied for the structure-based design of PPI inhibitors through stabilizing or mimicking turns, β-sheets, and helices. PMID:26119925
Meslamani, Jamel; Rognan, Didier; Kellenberger, Esther
2011-05-01
The sc-PDB database is an annotated archive of druggable binding sites extracted from the Protein Data Bank. It contains all-atom coordinates for 8166 protein-ligand complexes, chosen for their geometrical and physico-chemical properties. The sc-PDB provides a functional annotation for proteins, a chemical description for ligands and the detailed intermolecular interactions for complexes. The sc-PDB now includes a hierarchical classification of all the binding sites within a functional class. The sc-PDB entries were first clustered according to the protein name, irrespective of the species. For each cluster, we identified dissimilar sites (e.g. catalytic and allosteric sites of an enzyme). SCOPE AND APPLICATIONS: The classification of sc-PDB targets by binding site diversity was intended to facilitate chemogenomics approaches to drug design. In ligand-based approaches, it avoids comparing ligands that do not share the same binding site. In structure-based approaches, it permits quantitative evaluation of the diversity of the binding site definition (variations in size, sequence and/or structure). The sc-PDB database is freely available at: http://bioinfo-pharma.u-strasbg.fr/scPDB.
Representing and comparing protein structures as paths in three-dimensional space
Zhi, Degui; Krishna, S Sri; Cao, Haibo; Pevzner, Pavel; Godzik, Adam
2006-01-01
Background: Most existing formulations of protein structure comparison are based on detailed atomic level descriptions of protein structures and bypass potential insights that arise from a higher-level abstraction. Results: We propose a structure comparison approach based on a simplified representation of a protein that describes its three-dimensional path by local curvature along the generalized backbone of the polypeptide. We have implemented a dynamic programming procedure that aligns curvatures of proteins by optimizing a defined measure of summed turning angle deviation. Conclusion: Although our procedure does not directly optimize global structural similarity as measured by RMSD, our benchmarking results indicate that it recovers surprisingly well the structural similarity defined by structure classification databases and traditional structure alignment programs. In addition, our program can recognize similarities between structures with extensive conformational changes that are beyond the ability of traditional structure alignment programs. We demonstrate applications of our procedure in several contexts of structure comparison. An implementation of our procedure, CURVE, is available as a public webserver. PMID:17052359
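The dynamic programming step described in the abstract above can be illustrated with a minimal sketch. The abstract does not give CURVE's scoring details, so everything specific here is an assumption: the function name `align_curvatures`, the use of absolute turning-angle differences in degrees, and the fixed gap penalty are hypothetical choices for illustration, not the published method.

```python
from typing import List, Tuple


def align_curvatures(
    a: List[float], b: List[float], gap: float = 30.0
) -> Tuple[float, List[Tuple[int, int]]]:
    """Globally align two turning-angle profiles (in degrees).

    Minimises the summed absolute angle deviation over matched positions
    plus a fixed penalty per unmatched position (hypothetical scoring).
    """
    n, m = len(a), len(b)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),  # match
                cost[i - 1][j] + gap,                           # skip a[i-1]
                cost[i][j - 1] + gap,                           # skip b[j-1]
            )
    # Trace back from the corner to recover the matched index pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


total, matched = align_curvatures([10.0, 50.0, 90.0], [10.0, 90.0])
print(total, matched)
```

In this toy call, the middle angle of the first profile is left unmatched, because skipping it costs less than pairing 50 with 90.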
Method for protein structure alignment
Blankenbecler, Richard; Ohlsson, Mattias; Peterson, Carsten; Ringner, Markus
2005-02-22
This invention provides a method for protein structure alignment. More particularly, the present invention provides a method for identification, classification and prediction of protein structures. The present invention involves two key ingredients. First, an energy or cost function formulation of the problem simultaneously in terms of binary (Potts) assignment variables and real-valued atomic coordinates. Second, a minimization of the energy or cost function by an iterative method, where in each iteration (1) a mean field method is employed for the assignment variables and (2) exact rotation and/or translation of atomic coordinates is performed, weighted with the corresponding assignment variables.
Shao, Wei; Liu, Mingxia; Zhang, Daoqiang
2016-01-01
The systematic study of subcellular location patterns is very important for fully characterizing the human proteome. Nowadays, with the great advances in automated microscopic imaging, accurate bioimage-based classification methods to predict protein subcellular locations are highly desired. All existing models were constructed on the independent parallel hypothesis, where the cellular component classes are positioned independently in a multi-class classification engine. The important structural information of cellular compartments is therefore missed. To address this problem and develop more accurate models, we propose a novel cell structure-driven classifier construction approach (SC-PSorter) that employs prior biological structural information in the learning model. Specifically, the structural relationship among the cellular components is reflected by a new codeword matrix under the error correcting output coding framework. Then, we construct multiple SC-PSorter-based classifiers corresponding to the columns of the error correcting output coding codeword matrix using a multi-kernel support vector machine classification approach. Finally, we perform the classifier ensemble by combining those multiple SC-PSorter-based classifiers via majority voting. We evaluate our method on a collection of 1636 immunohistochemistry images from the Human Protein Atlas database. The experimental results show that our method achieves an overall accuracy of 89.0%, which is 6.4% higher than the state-of-the-art method. The dataset and code can be downloaded from https://github.com/shaoweinuaa/. Contact: dqzhang@nuaa.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
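As a rough illustration of the error correcting output coding (ECOC) and majority voting steps mentioned in the abstract above, the sketch below decodes a vector of binary classifier outputs to the class with the nearest codeword. The class labels and the codeword matrix are invented for illustration; the structure-driven codeword matrix of SC-PSorter is not given in the abstract and is not reproduced here.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical codeword matrix: one row per class, one column per binary
# classifier. Rows differ in at least 3 positions, so one flipped bit
# can still be decoded correctly.
CODEWORDS: Dict[str, List[int]] = {
    "nucleus":       [0, 0, 0, 0, 0],
    "cytosol":       [1, 1, 1, 0, 0],
    "mitochondrion": [0, 0, 1, 1, 1],
    "vesicles":      [1, 1, 0, 1, 1],
}


def hamming(u: List[int], v: List[int]) -> int:
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def ecoc_decode(codewords: Dict[str, List[int]], bits: List[int]) -> str:
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codewords, key=lambda label: hamming(codewords[label], bits))


def majority_vote(labels: List[str]) -> str:
    """Combine the decisions of several classifiers by simple majority."""
    return Counter(labels).most_common(1)[0][0]


# One binary classifier (the last column) disagrees with the "cytosol" row.
decoded = ecoc_decode(CODEWORDS, [1, 1, 1, 0, 1])
final = majority_vote([decoded, "cytosol", "nucleus"])
print(decoded, final)
```

Hamming-distance decoding is only one common ECOC decoding rule; the paper may use a different one.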
Computational intelligence techniques for biological data mining: An overview
NASA Astrophysics Data System (ADS)
Faye, Ibrahima; Iqbal, Muhammad Javed; Said, Abas Md; Samir, Brahim Belhaouari
2014-10-01
Computational techniques have been successfully utilized for highly accurate analysis and modeling of multifaceted, raw biological data gathered from various genome sequencing projects. These techniques are proving much more effective at overcoming the limitations of traditional in vitro experiments on the constantly increasing sequence data. The most critical problems that have caught the attention of researchers include, but are not limited to: accurate structure and function prediction of unknown proteins, protein subcellular localization prediction, finding protein-protein interactions, protein fold recognition, and analysis of microarray gene expression data. To solve these problems, various classification and clustering techniques using machine learning have been extensively used in the published literature. These techniques include neural network algorithms, genetic algorithms, fuzzy ARTMAP, K-Means, K-NN, SVM, rough set classifiers, decision trees and HMM-based algorithms. Major difficulties in applying the above algorithms include the limitations of previous feature encoding and selection methods in extracting the best features, increasing classification accuracy and decreasing the running time overheads of the learning algorithms. This research is potentially useful in drug design and in the diagnosis of some diseases. This paper presents a concise overview of the well-known protein classification techniques.
Specificity of molecular interactions in transient protein-protein interaction interfaces.
Cho, Kyu-il; Lee, KiYoung; Lee, Kwang H; Kim, Dongsup; Lee, Doheon
2006-11-15
In this study, we investigate what types of interactions are specific to their biological function, and what types of interactions are persistent regardless of their functional category in transient protein-protein heterocomplexes. This is the first approach to analyze protein-protein interfaces systematically at the molecular interaction level in the context of protein functions. We perform systematic analysis at the molecular interaction level using classification and feature subset selection techniques prevalent in the field of pattern recognition. To represent the physicochemical properties of protein-protein interfaces, we design 18 molecular interaction types using canonical and noncanonical interactions. Then, we construct an input vector using the frequency of each interaction type in the protein-protein interface. We analyze the 131 interfaces of transient protein-protein heterocomplexes in PDB: 33 protease-inhibitors, 52 antibody-antigens, 46 signaling proteins including 4 cyclin dependent kinase and 26 G-protein. Using kNN classification and feature subset selection techniques, we show that there are specific interaction types based on their functional category, and such interaction types are conserved through the common binding mechanism, rather than through the sequence or structure conservation. The extracted interaction types are C(alpha)--H...O==C interaction, cation...anion interaction, amine...amine interaction, and amine...cation interaction. With these four interaction types, we achieve a classification success rate of up to 83.2% with leave-one-out cross-validation at k = 15. Of these four interaction types, C(alpha)--H...O==C shows binding specificity for protease-inhibitor complexes, while cation...anion interaction is predominant in signaling complexes. The amine...amine and amine...cation interactions give a minor contribution to the classification accuracy. When combined with these two interactions, they increase the accuracy by 3.8%. In the case of antibody-antigen complexes, the sign is somewhat ambiguous. From the evolutionary perspective, while protease-inhibitors and signaling proteins have optimized their interfaces to suit their biological functions, antibody-antigen interactions are happenstance, implying that antibody-antigen complexes do not show distinctive interaction types. Persistent interaction types such as pi...pi, amide-carbonyl, and hydroxyl-carbonyl interaction are also investigated. Analyzing the structural orientations of the pi...pi stacking interactions, we find that the herringbone shape is a major configuration in transient protein-protein interfaces. This result is different from that of the protein core, where parallel-displaced configurations are the major configuration. We also analyze the overall trend of amide-carbonyl and hydroxyl-carbonyl interactions. It is noticeable that nearly 82% of the interfaces have at least one hydroxyl-carbonyl interaction. (c) 2006 Wiley-Liss, Inc.
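The evaluation scheme named in the abstract above (kNN classification of interaction-type frequency vectors, scored by leave-one-out cross-validation) can be sketched in a few lines. The feature vectors below are synthetic toy values made up for illustration and are not data from the study; the function names and the choice of Euclidean distance are also assumptions.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (frequency vector, functional class)


def knn_predict(train: List[Sample], query: Sequence[float], k: int) -> str:
    """Label a query by majority vote among its k nearest training vectors."""
    neighbours = sorted(train, key=lambda s: dist(s[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def leave_one_out_accuracy(samples: List[Sample], k: int) -> float:
    """Hold out each sample in turn, train on the rest, and score the hits."""
    hits = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += knn_predict(rest, vector, k) == label
    return hits / len(samples)


# Synthetic two-feature toy data (NOT taken from the paper).
toy = [
    ((0.90, 0.10), "protease-inhibitor"),
    ((0.80, 0.20), "protease-inhibitor"),
    ((0.85, 0.10), "protease-inhibitor"),
    ((0.10, 0.90), "signaling"),
    ((0.20, 0.80), "signaling"),
    ((0.15, 0.90), "signaling"),
]
print(leave_one_out_accuracy(toy, k=1))
```

In the study the vectors would hold the frequencies of the selected interaction types per interface, and k would be tuned (the abstract reports k = 15).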
Doppelt-Azeroual, Olivia; Delfaud, François; Moriaud, Fabrice; de Brevern, Alexandre G
2010-04-01
Ligand-protein interactions are essential for biological processes, and precise characterization of protein binding sites is crucial to understand protein functions. MED-SuMo is a powerful technology to localize similar local regions on protein surfaces. Its heuristic is based on a 3D representation of macromolecules using specific surface chemical features associating chemical characteristics with geometrical properties. MED-SMA is an automated and fast method to classify binding sites. It is based on MED-SuMo technology, which builds a similarity graph, and it uses the Markov Clustering algorithm. Purine binding sites are well studied as drug targets. Here, purine binding sites of the Protein DataBank (PDB) are classified. Proteins potentially inhibited or activated through the same mechanism are gathered. Results are analyzed according to PROSITE annotations and to carefully refined functional annotations extracted from the PDB. As expected, binding sites associated with related mechanisms are gathered, for example, the Small GTPases. Nevertheless, protein kinases from different Kinome families are also found together, for example, Aurora-A and CDK2 proteins which are inhibited by the same drugs. Representative examples of different clusters are presented. The effectiveness of the MED-SMA approach is demonstrated as it gathers binding sites of proteins with similar structure-activity relationships. Moreover, an efficient new protocol associates structures absent of cocrystallized ligands to the purine clusters enabling those structures to be associated with a specific binding mechanism. Applications of this classification by binding mode similarity include target-based drug design and prediction of cross-reactivity and therefore potential toxic side effects.
Doppelt-Azeroual, Olivia; Delfaud, François; Moriaud, Fabrice; de Brevern, Alexandre G
2010-01-01
Ligand–protein interactions are essential for biological processes, and precise characterization of protein binding sites is crucial to understand protein functions. MED-SuMo is a powerful technology to localize similar local regions on protein surfaces. Its heuristic is based on a 3D representation of macromolecules using specific surface chemical features associating chemical characteristics with geometrical properties. MED-SMA is an automated and fast method to classify binding sites. It is based on MED-SuMo technology, which builds a similarity graph, and it uses the Markov Clustering algorithm. Purine binding sites are well studied as drug targets. Here, purine binding sites of the Protein DataBank (PDB) are classified. Proteins potentially inhibited or activated through the same mechanism are gathered. Results are analyzed according to PROSITE annotations and to carefully refined functional annotations extracted from the PDB. As expected, binding sites associated with related mechanisms are gathered, for example, the Small GTPases. Nevertheless, protein kinases from different Kinome families are also found together, for example, Aurora-A and CDK2 proteins which are inhibited by the same drugs. Representative examples of different clusters are presented. The effectiveness of the MED-SMA approach is demonstrated as it gathers binding sites of proteins with similar structure-activity relationships. Moreover, an efficient new protocol associates structures absent of cocrystallized ligands to the purine clusters enabling those structures to be associated with a specific binding mechanism. Applications of this classification by binding mode similarity include target-based drug design and prediction of cross-reactivity and therefore potential toxic side effects. PMID:20162627
Xu, Dong; Zhang, Jian; Roy, Ambrish; Zhang, Yang
2011-01-01
I-TASSER is an automated pipeline for protein tertiary structure prediction using multiple threading alignments and iterative structure assembly simulations. In CASP9 experiments, two new algorithms, QUARK and FG-MD, were added to the I-TASSER pipeline for improving the structural modeling accuracy. QUARK is a de novo structure prediction algorithm used for structure modeling of proteins that lack detectable template structures. For distantly homologous targets, QUARK models are found useful as a reference structure for selecting good threading alignments and guiding the I-TASSER structure assembly simulations. FG-MD is an atomic-level structural refinement program that uses structural fragments collected from the PDB structures to guide molecular dynamics simulation and improve the local structure of predicted model, including hydrogen-bonding networks, torsion angles and steric clashes. Despite considerable progress in both the template-based and template-free structure modeling, significant improvements on protein target classification, domain parsing, model selection, and ab initio folding of beta-proteins are still needed to further improve the I-TASSER pipeline. PMID:22069036
Extension of the classical classification of β-turns
de Brevern, Alexandre G.
2016-01-01
The functional properties of a protein primarily depend on its three-dimensional (3D) structure. These properties have classically been assigned, visualized and analysed on the basis of protein secondary structures. The β-turn is the third most important secondary structure after helices and β-strands. β-turns have been classified according to the values of the dihedral angles φ and ψ of the central residue. Conventionally, eight different types of β-turns have been defined, whereas those that cannot be defined are classified as type IV β-turns. This classification remains the most widely used. Nonetheless, the miscellaneous type IV β-turns represent one third of β-turn residues. An unsupervised specific clustering approach was designed to search for recurrent new turns in the type IV category. The classical rules of β-turn type assignment were central to the approach. The four most frequently occurring clusters defined the new β-turn types. Unexpectedly, these types, designated IV1, IV2, IV3 and IV4, represent half of the type IV β-turns and occur more frequently than many of the previously established types. These types show convincing particularities in terms of both structures and sequences, which allow the classical β-turn classification to be extended for the first time in 25 years. PMID:27627963
Extension of the classical classification of β-turns.
de Brevern, Alexandre G
2016-09-15
The functional properties of a protein primarily depend on its three-dimensional (3D) structure. These properties have classically been assigned, visualized and analysed on the basis of protein secondary structures. The β-turn is the third most important secondary structure after helices and β-strands. β-turns have been classified according to the values of the dihedral angles φ and ψ of the central residue. Conventionally, eight different types of β-turns have been defined, whereas those that cannot be defined are classified as type IV β-turns. This classification remains the most widely used. Nonetheless, the miscellaneous type IV β-turns represent one third of β-turn residues. An unsupervised specific clustering approach was designed to search for recurrent new turns in the type IV category. The classical rules of β-turn type assignment were central to the approach. The four most frequently occurring clusters defined the new β-turn types. Unexpectedly, these types, designated IV1, IV2, IV3 and IV4, represent half of the type IV β-turns and occur more frequently than many of the previously established types. These types show convincing particularities in terms of both structures and sequences, which allow the classical β-turn classification to be extended for the first time in 25 years.
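To make the "classical rules of β-turn type assignment" concrete, here is a minimal sketch of dihedral-angle based typing. It assumes the commonly quoted canonical (φ, ψ) values for the two central residues (i+1, i+2) and the usual tolerance rule (all four angles within 30°, with one allowed to deviate by up to 45°); these values should be checked against the primary literature before use. The cis-proline types (VIa1, VIa2, VIb) are omitted for brevity, and the new IV1 to IV4 types are not included because the abstract gives no angles for them.

```python
from typing import Dict, Tuple

# Assumed canonical angles (phi_i+1, psi_i+1, phi_i+2, psi_i+2) in degrees.
CANONICAL: Dict[str, Tuple[float, float, float, float]] = {
    "I":    (-60.0, -30.0, -90.0, 0.0),
    "I'":   (60.0, 30.0, 90.0, 0.0),
    "II":   (-60.0, 120.0, 80.0, 0.0),
    "II'":  (60.0, -120.0, -80.0, 0.0),
    "VIII": (-60.0, -30.0, -120.0, 120.0),
}


def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def assign_turn_type(
    phi1: float, psi1: float, phi2: float, psi2: float,
    tol: float = 30.0, loose: float = 45.0,
) -> str:
    """Return the first matching canonical type, or 'IV' if none matches."""
    angles = (phi1, psi1, phi2, psi2)
    for name, reference in CANONICAL.items():
        devs = [angle_diff(x, r) for x, r in zip(angles, reference)]
        within = sum(d <= tol for d in devs)
        # All four angles inside tol, or three inside and one inside loose.
        if within == 4 or (within == 3 and max(devs) <= loose):
            return name
    return "IV"


print(assign_turn_type(-60.0, -30.0, -90.0, 0.0))
print(assign_turn_type(180.0, 180.0, 180.0, 180.0))
```

Turns that fall through every canonical definition land in the miscellaneous type IV, which is the category the paper subdivides by clustering.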
Mi, Huaiyu; Huang, Xiaosong; Muruganujan, Anushya; Tang, Haiming; Mills, Caitlin; Kang, Diane; Thomas, Paul D
2017-01-04
The PANTHER database (Protein ANalysis THrough Evolutionary Relationships, http://pantherdb.org) contains comprehensive information on the evolution and function of protein-coding genes from 104 completely sequenced genomes. PANTHER software tools allow users to classify new protein sequences, and to analyze gene lists obtained from large-scale genomics experiments. In the past year, major improvements include a large expansion of classification information available in PANTHER, as well as significant enhancements to the analysis tools. Protein subfamily functional classifications have more than doubled due to progress of the Gene Ontology Phylogenetic Annotation Project. For human genes (as well as a few other organisms), PANTHER now also supports enrichment analysis using pathway classifications from the Reactome resource. The gene list enrichment tools include a new 'hierarchical view' of results, enabling users to leverage the structure of the classifications/ontologies; the tools also allow users to upload genetic variant data directly, rather than requiring prior conversion to a gene list. The updated coding single-nucleotide polymorphisms (SNP) scoring tool uses an improved algorithm. The hidden Markov model (HMM) search tools now use HMMER3, dramatically reducing search times and improving accuracy of E-value statistics. Finally, the PANTHER Tree-Attribute Viewer has been implemented in JavaScript, with new views for exploring protein sequence evolution. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Structural Characterisation of Proteins from the Peroxiredoxin Family
2014-01-01
The oligomerisation of protein subunits is an area of much research interest, in particular the relationship to protein... Keywords: peroxiredoxin, tecton, supramolecular assembly, TEM. Sponsoring/monitoring agency: U.S. Army Research Office, P.O. Box 12211, Research Triangle Park, NC 27709-2211.
Ribosome-inactivating proteins
Walsh, Matthew J; Dodd, Jennifer E; Hautbergue, Guillaume M
2013-01-01
Ribosome-inactivating proteins (RIPs) were first isolated over a century ago and have been shown to be catalytic toxins that irreversibly inactivate protein synthesis. Elucidation of atomic structures and molecular mechanism has revealed these proteins to be a diverse group subdivided into two classes. RIPs have been shown to exhibit RNA N-glycosidase activity and depurinate the 28S rRNA of the eukaryotic 60S ribosomal subunit. In this review, we compare archetypal RIP family members with other potent toxins that abolish protein synthesis: the fungal ribotoxins which directly cleave the 28S rRNA and the newly discovered Burkholderia lethal factor 1 (BLF1). BLF1 presents additional challenges to the current classification system since, like the ribotoxins, it does not possess RNA N-glycosidase activity but does irreversibly inactivate ribosomes. We further discuss whether the RIP classification should be broadened to include toxins achieving irreversible ribosome inactivation with similar turnovers to RIPs, but through different enzymatic mechanisms. PMID:24071927
UbSRD: The Ubiquitin Structural Relational Database.
Harrison, Joseph S; Jacobs, Tim M; Houlihan, Kevin; Van Doorslaer, Koenraad; Kuhlman, Brian
2016-02-22
The structurally defined ubiquitin-like homology fold (UBL) can engage in several unique protein-protein interactions and many of these complexes have been characterized with high-resolution techniques. Using Rosetta's structural classification tools, we have created the Ubiquitin Structural Relational Database (UbSRD), an SQL database of features for all 509 UBL-containing structures in the PDB, allowing users to browse these structures by protein-protein interaction and providing a platform for quantitative analysis of structural features. We used UbSRD to define the recognition features of ubiquitin (UBQ) and SUMO observed in the PDB and the orientation of the UBQ tail while interacting with certain types of proteins. While some of the interaction surfaces on UBQ and SUMO overlap, each molecule has distinct features that aid in molecular discrimination. Additionally, we find that the UBQ tail is malleable and can adopt a variety of conformations upon binding. UbSRD is accessible as an online resource at rosettadesign.med.unc.edu/ubsrd. Copyright © 2015 Elsevier Ltd. All rights reserved.
GPU-Based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.
Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd
2018-01-01
In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH, and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92 percent (CPU) to 96 percent (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.
CDD/SPARCLE: functional classification of proteins via subfamily domain architectures.
Marchler-Bauer, Aron; Bo, Yu; Han, Lianyi; He, Jane; Lanczycki, Christopher J; Lu, Shennan; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Wang, Zhouxi; Yamashita, Roxanne A; Zhang, Dachuan; Zheng, Chanjuan; Geer, Lewis Y; Bryant, Stephen H
2017-01-04
NCBI's Conserved Domain Database (CDD) aims at annotating biomolecular sequences with the location of evolutionarily conserved protein domain footprints, and functional sites inferred from such footprints. An archive of pre-computed domain annotation is maintained for proteins tracked by NCBI's Entrez database, and live search services are offered as well. CDD curation staff supplements a comprehensive collection of protein domain and protein family models, which have been imported from external providers, with representations of selected domain families that are curated in-house and organized into hierarchical classifications of functionally distinct families and sub-families. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Ali, Safdar; Majid, Abdul; Khan, Asifullah
2014-04-01
Development of an accurate and reliable intelligent decision-making method for the construction of a cancer diagnosis system is one of the fast growing research areas of health sciences. Such a decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein folding, interactions, structures, and sequence-order effects. In particular, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine amino acids offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in the case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. Our analysis demonstrates that ensemble-RF, ensemble-SVM and ensemble-KNN are more effective than their individual counterparts. The proposed 'IDM-PhyChm-Ens' method has shown improved performance compared to existing techniques.
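Two of the feature spaces named in the abstract above, amino acid composition and split amino acid composition, are simple enough to sketch. The version below is a plain residue-frequency encoding; the segment length `n_term` and the function names are assumptions for illustration, and the Hd/Hb-based weighting and pseudo amino acid composition used in the paper are not reproduced.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> List[float]:
    """Fraction of each standard residue in the sequence (20 values)."""
    seq = seq.upper()
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def split_amino_acid_composition(seq: str, n_term: int = 25) -> List[float]:
    """Concatenate compositions of the N-terminal, middle and C-terminal parts.

    n_term is a hypothetical segment length; both termini use the same size.
    Sequences should be longer than 2 * n_term for the segments not to overlap.
    """
    head = seq[:n_term]
    tail = seq[-n_term:] if n_term else ""
    middle = seq[n_term:len(seq) - n_term]
    features: List[float] = []
    for part in (head, middle, tail):
        features.extend(amino_acid_composition(part))
    return features


aac = amino_acid_composition("AAC")
saac = split_amino_acid_composition("ACDA", n_term=1)
print(len(aac), len(saac))
```

Each feature vector could then be fed to a separate classifier whose decisions are combined, as the abstract describes for the ensemble.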
Fast protein tertiary structure retrieval based on global surface shape similarity.
Sael, Lee; Li, Bin; La, David; Fang, Yi; Ramani, Karthik; Rustamov, Raif; Kihara, Daisuke
2008-09-01
Characterization and identification of similar tertiary structure of proteins provides rich information for investigating function and evolution. The importance of structure similarity searches is increasing as structure databases continue to expand, partly due to the structural genomics projects. A crucial drawback of conventional protein structure comparison methods, which compare structures by their main-chain orientation or the spatial arrangement of secondary structure, is that a database search is too slow to be done in real-time. Here we introduce a global surface shape representation by three-dimensional (3D) Zernike descriptors, which represent a protein structure compactly as a series expansion of 3D functions. With this simplified representation, the search speed against a few thousand structures takes less than a minute. To investigate the agreement between surface representation defined by 3D Zernike descriptor and conventional main-chain based representation, a benchmark was performed against a protein classification generated by the combinatorial extension algorithm. Despite the different representation, 3D Zernike descriptor retrieved proteins of the same conformation defined by combinatorial extension in 89.6% of the cases within the top five closest structures. The real-time protein structure search by 3D Zernike descriptor will open up new possibility of large-scale global and local protein surface shape comparison. 2008 Wiley-Liss, Inc.
Xia, Jiaqi; Peng, Zhenling; Qi, Dawei; Mu, Hongbo; Yang, Jianyi
2017-03-15
Protein fold classification is a critical step in protein structure prediction. There are two possible ways to classify protein folds. One is through template-based fold assignment and the other is ab-initio prediction using machine learning algorithms. The combination of both solutions to improve prediction accuracy has not been explored before. We developed two algorithms, HH-fold and SVM-fold, for protein fold classification. HH-fold is a template-based fold assignment algorithm using the HHsearch program. SVM-fold is a support vector machine-based ab-initio classification algorithm, in which a comprehensive set of features is extracted from three complementary sequence profiles. These two algorithms are then combined, resulting in the ensemble approach TA-fold. We performed a comprehensive assessment of the proposed methods by comparing them with ab-initio methods and template-based threading methods on six benchmark datasets. An accuracy of 0.799 was achieved by TA-fold on the DD dataset, which consists of proteins from 27 folds. This represents an improvement of 5.4-11.7% over ab-initio methods. After updating this dataset to include more proteins in the same folds, the accuracy increased to 0.971. In addition, TA-fold achieved >0.9 accuracy on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA-fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA-fold is attributed to the combination of template-based fold assignment and ab-initio classification using features from complementary sequence profiles that contain rich evolutionary information. Availability: http://yanglab.nankai.edu.cn/TA-fold/. Contact: yangjy@nankai.edu.cn or mhb-506@163.com. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
A Functional-Phylogenetic Classification System for Transmembrane Solute Transporters
Saier, Milton H.
2000-01-01
A comprehensive classification system for transmembrane molecular transporters has been developed and recently approved by the transport panel of the nomenclature committee of the International Union of Biochemistry and Molecular Biology. This system is based on (i) transporter class and subclass (mode of transport and energy coupling mechanism), (ii) protein phylogenetic family and subfamily, and (iii) substrate specificity. Almost all of the more than 250 identified families of transporters include members that function exclusively in transport. Channels (115 families), secondary active transporters (uniporters, symporters, and antiporters) (78 families), primary active transporters (23 families), group translocators (6 families), and transport proteins of ill-defined function or of unknown mechanism (51 families) constitute distinct categories. Transport mode and energy coupling prove to be relatively immutable characteristics and therefore provide primary bases for classification. Phylogenetic grouping reflects structure, function, mechanism, and often substrate specificity and therefore provides a reliable secondary basis for classification. Substrate specificity and polarity of transport prove to be more readily altered during evolutionary history and therefore provide a tertiary basis for classification. With very few exceptions, a phylogenetic family of transporters includes members that function by a single transport mode and energy coupling mechanism, although a variety of substrates may be transported, sometimes with either inwardly or outwardly directed polarity. In this review, I provide cross-referencing of well-characterized constituent transporters according to (i) transport mode, (ii) energy coupling mechanism, (iii) phylogenetic grouping, and (iv) substrates transported. The structural features and distribution of recognized family members throughout the living world are also evaluated. 
The tabulations should facilitate familial and functional assignments of newly sequenced transport proteins that will result from future genome sequencing projects. PMID:10839820
A simple and fast heuristic for protein structure comparison.
Pelta, David A; González, Juan R; Moreno Vega, Marcos
2008-03-25
Protein structure comparison is a key problem in bioinformatics. Several methods exist for comparing protein structures, one of which is to solve the Maximum Contact Map Overlap problem (MAX-CMO). Although this problem can be solved using exact algorithms, researchers need approximate algorithms that obtain good-quality solutions using fewer computational resources than exact ones. We propose a variable neighborhood search metaheuristic for solving MAX-CMO. We analyze this strategy in two respects: 1) from an optimization point of view, the strategy is tested on two different datasets, obtaining an error of 3.5% (over 2702 pairs) and 1.7% (over 161 pairs) with respect to optimal values, thus leading to highly accurate solutions in a simpler and less expensive way than exact algorithms; 2) in terms of protein structure classification, we conduct experiments on three datasets and show that it is feasible to detect structural similarities at SCOP's family and CATH's architecture levels using normalized overlap values. Some limitations and the role of normalization are outlined for classification at SCOP's fold level. We designed, implemented and tested a new tool for solving MAX-CMO, based on a well-known metaheuristic technique. The good balance between solution quality and computational effort makes it a valuable tool. Moreover, to the best of our knowledge, this is the first time the MAX-CMO measure is tested at SCOP's fold and CATH's architecture levels with encouraging results.
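The quantity that MAX-CMO maximizes can be illustrated with a small sketch. The code below only scores a given alignment; it does not search for the best one, and it is not the authors' variable neighborhood search. The 7.5 Å cutoff, the function names, and the particular normalization (twice the overlap divided by the total number of contacts) are assumptions chosen for illustration; the abstract does not specify which normalization was used.

```python
import math
from itertools import combinations


def contact_map(coords, cutoff=7.5):
    """Return the set of residue index pairs (i, j), i < j, within `cutoff` (assumed, in Å)."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) <= cutoff
    }


def overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that the alignment maps onto contacts of B.

    `alignment` maps residue indices of A to residue indices of B.
    """
    shared = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            k, l = alignment[i], alignment[j]
            if (min(k, l), max(k, l)) in contacts_b:
                shared += 1
    return shared


def normalized_overlap(contacts_a, contacts_b, alignment):
    """One possible normalization (an assumption): 2 * overlap / (|A| + |B|)."""
    total = len(contacts_a) + len(contacts_b)
    if total == 0:
        return 0.0
    return 2 * overlap(contacts_a, contacts_b, alignment) / total


# Toy example: four points on a line, 5 Å apart, compared with themselves.
line = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
cm = contact_map(line)
identity = {i: i for i in range(len(line))}
print(overlap(cm, cm, identity), normalized_overlap(cm, cm, identity))
```

A heuristic such as the one described in the abstract would repeatedly modify `alignment` and keep changes that increase `overlap`; the scoring function itself stays the same.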
RepeatsDB-lite: a web server for unit annotation of tandem repeat proteins.
Hirsh, Layla; Paladin, Lisanna; Piovesan, Damiano; Tosatto, Silvio C E
2018-05-09
RepeatsDB-lite (http://protein.bio.unipd.it/repeatsdb-lite) is a web server for the prediction of repetitive structural elements and units in tandem repeat (TR) proteins. TRs are a widespread but poorly annotated class of non-globular proteins carrying heterogeneous functions. RepeatsDB-lite extends the prediction to all TR types and strongly improves performance over previous methods in terms of both computational time and accuracy, with precision above 95% for solenoid structures. The algorithm exploits an improved TR unit library derived from the RepeatsDB database to perform an iterative structural search and assignment. The web interface provides tools for analyzing the evolutionary relationships between units and for manually refining the prediction by changing unit positions and protein classification. An all-against-all structure-based sequence similarity matrix is calculated and visualized in real time for every user edit. Reviewed predictions can be submitted to RepeatsDB for review and inclusion.
The Protein-DNA Interface database
2010-01-01
The Protein-DNA Interface database (PDIdb) is a repository containing relevant structural information of Protein-DNA complexes solved by X-ray crystallography and available at the Protein Data Bank. The database includes a simple functional classification of the protein-DNA complexes that consists of three hierarchical levels: Class, Type and Subtype. This classification has been defined and manually curated by humans based on the information gathered from several sources that include PDB, PubMed, CATH, SCOP and COPS. The current version of the database contains only structures with resolution of 2.5 Å or higher, accounting for a total of 922 entries. The major aim of this database is to contribute to the understanding of the main rules that underlie the molecular recognition process between DNA and proteins. To this end, the database is focused on each specific atomic interface rather than on the separated binding partners. Therefore, each entry in this database consists of a single and independent protein-DNA interface. We hope that PDIdb will be useful to many researchers working in fields such as the prediction of transcription factor binding sites in DNA, the study of specificity determinants that mediate enzyme recognition events, engineering and design of new DNA binding proteins with distinct binding specificity and affinity, among others. Finally, due to its friendly and easy-to-use web interface, we hope that PDIdb will also serve educational and teaching purposes. PMID:20482798
The Protein-DNA Interface database.
Norambuena, Tomás; Melo, Francisco
2010-05-18
The Protein-DNA Interface database (PDIdb) is a repository containing relevant structural information of Protein-DNA complexes solved by X-ray crystallography and available at the Protein Data Bank. The database includes a simple functional classification of the protein-DNA complexes that consists of three hierarchical levels: Class, Type and Subtype. This classification has been defined and manually curated by humans based on the information gathered from several sources that include PDB, PubMed, CATH, SCOP and COPS. The current version of the database contains only structures with resolution of 2.5 Å or higher, accounting for a total of 922 entries. The major aim of this database is to contribute to the understanding of the main rules that underlie the molecular recognition process between DNA and proteins. To this end, the database is focused on each specific atomic interface rather than on the separated binding partners. Therefore, each entry in this database consists of a single and independent protein-DNA interface. We hope that PDIdb will be useful to many researchers working in fields such as the prediction of transcription factor binding sites in DNA, the study of specificity determinants that mediate enzyme recognition events, engineering and design of new DNA binding proteins with distinct binding specificity and affinity, among others. Finally, due to its friendly and easy-to-use web interface, we hope that PDIdb will also serve educational and teaching purposes.
A new test set for validating predictions of protein-ligand interaction.
Nissink, J Willem M; Murray, Chris; Hartshorn, Mike; Verdonk, Marcel L; Cole, Jason C; Taylor, Robin
2002-12-01
We present a large test set of protein-ligand complexes for the purpose of validating algorithms that rely on the prediction of protein-ligand interactions. The set consists of 305 complexes with protonation states assigned by manual inspection. The following checks have been carried out to identify unsuitable entries in this set: (1) assessing the involvement of crystallographically related protein units in ligand binding; (2) identification of bad clashes between protein side chains and ligand; and (3) assessment of structural errors, and/or inconsistency of ligand placement with crystal structure electron density. In addition, the set has been pruned to assure diversity in terms of protein-ligand structures, and subsets are supplied for different protein-structure resolution ranges. A classification of the set by protein type is available. As an illustration, validation results are shown for GOLD and SuperStar. GOLD is a program that performs flexible protein-ligand docking, and SuperStar is used for the prediction of favorable interaction sites in proteins. The new CCDC/Astex test set is freely available to the scientific community (http://www.ccdc.cam.ac.uk).
de Moraes, Fábio R; Neshich, Izabella A P; Mazoni, Ivan; Yano, Inácio H; Pereira, José G C; Salim, José A; Jardine, José G; Neshich, Goran
2014-01-01
Protein-protein interactions are involved in nearly all regulatory processes in the cell and are considered one of the most important issues in molecular biology and pharmaceutical sciences but are still not fully understood. Structural and computational biology contributed greatly to the elucidation of the mechanism of protein interactions. In this paper, we present a collection of the physicochemical and structural characteristics that distinguish interface-forming residues (IFR) from free surface residues (FSR). We formulated a linear discriminative analysis (LDA) classifier to assess whether chosen descriptors from the BlueStar STING database (http://www.cbi.cnptia.embrapa.br/SMS/) are suitable for such a task. Receiver operating characteristic (ROC) analysis indicates that the particular physicochemical and structural descriptors used for building the linear classifier perform much better than a random classifier and in fact, successfully outperform some of the previously published procedures, whose performance indicators were recently compared by other research groups. The results presented here show that the selected set of descriptors can be utilized to predict IFRs, even when homologue proteins are missing (particularly important for orphan proteins where no homologue is available for comparative analysis/indication) or, when certain conformational changes accompany interface formation. The development of amino acid type specific classifiers is shown to increase IFR classification performance. Also, we found that the addition of an amino acid conservation attribute did not improve the classification prediction. This result indicates that the increase in predictive power associated with amino acid conservation is exhausted by adequate use of an extensive list of independent physicochemical and structural parameters that, by themselves, fully describe the nano-environment at protein-protein interfaces. 
The IFR classifier developed in this study is now integrated into the BlueStar STING suite of programs. Consequently, the prediction of protein-protein interfaces for all proteins available in the PDB is possible through STING_interfaces module, accessible at the following website: (http://www.cbi.cnptia.embrapa.br/SMS/predictions/index.html).
de Moraes, Fábio R.; Neshich, Izabella A. P.; Mazoni, Ivan; Yano, Inácio H.; Pereira, José G. C.; Salim, José A.; Jardine, José G.; Neshich, Goran
2014-01-01
Protein-protein interactions are involved in nearly all regulatory processes in the cell and are considered one of the most important issues in molecular biology and pharmaceutical sciences but are still not fully understood. Structural and computational biology contributed greatly to the elucidation of the mechanism of protein interactions. In this paper, we present a collection of the physicochemical and structural characteristics that distinguish interface-forming residues (IFR) from free surface residues (FSR). We formulated a linear discriminative analysis (LDA) classifier to assess whether chosen descriptors from the BlueStar STING database (http://www.cbi.cnptia.embrapa.br/SMS/) are suitable for such a task. Receiver operating characteristic (ROC) analysis indicates that the particular physicochemical and structural descriptors used for building the linear classifier perform much better than a random classifier and in fact, successfully outperform some of the previously published procedures, whose performance indicators were recently compared by other research groups. The results presented here show that the selected set of descriptors can be utilized to predict IFRs, even when homologue proteins are missing (particularly important for orphan proteins where no homologue is available for comparative analysis/indication) or, when certain conformational changes accompany interface formation. The development of amino acid type specific classifiers is shown to increase IFR classification performance. Also, we found that the addition of an amino acid conservation attribute did not improve the classification prediction. This result indicates that the increase in predictive power associated with amino acid conservation is exhausted by adequate use of an extensive list of independent physicochemical and structural parameters that, by themselves, fully describe the nano-environment at protein-protein interfaces. 
The IFR classifier developed in this study is now integrated into the BlueStar STING suite of programs. Consequently, the prediction of protein-protein interfaces for all proteins available in the PDB is possible through STING_interfaces module, accessible at the following website: (http://www.cbi.cnptia.embrapa.br/SMS/predictions/index.html). PMID:24489849
Xu, Qifang; Dunbrack, Roland L
2012-11-01
Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed. We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM-HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues. The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
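The greedy step described above can be sketched as follows. This is an illustrative outline only, not the PDBfam pipeline: the `Hit` record, the function names, and the `max_overlap` tolerance of 10 residues are hypothetical, and the sketch omits the later HMM-HMM and structure-alignment passes as well as the joining of alignments split by insertions.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    family: str    # family identifier (placeholder names in the example below)
    start: int     # first aligned residue, 1-based inclusive
    end: int       # last aligned residue, inclusive
    evalue: float  # alignment E-value; smaller is stronger


def overlap_len(a, b):
    """Number of residues shared by two hits on the same sequence."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def greedy_assign(hits, max_overlap=10):
    """Accept hits from strongest to weakest E-value.

    A hit is kept only if it overlaps every previously accepted hit by at
    most `max_overlap` residues (an assumed tolerance for partial overlaps).
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(overlap_len(hit, kept) <= max_overlap for kept in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)


hits = [
    Hit("famA", 1, 100, 1e-30),
    Hit("famB", 95, 200, 1e-10),   # overlaps famA by 6 residues: tolerated
    Hit("famC", 50, 150, 1e-20),   # overlaps famA by 51 residues: rejected
]
print([h.family for h in greedy_assign(hits)])
```

Processing hits in E-value order is what makes the procedure greedy: a strong assignment is never displaced by a weaker one, and the overlap tolerance lets adjacent domains share a few boundary residues.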
Classification of protein quaternary structure by functional domain composition
Yu, Xiaojing; Wang, Chuan; Li, Yixue
2006-01-01
Background: The number and the arrangement of subunits that form a protein are referred to as quaternary structure. Quaternary structure is an important protein attribute that is closely related to its function. Proteins with quaternary structure are called oligomeric proteins. Oligomeric proteins are involved in various biological processes, such as metabolism, signal transduction, and chromosome replication. Thus, it is highly desirable to develop some computational methods to automatically classify the quaternary structure of proteins from their sequences. Results: To explore this problem, we adopted an approach based on the functional domain composition of proteins. Every protein was represented by a vector calculated from the domains in the PFAM database. The nearest neighbor algorithm (NNA) was used for classifying the quaternary structure of proteins from this information. The jackknife cross-validation test was performed on the non-redundant protein dataset in which the sequence identity was less than 25%. The overall success rate obtained is 75.17%. Additionally, to demonstrate the effectiveness of this method, we predicted the proteins in an independent dataset and achieved an overall success rate of 84.11%. Conclusion: Compared with the amino acid composition method and Blast, the results indicate that the domain composition approach may be a more effective and promising high-throughput method in dealing with this complicated problem in bioinformatics. PMID:16584572
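The domain-composition idea can be illustrated with a minimal nearest-neighbor sketch. The binary encoding, the cosine similarity measure, the function names, and the domain and class labels are all assumptions made for this example; the abstract does not state how the vectors or the neighbor distance were defined.

```python
import math


def domain_vector(domains, vocabulary):
    """Binary vector: 1 if the protein contains the domain, else 0 (assumed encoding)."""
    present = set(domains)
    return [1 if d in present else 0 for d in vocabulary]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor_label(query, training):
    """Return the label of the training vector most similar to `query`.

    `training` is a list of (vector, label) pairs.
    """
    _, label = max(training, key=lambda item: cosine_similarity(query, item[0]))
    return label


# Placeholder domain names and quaternary-structure labels.
vocabulary = ["domA", "domB", "domC"]
training = [
    (domain_vector(["domA", "domB"], vocabulary), "homodimer"),
    (domain_vector(["domC"], vocabulary), "monomer"),
]
query = domain_vector(["domA"], vocabulary)
print(nearest_neighbor_label(query, training))
```

In a jackknife test such as the one reported, each protein would in turn be removed from `training`, classified from the remaining proteins, and the fraction of correct labels recorded.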
Bordner, Andrew J.; Gorin, Andrey A.
2008-05-12
Protein-protein interactions are ubiquitous and essential for cellular processes. High-resolution X-ray crystallographic structures of protein complexes can elucidate the details of their function and provide a basis for many computational and experimental approaches. Here we demonstrate that existing annotations of protein complexes, including those provided by the Protein Data Bank (PDB) itself, contain a significant fraction of incorrect annotations. Results: We have developed a method for identifying protein complexes in the PDB X-ray structures by a four-step procedure: (1) comprehensively collecting all protein-protein interfaces; (2) clustering similar protein-protein interfaces together; (3) estimating the probability that each cluster is relevant based on a diverse set of properties; and (4) finally combining these scores for each entry in order to predict the complex structure. Unlike previous annotation methods, consistent prediction of complexes with identical or almost identical protein content is ensured. The resulting clusters of biologically relevant interfaces provide a reliable catalog of evolutionarily conserved protein-protein interactions.
Magis, Cedrik; Stricher, François; van der Sloot, Almer M; Serrano, Luis; Notredame, Cedric
2010-07-16
This study addresses the relation between structural and functional similarity in proteins. We introduce a novel method named tree based on root mean square deviation (T-RMSD), which uses distance RMSD (dRMSD) variations to build fine-grained structure-based classifications of proteins. The main improvement of the T-RMSD over similar methods, such as Dali, is its capacity to produce the equivalent of a bootstrap value for each cluster node. We validated our approach on two domain families studied extensively for their role in many biological and pathological pathways: the small GTPase RAS superfamily and the cysteine-rich domains (CRDs) associated with the tumor necrosis factor receptors (TNFRs) family. Our analysis showed that T-RMSD is able to automatically recover and refine existing classifications. In the case of the small GTPase ARF subfamily, T-RMSD can distinguish GTP- from GDP-bound states, while in the case of CRDs it can identify two new subgroups associated with well defined functional features (ligand binding and formation of ligand pre-assembly complex). We show how hidden Markov models (HMMs) can be built on these new groups and propose a methodology to use these models simultaneously in order to do fine-grained functional genomic annotation without known 3D structures. T-RMSD, an open source freeware incorporated in the T-Coffee package, is available online.
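For readers unfamiliar with distance RMSD, the sketch below computes the general textbook quantity: the root mean square difference between corresponding intramolecular distances in two structures with the same number of equivalent positions. It is not the T-RMSD implementation, which builds trees from dRMSD variations; the function name is an assumption.

```python
import math
from itertools import combinations


def drmsd(coords_a, coords_b):
    """Distance RMSD between two equally sized lists of 3D points.

    Compares all pairwise intramolecular distances, so no superposition
    of the two structures is needed.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of positions")
    pairs = list(combinations(range(len(coords_a)), 2))
    if not pairs:
        raise ValueError("at least two positions are required")
    squared = sum(
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(squared / len(pairs))


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved = [(5.0, 5.0, 5.0), (5.0, 6.0, 5.0), (4.0, 5.0, 5.0)]      # a, rotated and translated
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # one distance changed
print(drmsd(a, moved), drmsd(a, stretched))
```

Because only internal distances are compared, a rigidly moved copy of a structure scores zero, which is why dRMSD is convenient for comparing many aligned positions without fitting each pair of structures.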
fRMSDPred: Predicting Local RMSD Between Structural Fragments Using Sequence Information
2007-04-04
machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel
An Algorithm for Protein Helix Assignment Using Helix Geometry
Cao, Chen; Xu, Shutan; Wang, Lincong
2015-01-01
Helices are one of the most common secondary structure elements in proteins and were among the earliest recognized. The assignment of helices in a protein underlies the analysis of its structure and function. Though the mathematical expression for a helical curve is simple, no previous assignment programs have used a genuine helical curve as a model for helix assignment. In this paper we present a two-step assignment algorithm. The first step searches for a series of bona fide helical curves, each of which best fits the coordinates of four successive backbone Cα atoms. The second step uses the best-fit helical curves as input to make the helix assignment. The application to the protein structures in the PDB (Protein Data Bank) demonstrates that the algorithm is able to accurately assign not only regular α-helices but also 3₁₀ and π helices as well as their left-handed versions. One salient feature of the algorithm is that the assigned helices are structurally more uniform than those assigned by previous programs. The structural uniformity should be useful for protein structure classification and prediction, while the accurate assignment of a helix to a particular type underlies structure-function relationships in proteins. PMID:26132394
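To make the geometry concrete, the sketch below recovers the twist, rise, and radius of a helix from four successive points that lie exactly on it. This is a simple closed-form estimate based on second differences of the coordinates, not the best-fit procedure of the paper, and all function names are assumptions. The test values (radius 2.3 Å, 100° per residue, 1.5 Å rise) are approximate textbook figures for Cα atoms in an α-helix.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def helix_parameters(p0, p1, p2, p3):
    """Return (twist in degrees, rise per residue, radius) for four successive points.

    For points on a regular helix, p[i-1] - 2*p[i] + p[i+1] points from p[i]
    toward the axis, so two such vectors give the axis direction and twist.
    A negative rise indicates a left-handed helix.
    """
    v1 = tuple(a - 2 * b + c for a, b, c in zip(p0, p1, p2))
    v2 = tuple(a - 2 * b + c for a, b, c in zip(p1, p2, p3))
    axis = _cross(v1, v2)
    length = _norm(axis)
    if length < 1e-12:
        raise ValueError("points are collinear or planar-degenerate; no helix axis")
    axis = tuple(x / length for x in axis)
    cos_twist = max(-1.0, min(1.0, _dot(v1, v2) / (_norm(v1) * _norm(v2))))
    twist = math.degrees(math.acos(cos_twist))
    rise = _dot(_sub(p2, p1), axis)
    radius = _norm(v1) / (2 * (1 - cos_twist))
    return twist, rise, radius


def ideal_helix(n, radius, twist_deg, rise):
    """Generate n points on a regular helix around the z axis."""
    t = math.radians(twist_deg)
    return [(radius * math.cos(i * t), radius * math.sin(i * t), i * rise) for i in range(n)]


print(helix_parameters(*ideal_helix(4, 2.3, 100.0, 1.5)))
```

On real coordinates the four atoms do not lie exactly on a helix, which is why a fitting step such as the one the authors describe is needed; the closed-form estimate here is only meant to show which quantities a helical curve through four Cα atoms determines.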
PrionHome: a database of prions and other sequences relevant to prion phenomena.
Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M A; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M
2012-01-01
Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through co-option of other protein copies. Prions contain no necessary nucleic acids, and are important both as pathogenic agents and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts.
Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing that the only abundant bias pattern is for asparagine bias with subsidiary serine bias. We anticipate that this database will be a useful experimental aid and reference resource. It is freely available at: http://libaio.biol.mcgill.ca/prion.
PrionHome: A Database of Prions and Other Sequences Relevant to Prion Phenomena
Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M. A.; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M.
2012-01-01
Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through co-option of other protein copies. Prions contain no necessary nucleic acids, and are important both as pathogenic agents and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts.
Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing that the only abundant bias pattern is for asparagine bias with subsidiary serine bias. We anticipate that this database will be a useful experimental aid and reference resource. It is freely available at: http://libaio.biol.mcgill.ca/prion. PMID:22363733
How do light harvesting proteins support long lived quantum coherences
2017-01-31
...the structural basis for these two forms, our aim is to generate hybrid proteins via synthetic biology approaches. We have shown that we can fully unfold and separate the... Subject terms: quantum biology, light harvesting, photosynthesis, AOARD.
Protein Information Resource: a community resource for expert annotation of protein data
Barker, Winona C.; Garavelli, John S.; Hou, Zhenglin; Huang, Hongzhan; Ledley, Robert S.; McGarvey, Peter B.; Mewes, Hans-Werner; Orcutt, Bruce C.; Pfeiffer, Friedhelm; Tsugita, Akira; Vinayaka, C. R.; Xiao, Chunlin; Yeh, Lai-Su L.; Wu, Cathy
2001-01-01
The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200 000 non-redundant protein sequences, which are classified into families and superfamilies and their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-International databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP. PMID:11125041
Heinz, Eva; Lithgow, Trevor
2014-01-01
Members of the Omp85/TpsB protein superfamily are ubiquitously distributed in Gram-negative bacteria, and function in protein translocation (e.g., FhaC) or the assembly of outer membrane proteins (e.g., BamA). Several recent findings are suggestive of a further level of variation in the superfamily, including the identification of the novel membrane protein assembly factor TamA and protein translocase PlpD. To investigate the diversity and the causal evolutionary events, we undertook a comprehensive comparative sequence analysis of the Omp85/TpsB proteins. A total of 10 protein subfamilies were apparent, distinguished in their domain structure and sequence signatures. In addition to the proteins FhaC, BamA, and TamA, for which structural and functional information is available, are families of proteins with so far undescribed domain architectures linked to the Omp85 β-barrel domain. This study brings a classification structure to a dynamic protein superfamily of high interest given its essential function for Gram-negative bacteria as well as its diverse domain architecture, and we discuss several scenarios of putative functions of these so far undescribed proteins. PMID:25101071
Annotation and Classification of CRISPR-Cas Systems
Makarova, Kira S.; Koonin, Eugene V.
2018-01-01
The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas (CRISPR-associated proteins) is a prokaryotic adaptive immune system that is represented in most archaea and many bacteria. Among the currently known prokaryotic defense systems, the CRISPR-Cas genomic loci show unprecedented complexity and diversity. Classification of CRISPR-Cas variants that would capture their evolutionary relationships to the maximum possible extent is essential for comparative genomic and functional characterization of this theoretically and practically important system of adaptive immunity. To this end, a multipronged approach has been developed that combines phylogenetic analysis of the conserved Cas proteins with comparison of gene repertoires and arrangements in CRISPR-Cas loci. This approach led to the current classification of CRISPR-Cas systems into three distinct types and ten subtypes for each of which signature genes have been identified. Comparative genomic analysis of the CRISPR-Cas systems in new archaeal and bacterial genomes performed over the 3 years elapsed since the development of this classification makes it clear that new types and subtypes of CRISPR-Cas need to be introduced. Moreover, this classification system captures only part of the complexity of CRISPR-Cas organization and evolution, due to the intrinsic modularity and evolutionary mobility of these immunity systems, resulting in numerous recombinant variants. Moreover, most of the cas genes evolve rapidly, complicating the family assignment for many Cas proteins and the use of family profiles for the recognition of CRISPR-Cas subtype signatures. Further progress in the comparative analysis of CRISPR-Cas systems requires integration of the most sensitive sequence comparison tools, protein structure comparison, and refined approaches for comparison of gene neighborhoods. PMID:25981466
Recognition of functional sites in protein structures.
Shulman-Peleg, Alexandra; Nussinov, Ruth; Wolfson, Haim J
2004-06-04
Recognition of regions on the surface of one protein that are similar to a binding site of another is crucial for the prediction of molecular interactions and for functional classifications. We first describe a novel method, SiteEngine, that assumes no sequence or fold similarities and is able to recognize proteins that have similar binding sites and may perform similar functions. We achieve high efficiency and speed by introducing a low-resolution surface representation via chemically important surface points, by hashing triangles of physico-chemical properties and by application of hierarchical scoring schemes for a thorough exploration of global and local similarities. We proceed to rigorously apply this method to functional site recognition in three possible ways: first, we search for a given functional site on a large set of complete protein structures. Second, a potential functional site on a protein of interest is compared with known binding sites, to recognize similar features. Third, a complete protein structure is searched for the presence of an a priori unknown functional site, similar to known sites. Our method is robust and efficient enough to allow computationally demanding applications such as the first and the third. From the biological standpoint, the first application may identify secondary binding sites of drugs that may lead to side-effects. The third application finds new potential sites on the protein that may provide targets for drug design. Each of the three applications may aid in assigning a function and in classification of binding patterns. We highlight the advantages and disadvantages of each type of search, provide examples of large-scale searches of the entire Protein Data Bank and make functional predictions.
Rajendran, Senthilnathan; Jothi, Arunachalam
2018-05-16
The three-dimensional structure of a protein depends on the interactions between its amino acid residues. These interactions are in turn influenced by various biophysical properties of the amino acids. There are several examples of proteins that share the same fold but are very dissimilar at the sequence level. For proteins to share a common fold, some crucial interactions must be maintained despite insignificant sequence similarity. Since the interactions arise from the biophysical properties of the amino acids, we should be able to detect descriptive patterns for folds at the property level. Accordingly, the main focus of our research is to analyze such proteins and to characterize them in terms of their biophysical properties. Protein structures with sequence similarity less than 40% were selected for ten different subfolds from three different mainfolds (according to the CATH classification) and were used for this analysis. We used the normalized values of 49 physicochemical, energetic and conformational properties of amino acids. We characterize the folds based on the average biophysical property values. We also observed fold-specific correlations between biophysical properties despite very low sequence similarity in our data. We further trained three different binary classification models (Naive Bayes, NB; Support Vector Machines, SVM; and Bayesian Generalized Linear Model, BGLM) that could discriminate mainfolds based on the biophysical properties. We also show that among the three generated models, the BGLM classifier was able to discriminate protein sequences in the all-beta category with 81.43% accuracy and all-alpha and alpha-beta proteins with 83.37% accuracy. Copyright © 2018 Elsevier Ltd. All rights reserved.
Ribosome-inactivating proteins: potent poisons and molecular tools.
Walsh, Matthew J; Dodd, Jennifer E; Hautbergue, Guillaume M
2013-11-15
Ribosome-inactivating proteins (RIPs) were first isolated over a century ago and have been shown to be catalytic toxins that irreversibly inactivate protein synthesis. Elucidation of atomic structures and molecular mechanism has revealed these proteins to be a diverse group subdivided into two classes. RIPs have been shown to exhibit RNA N-glycosidase activity and depurinate the 28S rRNA of the eukaryotic 60S ribosomal subunit. In this review, we compare archetypal RIP family members with other potent toxins that abolish protein synthesis: the fungal ribotoxins which directly cleave the 28S rRNA and the newly discovered Burkholderia lethal factor 1 (BLF1). BLF1 presents additional challenges to the current classification system since, like the ribotoxins, it does not possess RNA N-glycosidase activity but does irreversibly inactivate ribosomes. We further discuss whether the RIP classification should be broadened to include toxins achieving irreversible ribosome inactivation with similar turnovers to RIPs, but through different enzymatic mechanisms.
NASA Astrophysics Data System (ADS)
Indelicato, G.; Burkhard, P.; Twarock, R.
2017-04-01
We introduce here a mathematical procedure for the structural classification of a specific class of self-assembling protein nanoparticles (SAPNs) that are used as a platform for repetitive antigen display systems. These SAPNs have distinctive geometries as a consequence of the fact that their peptide building blocks are formed from two linked coiled coils that are designed to assemble into trimeric and pentameric clusters. This allows a mathematical description of particle architectures in terms of bipartite (3,5)-regular graphs. Exploiting the relation with fullerene graphs, we provide a complete atlas of SAPN morphologies. The classification enables a detailed understanding of the spectrum of possible particle geometries that can arise in the self-assembly process. Moreover, it provides a toolkit for a systematic exploitation of SAPNs in bioengineering in the context of vaccine design, predicting the density of B-cell epitopes on the SAPN surface, which is critical for a strong humoral immune response.
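As a rough illustration of the graph description above, the following Python sketch checks whether a candidate architecture is a bipartite (3,5)-regular graph. It assumes one reading of the abstract that is not stated there explicitly: each peptide chain is an edge joining one trimeric cluster node to one pentameric cluster node, so trimer nodes must have degree 3 and pentamer nodes degree 5. The function name and the node labels are hypothetical.

```python
from collections import Counter


def is_bipartite_3_5_regular(edges):
    """Check a candidate architecture given as (trimer_node, pentamer_node) pairs.

    Assumption: one edge per peptide chain, with trimer and pentamer nodes
    drawn from separate label sets, so the graph is bipartite by construction.
    """
    edges = list(edges)
    # Reject empty graphs and repeated edges (a simple graph is assumed).
    if not edges or len(set(edges)) != len(edges):
        return False
    trimer_degree = Counter(t for t, _ in edges)
    pentamer_degree = Counter(p for _, p in edges)
    return (all(d == 3 for d in trimer_degree.values())
            and all(d == 5 for d in pentamer_degree.values()))


# Smallest case allowed by the counting constraint 3 * n_trimers == 5 * n_pentamers:
# five trimer nodes fully connected to three pentamer nodes (15 chains).
k53 = [(("T", i), ("P", j)) for i in range(5) for j in range(3)]
print(is_bipartite_3_5_regular(k53))
print(is_bipartite_3_5_regular(k53[:-1]))  # one chain missing
```

The check covers only the degree condition; it says nothing about which of these graphs correspond to realizable particle geometries, which is what the classification in the abstract addresses.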
A simple and fast heuristic for protein structure comparison
Pelta, David A; González, Juan R; Moreno Vega, Marcos
2008-01-01
Background Protein structure comparison is a key problem in bioinformatics. Several methods exist for protein comparison, one of which is solving the Maximum Contact Map Overlap problem (MAX-CMO). Although this problem may be solved using exact algorithms, researchers require approximate algorithms that obtain good-quality solutions using fewer computational resources than exact ones. Results We propose a variable neighborhood search metaheuristic for solving MAX-CMO. We analyze this strategy in two aspects: 1) from an optimization point of view, the strategy is tested on two different datasets, obtaining an error of 3.5% (over 2702 pairs) and 1.7% (over 161 pairs) with respect to optimal values, thus leading to highly accurate solutions in a simpler and less expensive way than exact algorithms; 2) in terms of protein structure classification, we conduct experiments on three datasets and show that it is feasible to detect structural similarities at SCOP's family and CATH's architecture levels using normalized overlap values. Some limitations and the role of normalization are outlined for doing classification at SCOP's fold level. Conclusion We designed, implemented and tested a new tool for solving MAX-CMO, based on a well-known metaheuristic technique. The good balance between solution quality and computational effort makes it a valuable tool. Moreover, to the best of our knowledge, this is the first time the MAX-CMO measure is tested at SCOP's fold and CATH's architecture levels with encouraging results. Software is available for download at . PMID:18366735
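To make the quantity being optimized concrete, here is a minimal Python sketch that scores one given residue alignment by its contact map overlap. It does not implement the variable neighborhood search from the abstract, which searches over alignments; it only evaluates the objective for a fixed alignment. The normalization by the smaller contact count is an assumption, since the abstract does not say which normalization was used, and all names are hypothetical.

```python
def _undirected(contacts):
    """Return contacts as a set of sorted (i, j) pairs, dropping duplicates."""
    return {tuple(sorted(pair)) for pair in contacts}


def contact_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A whose aligned residue pair is also a contact of B.

    alignment maps residue indices of A to residue indices of B.
    """
    a = _undirected(contacts_a)
    b = _undirected(contacts_b)
    shared = 0
    for i, j in a:
        if i in alignment and j in alignment:
            if tuple(sorted((alignment[i], alignment[j]))) in b:
                shared += 1
    return shared


def normalized_overlap(contacts_a, contacts_b, alignment):
    """Overlap divided by the smaller contact count (assumed normalization)."""
    smaller = min(len(_undirected(contacts_a)), len(_undirected(contacts_b)))
    if smaller == 0:
        return 0.0
    return contact_overlap(contacts_a, contacts_b, alignment) / smaller


# Toy contact maps and an order-preserving alignment (illustrative values only).
contacts_a = [(0, 2), (1, 3), (2, 4)]
contacts_b = [(0, 2), (1, 4), (3, 5)]
alignment = {0: 0, 1: 1, 2: 2, 3: 4, 4: 5}
print(contact_overlap(contacts_a, contacts_b, alignment))
print(normalized_overlap(contacts_a, contacts_b, alignment))
```

A MAX-CMO solver would call a scoring function like this repeatedly while changing the alignment; a normalized value makes scores comparable across protein pairs of different size.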
Jia, Yi; Huan, Jun; Buhr, Vincent; Zhang, Jintao; Carayannopoulos, Leonidas N
2009-01-01
Background Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the "twilight-" or "midnight-" zones where pair-wise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail. Results Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biologic motivation of our study is to recognize common structure patterns in "immunoevasins", proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Conclusion We present a theoretic framework, offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and devise an efficient algorithm to identify approximately matched frequent subgraphs. By doing so, we significantly expanded the analytical power of sophisticated data mining algorithms in dealing with large volumes of complicated and noisy protein structure data. Without loss of generality, the choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty. PMID:19208148
DOE R&D Accomplishments Database
Chandonia, John-Marc; Hon, Gary; Walker, Nigel S.; Lo Conte, Loredana; Koehl, Patrice; Levitt, Michael; Brenner, Steven E.
2003-09-15
The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. Partially derived from the SCOP database of protein structure domains, it includes sequences for each domain and other resources useful for studying these sequences and domain structures. The current release of ASTRAL contains 54,745 domains, more than three times as many as the initial release four years ago. ASTRAL has undergone major transformations in the past two years. In addition to several complete updates each year, ASTRAL is now updated on a weekly basis with preliminary classifications of domains from newly released PDB structures. These classifications are available as a stand-alone database, as well as available integrated into other ASTRAL databases such as representative subsets. To enhance the utility of ASTRAL to structural biologists, all SCOP domains are now made available as PDB-style coordinate files as well as sequences. In addition to sequences and representative subsets based on SCOP domains, sequences and subsets based on PDB chains are newly included in ASTRAL. Several search tools have been added to ASTRAL to facilitate retrieval of data by individual users and automated methods.
Dunbrack, Roland L.
2012-01-01
Motivation: Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed. Results: We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM–HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues. Availability: The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly. Contact: Roland.Dunbracks@fccc.edu PMID:22942020
FRASS: the web-server for RNA structural comparison
2010-01-01
Background The impressive increase in novel RNA structures during the past few years demands automated methods for structure comparison. While many algorithms handle only small motifs, few techniques developed in recent years (ARTS, DIAL, SARA, SARSA, and LaJolla) are available for the structural comparison of large and intact RNA molecules. Results The FRASS web-server represents an RNA chain with its Gauss integrals and allows one to compare structures of RNA chains and to find similar entries in a database derived from the Protein Data Bank. We observed that FRASS scores correlate well with the ARTS and LaJolla similarity scores. Moreover, the web-server can also satisfactorily reproduce the DARTS classification of RNA 3D structures and the classification of the SCOR functions that was obtained by the SARA method. Conclusions The FRASS web-server can be easily used to detect relationships among RNA molecules and to efficiently scan the rapidly enlarging structural databases. PMID:20553602
Prediction of protein mutant stability using classification and regression tool.
Huang, Liang-Tsung; Saraboji, K; Ho, Shinn-Ying; Hwang, Shiow-Fen; Ponnuswamy, M N; Gromiha, M Michael
2007-02-01
Prediction of protein stability upon amino acid substitutions is an important problem in molecular biology, and solving it would help in designing stable mutants. In this work, we have analyzed the stability of protein mutants using two different datasets of 1396 and 2204 mutants obtained from the ProTherm database, for free energy change due to thermal (ΔΔG) and denaturant (ΔΔG(H2O)) denaturation, respectively. We have used a set of 48 physical, chemical, energetic and conformational properties of amino acid residues and computed the difference of amino acid properties for each mutant in both sets of data. These differences in amino acid properties have been related to protein stability (ΔΔG and ΔΔG(H2O)) and are used to train a classification and regression tool for predicting the stability of protein mutants. Further, we have tested the method with 4-fold, 5-fold and 10-fold cross-validation procedures. We found that the physical properties, shape and flexibility are important determinants of protein stability. The classification of mutants based on secondary structure (helix, strand, turn and coil) and solvent accessibility (buried, partially buried, partially exposed and exposed) distinguished the stabilizing/destabilizing mutants at an average accuracy of 81% and 80% for ΔΔG and ΔΔG(H2O), respectively. The correlation between the experimental and predicted stability change is 0.61 for ΔΔG and 0.44 for ΔΔG(H2O). Further, the free energy change due to the replacement of amino acid residue has been predicted within an average error of 1.08 kcal/mol and 1.37 kcal/mol for thermal and chemical denaturation, respectively. The relative importance of secondary structure and solvent accessibility, and the influence of the dataset on prediction of protein mutant stability have been discussed.
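The cross-validation procedure mentioned above can be illustrated with a short Python sketch that splits sample indices into k folds. It assumes a simple interleaved assignment without shuffling or stratification; the abstract does not say how the folds were actually drawn, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Samples are assigned to folds in an interleaved fashion (0, k, 2k, ...),
    which is an assumption made here for simplicity.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, test


# The thermal denaturation dataset in the abstract has 1396 mutants.
for k in (4, 5, 10):
    sizes = [len(test) for _, test in k_fold_indices(1396, k)]
    print(k, sizes)
```

Each mutant appears in exactly one test fold, so every prediction used to compute accuracy or correlation comes from a model that did not see that mutant during training.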
Conserved thioredoxin fold is present in Pisum sativum L. sieve element occlusion-1 protein
Umate, Pavan; Tuteja, Renu
2010-01-01
A homology-based three-dimensional model for the Pisum sativum sieve element occlusion 1 (Ps.SEO1) (forisome) protein was constructed. A stretch of amino acids (residues 320 to 456) which is well conserved in all known members of the forisome proteins was used to model the 3D structure of Ps.SEO1. The structural prediction was done using the Protein Homology/analogY Recognition Engine (PHYRE) web server. Based on studies of local sequence alignment, the thioredoxin-fold containing protein [Structural Classification of Proteins (SCOP) code d1o73a_], a member of the glutathione peroxidase family, was selected as a template for modeling the spatial structure of Ps.SEO1. Selection was based on comparison of primary sequence, higher match quality and alignment accuracy. Motif 1 (EVF) is conserved in Ps.SEO1, Vicia faba (Vf.For1) and Medicago truncatula (MT.SEO3); motif 2 (KKED) is well conserved across all forisome proteins and motif 3 (IGYIGNP) is conserved in Ps.SEO1 and Vf.For1. PMID:20404566
Single-Molecule Microscopy and Force Spectroscopy of Membrane Proteins
NASA Astrophysics Data System (ADS)
Engel, Andreas; Janovjak, Harald; Fotiadis, Dimtrios; Kedrov, Alexej; Cisneros, David; Müller, Daniel J.
Single-molecule atomic force microscopy (AFM) provides novel ways to characterize the structure-function relationship of native membrane proteins. High-resolution AFM topographs allow observing the structure of single proteins at sub-nanometer resolution as well as their conformational changes, oligomeric state, molecular dynamics and assembly. We will review these feasibilities illustrating examples of membrane proteins in native and reconstituted membranes. Classification of individual topographs of single proteins allows understanding the principles of motions of their extrinsic domains, to learn about their local structural flexibilities and to find the entropy minima of certain conformations. Combined with the visualization of functionally related conformational changes these insights allow understanding why certain flexibilities are required for the protein to function and how structurally flexible regions allow certain conformational changes. Complementary to AFM imaging, single-molecule force spectroscopy (SMFS) experiments detect molecular interactions established within and between membrane proteins. The sensitivity of this method makes it possible to measure interactions that stabilize secondary structures such as transmembrane α-helices, polypeptide loops and segments within. Changes in temperature or protein-protein assembly do not change the locations of stable structural segments, but influence their stability established by collective molecular interactions. Such changes alter the probability of proteins to choose a certain unfolding pathway. Recent examples have elucidated unfolding and refolding pathways of membrane proteins as well as their energy landscapes.
PACSY, a relational database management system for protein structure and chemical shift analysis.
Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo; Lee, Weontae; Markley, John L
2012-10-01
PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.edu.
Structural Organization and Strain Variation in the Genome of Varicella Zoster Virus
1984-10-23
Zoster; Growth of VZV in tissue culture; Structure and proteins of VZV; Structure of HSV DNA; Classification of herpesviruses based on DNA...structure; Strain variation in herpesvirus DNA; VZV DNA; Specific aims; II. MATERIALS AND METHODS; Cells and viruses; Isolation of virus...endonuclease fragments by colony hybridization; 21. Selected methods of restriction endonuclease mapping; 22. Identification of
Composite Structural Motifs of Binding Sites for Delineating Biological Functions of Proteins
Kinjo, Akira R.; Nakamura, Haruki
2012-01-01
Most biological processes are described as a series of interactions between proteins and other molecules, and interactions are in turn described in terms of atomic structures. To annotate protein functions as sets of interaction states at atomic resolution, and thereby to better understand the relation between protein interactions and biological functions, we conducted exhaustive all-against-all atomic structure comparisons of all known binding sites for ligands including small molecules, proteins and nucleic acids, and identified recurring elementary motifs. By integrating the elementary motifs associated with each subunit, we defined composite motifs that represent context-dependent combinations of elementary motifs. It is demonstrated that function similarity can be better inferred from composite motif similarity compared to the similarity of protein sequences or of individual binding sites. By integrating the composite motifs associated with each protein function, we define meta-composite motifs each of which is regarded as a time-independent diagrammatic representation of a biological process. It is shown that meta-composite motifs provide richer annotations of biological processes than sequence clusters. The present results serve as a basis for bridging atomic structures to higher-order biological phenomena by classification and integration of binding site structures. PMID:22347478
TMDIM: an improved algorithm for the structure prediction of transmembrane domains of bitopic dimers
NASA Astrophysics Data System (ADS)
Cao, Han; Ng, Marcus C. K.; Jusoh, Siti Azma; Tai, Hio Kuan; Siu, Shirley W. I.
2017-09-01
α-Helical transmembrane proteins are the most important drug targets in rational drug development. However, solving the experimental structures of these proteins remains difficult; therefore, computational methods to accurately and efficiently predict the structures are in great demand. We present an improved structure prediction method, TMDIM, based on Park et al. (Proteins 57:577-585, 2004) for predicting bitopic transmembrane protein dimers. The three major algorithmic improvements are the introduction of the packing type classification, the multiple-condition decoy filtering, and the cluster-based candidate selection. In a test of predicting nine known bitopic dimers, approximately 78% of our predictions achieved a successful fit (RMSD <2.0 Å) and 78% of the cases are better predicted than by the two other methods compared. Our method provides an alternative for modeling TM bitopic dimers of unknown structures for further computational studies. TMDIM is freely available on the web at https://cbbio.cis.umac.mo/TMDIM. The website is implemented in PHP, MySQL and Apache, with all major browsers supported.
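The success criterion quoted above (RMSD below 2.0 Å) can be illustrated with a small Python sketch. It assumes the predicted and reference coordinates are already superimposed and listed atom by atom in the same order; the abstract does not say which atoms were compared or how superposition was done, so those steps are left out. Function names and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    No superposition is performed; the inputs are assumed to be pre-aligned.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def is_successful_fit(coords_a, coords_b, cutoff=2.0):
    """Apply the RMSD < 2.0 Å criterion quoted in the abstract."""
    return rmsd(coords_a, coords_b) < cutoff


# Toy coordinates in Å (illustrative only): every atom is displaced by 1 Å.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(predicted, reference))
print(is_successful_fit(predicted, reference))
```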
DockQ: A Quality Measure for Protein-Protein Docking Models
Basu, Sankar
2016-01-01
The state of the art in assessing the structural quality of docking models is currently based on three related yet independent quality measures: Fnat, LRMS, and iRMS, as proposed and standardized by CAPRI. These quality measures quantify different aspects of the quality of a particular docking model and need to be viewed together to reveal the true quality, e.g. a model with relatively poor LRMS (>10Å) might still qualify as 'acceptable' with a decent Fnat (>0.50) and iRMS (<3.0Å). This is also the reason why the so-called CAPRI criteria for assessing the quality of docking models are defined by applying various ad-hoc cutoffs on these measures to classify a docking model into the four classes: Incorrect, Acceptable, Medium, or High quality. This classification has been useful in CAPRI, but since models are grouped in only four bins it is also rather limiting, making it difficult to rank models, correlate with scoring functions or use it as target function in machine learning algorithms. Here, we present DockQ, a continuous protein-protein docking model quality measure derived by combining Fnat, LRMS, and iRMS to a single score in the range [0, 1] that can be used to assess the quality of protein docking models. By using DockQ on CAPRI models it is possible to almost completely reproduce the original CAPRI classification into Incorrect, Acceptable, Medium and High quality, with an average PPV of 94% at 90% recall, demonstrating that there is no need to apply predefined ad-hoc cutoffs to classify docking models. Since DockQ recapitulates the CAPRI classification almost perfectly, it can be viewed as a higher resolution version of the CAPRI classification, making it possible to estimate model quality in a more quantitative way using Z-scores or sum of top ranked models, which has been so valuable for the CASP community.
The possibility to directly correlate a quality measure to a scoring function has been crucial for the development of scoring functions for protein structure prediction, and DockQ should be useful in a similar development in the protein docking field. DockQ is available at http://github.com/bjornwallner/DockQ/ PMID:27560519
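The abstract does not spell out how Fnat, LRMS and iRMS are combined. The Python sketch below uses the form given in the DockQ publication as best recalled here: each RMS term is scaled into (0, 1] by 1 / (1 + (rms / d)^2) and the two scaled terms are averaged with Fnat, with d = 8.5 Å for LRMS and d = 1.5 Å for iRMS. These constants and the function names are assumptions to be checked against the DockQ repository; this is not the reference implementation.

```python
def scaled_rms(rms, d):
    """Map an RMS deviation in Å to (0, 1]; 1 means a perfect match."""
    if rms < 0:
        raise ValueError("RMS values must be non-negative")
    return 1.0 / (1.0 + (rms / d) ** 2)


def dockq(fnat, lrms, irms, d_lrms=8.5, d_irms=1.5):
    """Combine Fnat, LRMS and iRMS into one score in [0, 1].

    d_lrms and d_irms are scaling constants assumed from the DockQ paper,
    not taken from the abstract.
    """
    if not 0.0 <= fnat <= 1.0:
        raise ValueError("Fnat must lie in [0, 1]")
    return (fnat + scaled_rms(lrms, d_lrms) + scaled_rms(irms, d_irms)) / 3.0


# The borderline model from the abstract: poor LRMS, decent Fnat and iRMS.
print(dockq(fnat=0.50, lrms=10.0, irms=3.0))
# A perfect model scores 1.
print(dockq(fnat=1.0, lrms=0.0, irms=0.0))
```

Because the result is continuous, models can be ranked or used as regression targets directly, which is the limitation of the four-bin CAPRI classification that the abstract points out.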
Toward a unified nomenclature for mammalian ADP-ribosyltransferases.
Hottiger, Michael O; Hassa, Paul O; Lüscher, Bernhard; Schüler, Herwig; Koch-Nolte, Friedrich
2010-04-01
ADP-ribosylation is a post-translational modification of proteins catalyzed by ADP-ribosyltransferases. It comprises the transfer of the ADP-ribose moiety from NAD+ to specific amino acid residues on substrate proteins or to ADP-ribose itself. Currently, 22 human genes encoding proteins that possess an ADP-ribosyltransferase catalytic domain are known. Recent structural and enzymological evidence from poly(ADP-ribose)polymerase (PARP) family members demonstrates that earlier proposed names and classifications of these proteins are no longer accurate. Here we summarize these new findings and propose a new consensus nomenclature for all ADP-ribosyltransferases (ARTs) based on the catalyzed reaction and on structural features. A unified nomenclature would facilitate communication between researchers both inside and outside the ADP-ribosylation field. 2009 Elsevier Ltd. All rights reserved.
PANDORA: keyword-based analysis of protein sets by integration of annotation sources.
Kaplan, Noam; Vaaknin, Avishay; Linial, Michal
2003-10-01
Recent advances in high-throughput methods and the application of computational tools for automatic classification of proteins have made it possible to carry out large-scale proteomic analyses. Biological analysis and interpretation of sets of proteins is a time-consuming undertaking carried out manually by experts. We have developed PANDORA (Protein ANnotation Diagram ORiented Analysis), a web-based tool that provides an automatic representation of the biological knowledge associated with any set of proteins. PANDORA uses a unique approach of keyword-based graphical analysis that focuses on detecting subsets of proteins that share unique biological properties and the intersections of such sets. PANDORA currently supports SwissProt keywords, NCBI Taxonomy, InterPro entries and the hierarchical classification terms from ENZYME, SCOP and GO databases. The integrated study of several annotation sources simultaneously allows a representation of biological relations of structure, function, cellular location, taxonomy, domains and motifs. PANDORA is also integrated into the ProtoNet system, thus allowing testing thousands of automatically generated clusters. We illustrate how PANDORA enhances the biological understanding of large, non-uniform sets of proteins originating from experimental and computational sources, without the need for prior biological knowledge on individual proteins.
DWARF – a data warehouse system for analyzing protein families
Fischer, Markus; Thai, Quan K; Grieb, Melanie; Pleiss, Jürgen
2006-01-01
Background The emerging field of integrative bioinformatics provides the tools to organize and systematically analyze vast amounts of highly diverse biological data and thus allows a novel understanding of complex biological systems to be gained. The data warehouse DWARF applies integrative bioinformatics approaches to the analysis of large protein families. Description The data warehouse system DWARF integrates data on sequence, structure, and functional annotation for protein fold families. The underlying relational data model consists of three major sections representing entities related to the protein (biochemical function, source organism, classification to homologous families and superfamilies), the protein sequence (position-specific annotation, mutant information), and the protein structure (secondary structure information, superimposed tertiary structure). Tools for extracting, transforming and loading data from publicly available resources (ExPDB, GenBank, DSSP) are provided to populate the database. The data can be accessed by an interface for searching and browsing, and by analysis tools that operate on annotation, sequence, or structure. We applied DWARF to the family of α/β-hydrolases to host the Lipase Engineering database. Release 2.3 contains 6138 sequences and 167 experimentally determined protein structures, which are assigned to 37 superfamilies and 103 homologous families. Conclusion DWARF has been designed for constructing databases of large structurally related protein families and for evaluating their sequence-structure-function relationships by a systematic analysis of sequence, structure and functional annotation. It has been applied to predict biochemical properties from sequence, and serves as a valuable tool for protein engineering. PMID:17094801
Zheng, Meiying; Cooper, David R.; Grossoehme, Nickolas E.; Yu, Minmin; Hung, Li-Wei; Cieslik, Marcin; Derewenda, Urszula; Lesley, Scott A.; Wilson, Ian A.; Giedroc, David P.; Derewenda, Zygmunt S.
2009-01-01
The GntR superfamily of dimeric transcription factors, with more than 6200 members encoded in bacterial genomes, is characterized by N-terminal winged-helix DNA-binding domains and diverse C-terminal regulatory domains which provide a basis for the classification of the constituent families. The largest of these families, FadR, contains nearly 3000 proteins with all-α-helical regulatory domains classified into two related Pfam families: FadR_C and FCD. Only two crystal structures of FadR-family members, those of Escherichia coli FadR protein and LldR from Corynebacterium glutamicum, have been described to date in the literature. Here, the crystal structure of TM0439, a GntR regulator with an FCD domain found in the Thermotoga maritima genome, is described. The FCD domain is similar to that of the LldR regulator and contains a buried metal-binding site. Using atomic absorption spectroscopy and Trp fluorescence, it is shown that the recombinant protein contains bound Ni2+ ions but that it is able to bind Zn2+ with Kd < 70 nM. It is concluded that Zn2+ is the likely physiological metal and that it may perform either structural or regulatory roles or both. Finally, the TM0439 structure is compared with two other FadR-family structures recently deposited by structural genomics consortia. The results call for a revision in the classification of the FadR family of transcription factors. PMID:19307717
The proteome: structure, function and evolution
Fleming, Keiran; Kelley, Lawrence A; Islam, Suhail A; MacCallum, Robert M; Muller, Arne; Pazos, Florencio; Sternberg, Michael J.E
2006-01-01
This paper reports two studies to model the inter-relationships between protein sequence, structure and function. First, an automated pipeline to provide a structural annotation of proteomes in the major genomes is described. The results are stored in a database at Imperial College, London (3D-GENOMICS) that can be accessed at www.sbg.bio.ic.ac.uk. Analysis of the assignments to structural superfamilies provides evolutionary insights. 3D-GENOMICS is being integrated with related proteome annotation data at University College London and the European Bioinformatics Institute in a project known as e-protein (http://www.e-protein.org/). The second topic is motivated by the developments in structural genomics projects in which the structure of a protein is determined prior to knowledge of its function. We have developed a new approach PHUNCTIONER that uses the gene ontology (GO) classification to supervise the extraction of the sequence signal responsible for protein function from a structure-based sequence alignment. Using GO we can obtain profiles for a range of specificities described in the ontology. In the region of low sequence similarity (around 15%), our method is more accurate than assignment from the closest structural homologue. The method is also able to identify the specific residues associated with the function of the protein family. PMID:16524832
2005-09-01
Escherichia coliphage virus, and ovalbumin (OV) protein species. The work suggests certain improvements that can be made to the IMS detection System...Escherichia coliphage virus, and ovalbumin (OV) protein species. However, the origin and structural identities of the pyrolyzate peaks observed in the GC-IMS...niger), Gram-negative Pantoea agglomerans ((EH) formerly Erwinia herbicola), ovalbumin protein (OV), and the MS-2 Escherichia coliphage virus were
Topological distribution of four-alpha-helix bundles.
Presnell, S R; Cohen, F E
1989-01-01
The four-alpha-helix bundle, a common structural motif in globular proteins, provides an excellent forum for the examination of predictive constraints for protein backbone topology. An exhaustive examination of the Brookhaven Crystallographic Protein Data Bank and other literature sources has led to the discovery of 20 putative four-alpha-helix bundles. Application of an analytical method that examines the difference between solvent-accessible surface areas in packed and partially unpacked bundles reduced the number of structures to 16. Angular requirements further reduced the list of bundles to 13. In 12 of these bundles, all pairs of neighboring helices were oriented in an anti-parallel fashion. This distribution is in accordance with structure types expected if the helix macro dipole effect makes a substantial contribution to the stability of the native structure. The characterizations and classifications made in this study prompt a reevaluation of constraints used in structure prediction efforts. PMID:2771946
Accurate secondary structure prediction and fold recognition for circular dichroism spectroscopy
Micsonai, András; Wien, Frank; Kernya, Linda; Lee, Young-Ho; Goto, Yuji; Réfrégiers, Matthieu; Kardos, József
2015-01-01
Circular dichroism (CD) spectroscopy is a widely used technique for the study of protein structure. Numerous algorithms have been developed for the estimation of the secondary structure composition from the CD spectra. These methods often fail to provide acceptable results on α/β-mixed or β-structure–rich proteins. The problem arises from the spectral diversity of β-structures, which has hitherto been considered as an intrinsic limitation of the technique. The predictions are less reliable for proteins of unusual β-structures such as membrane proteins, protein aggregates, and amyloid fibrils. Here, we show that the parallel/antiparallel orientation and the twisting of the β-sheets account for the observed spectral diversity. We have developed a method called β-structure selection (BeStSel) for the secondary structure estimation that takes into account the twist of β-structures. This method can reliably distinguish parallel and antiparallel β-sheets and accurately estimates the secondary structure for a broad range of proteins. Moreover, the secondary structure components applied by the method are characteristic to the protein fold, and thus the fold can be predicted to the level of topology in the CATH classification from a single CD spectrum. By constructing a web server, we offer a general tool for a quick and reliable structure analysis using conventional CD or synchrotron radiation CD (SRCD) spectroscopy for the protein science research community. The method is especially useful when X-ray or NMR techniques fail. Using BeStSel on data collected by SRCD spectroscopy, we investigated the structure of amyloid fibrils of various disease-related proteins and peptides. PMID:26038575
Protein structure determination by exhaustive search of Protein Data Bank derived databases.
Stokes-Rees, Ian; Sliz, Piotr
2010-12-14
Parallel sequence and structure alignment tools have become ubiquitous and invaluable at all levels in the study of biological systems. We demonstrate the application and utility of this same parallel search paradigm to the process of protein structure determination, benefitting from the large and growing corpus of known structures. Such searches were previously computationally intractable. Through the method of Wide Search Molecular Replacement, developed here, they can be completed in a few hours with the aid of national-scale federated cyberinfrastructure. By dramatically expanding the range of models considered for structure determination, we show that small (less than 12% structural coverage) and low sequence identity (less than 20% identity) template structures can be identified through multidimensional template scoring metrics and used for structure determination. Many new macromolecular complexes can benefit significantly from such a technique due to the lack of known homologous protein folds or sequences. We demonstrate the effectiveness of the method by determining the structure of a full-length p97 homologue from Trichoplusia ni. Example cases with the MHC/T-cell receptor complex and the EmoB protein provide systematic estimates of minimum sequence identity, structure coverage, and structural similarity required for this method to succeed. We describe how this structure-search approach and other novel computationally intensive workflows are made tractable through integration with the US national computational cyberinfrastructure, allowing, for example, rapid processing of the entire Structural Classification of Proteins protein fragment database.
LenVarDB: database of length-variant protein domains.
Mutt, Eshita; Mathew, Oommen K; Sowdhamini, Ramanathan
2014-01-01
Protein domains are functionally and structurally independent modules, which add to the functional variety of proteins. This array of functional diversity has been enabled by evolutionary changes, such as amino acid substitutions or insertions or deletions, occurring in these protein domains. Length variations (indels) can introduce changes at structural, functional and interaction levels. LenVarDB (freely available at http://caps.ncbs.res.in/lenvardb/) traces these length variations, starting from structure-based sequence alignments in our Protein Alignments organized as Structural Superfamilies (PASS2) database, across 731 structural classification of proteins (SCOP)-based protein domain superfamilies connected to 2 730 625 sequence homologues. Alignment of sequence homologues corresponding to a structural domain is available, starting from a structure-based sequence alignment of the superfamily. Orientation of the length-variant (indel) regions in protein domains can be visualized by mapping them on the structure and on the alignment. Knowledge about location of length variations within protein domains and their visual representation will be useful in predicting changes within structurally or functionally relevant sites, which may ultimately regulate protein function. Non-technical summary: Evolutionary changes bring about natural changes to proteins that may be found in many organisms. Such changes could be reflected as amino acid substitutions or insertions-deletions (indels) in protein sequences. LenVarDB is a database that provides an early overview of observed length variations that were set among 731 protein families and after examining >2 million sequences. Indels are followed up to observe if they are close to the active site such that they can affect the activity of proteins. Inclusion of such information can aid the design of bioengineering experiments.
Classification and Lineage Tracing of SH2 Domains Throughout Eukaryotes.
Liu, Bernard A
2017-01-01
Today there exists a rapidly expanding number of sequenced genomes. Cataloging protein interaction domains such as the Src Homology 2 (SH2) domain across these various genomes can be accomplished with ease due to existing algorithms and prediction models. An evolutionary analysis of SH2 domains provides a step towards understanding how SH2 proteins integrated with existing signaling networks to position phosphotyrosine signaling as a crucial driver of robust cellular communication networks in metazoans. However, organizing and tracing SH2 domains across organisms and understanding their evolutionary trajectory remains a challenge. This chapter describes several methodologies for analyzing the evolutionary trajectory of SH2 domains, including a global SH2 domain classification system, which facilitates annotation of new SH2 sequences essential for tracing the lineage of SH2 domains throughout eukaryote evolution. This classification utilizes a combination of sequence homology, protein domain architecture and the boundary positions between introns and exons within the SH2 domain or genes encoding these domains. Discrete SH2 families can then be traced across various genomes to provide insight into their origins. Furthermore, additional methods for examining potential mechanisms for divergence of SH2 domains, from structural changes to alterations in the protein domain content and genome duplication, will be discussed. A better understanding of SH2 domain evolution may therefore enhance our insight into the emergence of phosphotyrosine signaling and the expansion of protein interaction domains.
Derkacs, Amanda D Felder; Ward, Samuel R; Lieber, Richard L
2012-02-01
Understanding cytoskeletal dynamics in living tissue is prerequisite to understanding mechanisms of injury, mechanotransduction, and mechanical signaling. Real-time visualization is now possible using transfection with plasmids that encode fluorescent cytoskeletal proteins. Using this approach with the muscle-specific intermediate filament protein desmin, we found that a green fluorescent protein-desmin chimeric protein was unevenly distributed throughout the muscle fiber, resulting in some image areas that were saturated as well as others that lacked any signal. Our goal was to analyze the muscle fiber cytoskeletal network quantitatively in an unbiased fashion. To objectively select areas of the muscle fiber that are suitable for analysis, we devised a method that provides objective classification of regions of images of striated cytoskeletal structures into "usable" and "unusable" categories. This method consists of a combination of spatial analysis of the image using Fourier methods along with a boosted neural network that "decides" on the quality of the image based on previous training. We trained the neural network using the expert opinion of three scientists familiar with these types of images. We found that this method was over 300 times faster than manual classification and that it permitted objective and accurate classification of image regions.
NPIDB: Nucleic acid-Protein Interaction DataBase.
Kirsanov, Dmitry D; Zanegina, Olga N; Aksianov, Evgeniy A; Spirin, Sergei A; Karyagina, Anna S; Alexeevski, Andrei V
2013-01-01
The Nucleic acid-Protein Interaction DataBase (http://npidb.belozersky.msu.ru/) contains information derived from structures of DNA-protein and RNA-protein complexes extracted from the Protein Data Bank (3846 complexes in October 2012). It provides a web interface and a set of tools for extracting biologically meaningful characteristics of nucleoprotein complexes. The content of the database is updated weekly. The current version of the Nucleic acid-Protein Interaction DataBase is an upgrade of the version published in 2007. The improvements include a new web interface, new tools for calculation of intermolecular interactions, a classification of SCOP families that contains DNA-binding protein domains and data on conserved water molecules on the DNA-protein interface.
Phylogeny-dominant classification of J-proteins in Arabidopsis thaliana and Brassica oleracea.
Zhang, Bin; Qiu, Han-Lin; Qu, Dong-Hai; Ruan, Ying; Chen, Dong-Hong
2018-04-05
Hsp40s or DnaJ/J-proteins are evolutionarily conserved in all organisms as co-chaperones of molecular chaperone HSP70s that mainly participate in maintaining cellular protein homeostasis, such as protein folding, assembly, stabilization, and translocation under normal conditions as well as refolding and degradation under environmental stresses. It has been reported that Arabidopsis J-proteins are classified into four classes (types A-D) according to domain organization, but their phylogenetic relationships are unknown. Here, we identified 129 J-proteins in the globally popular vegetable Brassica oleracea, a close relative of the model plant Arabidopsis, and also revised the information on Arabidopsis J-proteins based on the latest online bioresources. According to phylogenetic analysis with domain organization and gene structure as references, the J-proteins from Arabidopsis and B. oleracea were classified into 15 main clades (I-XV) separated by a number of undefined small branches with remote relationships. Based on the number of members, the clades are designated multigene, oligo-gene, or mono-gene clades. The J-protein genes from different clades may function together or separately to constitute a complicated regulatory network. This study provides a constructive viewpoint for J-protein classification and an informative platform for further functional dissection and for the discovery of resistance genes relevant to the genetic improvement of crop plants.
HHsvm: fast and accurate classification of profile–profile matches identified by HHsearch
Dlakić, Mensur
2009-01-01
Motivation: Recently developed profile–profile methods rival structural comparisons in their ability to detect homology between distantly related proteins. Despite this tremendous progress, many genuine relationships between protein families cannot be recognized as comparisons of their profiles result in scores that are statistically insignificant. Results: Using known evolutionary relationships among protein superfamilies in SCOP database, support vector machines were trained on four sets of discriminatory features derived from the output of HHsearch. Upon validation, it was shown that the automatic classification of all profile–profile matches was superior to fixed threshold-based annotation in terms of sensitivity and specificity. The effectiveness of this approach was demonstrated by annotating several domains of unknown function from the Pfam database. Availability: Programs and scripts implementing the methods described in this manuscript are freely available from http://hhsvm.dlakiclab.org/. Contact: mdlakic@montana.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19773335
Complete fold annotation of the human proteome using a novel structural feature space.
Middleton, Sarah A; Illuminati, Joseph; Kim, Junhyong
2017-04-13
Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding and recognizing novel folds is difficult such that the majority of proteins have not been annotated for fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Our method can be applied to de novo fold prediction of entire proteomes and identify candidate novel fold families.
PACSY, a relational database management system for protein structure and chemical shift analysis
Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo
2012-01-01
PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.edu. PMID:22903636
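The idea of relational tables linked by key identification numbers can be illustrated with a minimal sketch. The table names, column names, and the "TEST" entry below are hypothetical placeholders rather than the actual PACSY schema, and Python's built-in sqlite3 module stands in for the MySQL or PostgreSQL server mentioned in the abstract.

```python
import sqlite3

# In-memory database used only for illustration; a PACSY installation
# would run on an RDBMS server such as MySQL or PostgreSQL.
conn = sqlite3.connect(":memory:")

# Two hypothetical tables linked by a shared key identification number.
conn.executescript(
    """
    CREATE TABLE coord (
        key_id   INTEGER PRIMARY KEY,
        entry_id TEXT,
        atom     TEXT,
        x REAL, y REAL, z REAL
    );
    CREATE TABLE shift (
        key_id INTEGER REFERENCES coord(key_id),
        ppm    REAL
    );
    """
)

# Made-up rows: two atoms of a placeholder entry with invented values.
conn.executemany(
    "INSERT INTO coord VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "TEST", "CA", 1.0, 2.0, 3.0), (2, "TEST", "CB", 1.5, 2.5, 3.5)],
)
conn.executemany("INSERT INTO shift VALUES (?, ?)", [(1, 56.2), (2, 30.1)])

# Combine coordinates and chemical shifts by joining on the key.
rows = conn.execute(
    """
    SELECT c.atom, c.x, s.ppm
    FROM coord AS c
    JOIN shift AS s ON s.key_id = c.key_id
    WHERE c.entry_id = ?
    ORDER BY c.key_id
    """,
    ("TEST",),
).fetchall()

for atom, x, ppm in rows:
    print(atom, x, ppm)
```

Joining on a shared key is what lets a single query combine information that originates from different source databases.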
Functional Classification of Immune Regulatory Proteins
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rubinstein, Rotem; Ramagopal, Udupi A.; Nathenson, Stanley G.
2013-05-01
Members of the immunoglobulin superfamily (IgSF) control innate and adaptive immunity and are prime targets for the treatment of autoimmune diseases, infectious diseases, and malignancies. We describe a computational method, termed the Brotherhood algorithm, which utilizes intermediate sequence information to classify proteins into functionally related families. This approach identifies functional relationships within the IgSF and predicts additional receptor-ligand interactions. As a specific example, we examine the nectin/nectin-like family of cell adhesion and signaling proteins and propose receptor-ligand interactions within this family. We were guided by the Brotherhood approach and present the high-resolution structural characterization of a homophilic interaction involving the class-I MHC-restricted T-cell-associated molecule, which we now classify as a nectin-like family member. The Brotherhood algorithm is likely to have a significant impact on structural immunology by identifying those proteins and complexes for which structural characterization will be particularly informative.
Complete fold annotation of the human proteome using a novel structural feature space
Middleton, Sarah A.; Illuminati, Joseph; Kim, Junhyong
2017-01-01
Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding and recognizing novel folds is difficult such that the majority of proteins have not been annotated for fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Our method can be applied to de novo fold prediction of entire proteomes and identify candidate novel fold families. PMID:28406174
Columba: an integrated database of proteins, structures, and annotations.
Trissl, Silke; Rother, Kristian; Müller, Heiko; Steinke, Thomas; Koch, Ina; Preissner, Robert; Frömmel, Cornelius; Leser, Ulf
2005-03-31
Structural and functional research often requires the computation of sets of protein structures based on certain properties of the proteins, such as sequence features, fold classification, or functional annotation. Compiling such sets using current web resources is tedious because the necessary data are spread over many different databases. To facilitate this task, we have created COLUMBA, an integrated database of annotations of protein structures. COLUMBA currently integrates twelve different databases, including PDB, KEGG, Swiss-Prot, CATH, SCOP, the Gene Ontology, and ENZYME. The database can be searched using either keyword search or data source-specific web forms. Users can thus quickly select and download PDB entries that, for instance, participate in a particular pathway, are classified as containing a certain CATH architecture, are annotated as having a certain molecular function in the Gene Ontology, and whose structures have a resolution under a defined threshold. The results of queries are provided in both machine-readable extensible markup language and human-readable format. The structures themselves can be viewed interactively on the web. The COLUMBA database facilitates the creation of protein structure data sets for many structure-based studies. It allows queries to be combined across a number of structure-related databases not covered by other projects at present. Thus, information on both many and few protein structures can be used efficiently. The web interface for COLUMBA is available at http://www.columba-db.de.
Geierhaas, Christian D; Salvatella, Xavier; Clarke, Jane; Vendruscolo, Michele
2008-03-01
It has been suggested that Phi-values, which allow structural information about transition states (TSs) for protein folding to be obtained, are most reliably interpreted when divided into three classes (high, medium and low). High Phi-values indicate almost completely folded regions in the TS, intermediate Phi-values regions with a detectable amount of structure and low Phi-values indicate mostly unstructured regions. To explore the extent to which this classification can be used to characterise in detail the structure of TSs for protein folding, we used Phi-values divided into these classes as restraints in molecular dynamics simulations. This type of procedure is related to that used in NMR spectroscopy to define the structure of native proteins from the measurement of inter-proton distances derived from nuclear Overhauser effects. We illustrate this approach by determining the TS ensembles of five proteins and by showing that the results are similar to those obtained by using as restraints the actual numerical Phi-values measured experimentally. Our results indicate that the simultaneous consideration of a set of low-resolution Phi-values can provide sufficient information for characterising the architecture of a TS for folding of a protein.
Geierhaas, Christian D.; Salvatella, Xavier; Clarke, Jane; Vendruscolo, Michele
2008-01-01
It has been suggested that Φ-values, which allow structural information about transition states (TSs) for protein folding to be obtained, are most reliably interpreted when divided into three classes (high, medium and low). High Φ-values indicate almost completely folded regions in the TS, intermediate Φ-values regions with a detectable amount of structure and low Φ-values indicate mostly unstructured regions. To explore the extent to which this classification can be used to characterise in detail the structure of TSs for protein folding, we used Φ-values divided into these classes as restraints in molecular dynamics simulations. This type of procedure is related to that used in NMR spectroscopy to define the structure of native proteins from the measurement of inter-proton distances derived from nuclear Overhauser effects. We illustrate this approach by determining the TS ensembles of five proteins and by showing that the results are similar to those obtained by using as restraints the actual numerical Φ-values measured experimentally. Our results indicate that the simultaneous consideration of a set of low-resolution Φ-values can provide sufficient information for characterising the architecture of a TS for folding of a protein. PMID:18299294
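The three-class reading of Φ-values can be sketched as a simple binning step. The abstract gives no numerical class boundaries, so the cutoffs 0.3 and 0.7 below are assumptions chosen only for illustration, as are the function names and example values.

```python
from typing import Dict

# Hypothetical class boundaries; the abstract does not state the cutoffs.
LOW_MAX = 0.3   # values below this are treated as "low"
HIGH_MIN = 0.7  # values at or above this are treated as "high"


def phi_class(phi: float) -> str:
    """Assign a Phi-value to the 'low', 'medium' or 'high' class."""
    if phi < LOW_MAX:
        return "low"
    if phi >= HIGH_MIN:
        return "high"
    return "medium"


def classify_residues(phi_by_residue: Dict[int, float]) -> Dict[int, str]:
    """Map residue number -> Phi-value class."""
    return {res: phi_class(phi) for res, phi in phi_by_residue.items()}


# Made-up Phi-values for three residues.
example = {12: 0.05, 27: 0.45, 41: 0.92}
classes = classify_residues(example)
print(classes)
```

In a restrained simulation, the class label rather than the measured number would then determine the target degree of native-like structure imposed on each residue.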
The Role of Protein Loops and Linkers in Conformational Dynamics and Allostery.
Papaleo, Elena; Saladino, Giorgio; Lambrughi, Matteo; Lindorff-Larsen, Kresten; Gervasio, Francesco Luigi; Nussinov, Ruth
2016-06-08
Proteins are dynamic entities that undergo a plethora of conformational changes that may take place on a wide range of time scales. These changes can be as small as the rotation of one or a few side-chain dihedral angles or involve concerted motions in larger portions of the three-dimensional structure; both kinds of motions can be important for biological function and allostery. It is becoming increasingly evident that "connector regions" are important components of the dynamic personality of protein structures. These regions may be either disordered loops, i.e., poorly structured regions connecting secondary structural elements, or linkers that connect entire protein domains. Experimental and computational studies have, however, revealed that these regions are not mere connectors, and their role in allostery and conformational changes has been emerging in the last few decades. Here we provide a detailed overview of the structural properties and classification of loops and linkers, as well as a discussion of the main computational methods employed to investigate their function and dynamical properties. We also describe their importance for protein dynamics and allostery using as examples key proteins in cellular biology and human diseases such as kinases, ubiquitinating enzymes, and transcription factors.
A Feature and Algorithm Selection Method for Improving the Prediction of Protein Structural Class.
Ni, Qianwu; Chen, Lei
2017-01-01
Correct prediction of protein structural class is beneficial to the investigation of protein functions, regulation and interactions. In recent years, several computational methods have been proposed in this regard. However, given the variety of available features, selecting a proper classification algorithm and extracting the essential features for classification remain a great challenge. In this study, a feature and algorithm selection method was presented for improving the accuracy of protein structural class prediction. Amino acid compositions and physicochemical features were adopted to represent proteins, and thirty-eight machine learning algorithms collected in Weka were employed. All features were first analyzed by a feature selection method, minimum redundancy maximum relevance (mRMR), producing a feature list. Then, several feature sets were constructed by adding features in the list one by one. For each feature set, the thirty-eight algorithms were executed on a dataset in which proteins were represented by the features in the set. The predicted classes yielded by these algorithms and the true class of each protein were collected to construct a dataset, which was analyzed by the mRMR method, yielding an algorithm list. From the algorithm list, algorithms were taken one by one to build ensemble prediction models. Finally, we selected the ensemble prediction model with the best performance as the optimal ensemble prediction model. Experimental results indicate that the constructed model is much superior to models using a single algorithm and to other models that adopt only the feature selection procedure or only the algorithm selection procedure. Both procedures help to build an ensemble prediction model that yields better performance.
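The step of building candidate sets by adding ranked items one by one, then keeping the best-scoring set, can be sketched as follows. This is a minimal illustration that assumes the ranked list (for example from mRMR) has already been computed. The function names, feature names, and toy scoring function are hypothetical stand-ins for a real cross-validated accuracy estimate.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_selection(
    ranked: Sequence[str],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Evaluate the nested sets ranked[:1], ranked[:2], ... and keep the best."""
    best_set: List[str] = []
    best_score = float("-inf")
    for k in range(1, len(ranked) + 1):
        candidate = list(ranked[:k])
        s = score(candidate)
        if s > best_score:  # strict '>' keeps the smallest set on ties
            best_set, best_score = candidate, s
    return best_set, best_score


def toy_score(features: List[str]) -> float:
    """Toy stand-in for model accuracy: reward useful features, penalize size."""
    useful = {"f1", "f2"}
    return sum(1.0 for f in features if f in useful) - 0.1 * len(features)


# Hypothetical ranked feature list, most relevant first.
best, value = incremental_selection(["f1", "f2", "f3", "f4"], toy_score)
print(best, value)
```

The same loop applies to the algorithm list: replacing feature names with algorithm names and the scorer with the performance of the resulting ensemble gives the second selection stage described in the abstract.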
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zheng, Meiying; Cooper, David R.; Grossoehme, Nickolas E.
2009-04-01
The GntR superfamily of dimeric transcription factors, with more than 6200 members encoded in bacterial genomes, are characterized by N-terminal winged-helix DNA-binding domains and diverse C-terminal regulatory domains which provide a basis for the classification of the constituent families. The largest of these families, FadR, contains nearly 3000 proteins with all-α-helical regulatory domains classified into two related Pfam families: FadR_C and FCD. Only two crystal structures of FadR-family members, those of Escherichia coli FadR protein and LldR from Corynebacterium glutamicum, have been described to date in the literature. Here, the crystal structure of TM0439, a GntR regulator with an FCD domain found in the Thermotoga maritima genome, is described. The FCD domain is similar to that of the LldR regulator and contains a buried metal-binding site. Using atomic absorption spectroscopy and Trp fluorescence, it is shown that the recombinant protein contains bound Ni2+ ions but that it is able to bind Zn2+ with Kd < 70 nM. It is concluded that Zn2+ is the likely physiological metal and that it may perform either structural or regulatory roles or both. Finally, the TM0439 structure is compared with two other FadR-family structures recently deposited by structural genomics consortia. The results call for a revision in the classification of the FadR family of transcription factors.
Shin, Jae-Min; Cho, Doo-Ho
2005-01-01
PDB-Ligand (http://www.idrtech.com/PDB-Ligand/) is a three-dimensional structure database of small molecular ligands that are bound to larger biomolecules deposited in the Protein Data Bank (PDB). It is also a database tool that allows one to browse, classify, superimpose and visualize these structures. As of May 2004, there are about 4870 types of small molecular ligands, experimentally determined as a complex with protein or DNA in the PDB. The proteins that a given ligand binds are often homologous and present the same binding structure to the ligand. However, there are also many instances wherein a given ligand binds to two or more unrelated proteins, or to the same or homologous protein in different binding environments. PDB-Ligand serves as an interactive structural analysis and clustering tool for all the ligand-binding structures in the PDB. PDB-Ligand also provides an easier way to obtain a number of different structure alignments of many related ligand-binding structures based on a simple and flexible ligand clustering method. PDB-Ligand will be a good resource for both a better interpretation of ligand-binding structures and the development of better scoring functions to be used in many drug discovery applications.
Transporter taxonomy - a comparison of different transport protein classification schemes.
Viereck, Michael; Gaulton, Anna; Digles, Daniela; Ecker, Gerhard F
2014-06-01
Currently, there are more than 800 well characterized human membrane transport proteins (including channels and transporters) and there are estimates that about 10% (approx. 2000) of all human genes are related to transport. Membrane transport proteins are of interest as potential drug targets, for drug delivery, and as a cause of side effects and drug–drug interactions. In light of the development of Open PHACTS, which provides an open pharmacological space, we analyzed selected membrane transport protein classification schemes (Transporter Classification Database, ChEMBL, IUPHAR/BPS Guide to Pharmacology, and Gene Ontology) for their ability to serve as a basis for pharmacology driven protein classification. A comparison of these membrane transport protein classification schemes by using a set of clinically relevant transporters as use-case reveals the strengths and weaknesses of the different taxonomy approaches.
Computational approaches for the classification of seed storage proteins.
Radhika, V; Rao, V Sree Hari
2015-07-01
Seed storage proteins comprise a major part of the protein content of the seed and have an important role in the quality of the seed. These storage proteins are important because they determine the total protein content and have an effect on the nutritional quality and functional properties for food processing. Transgenic plants are being used to develop improved lines for incorporation into plant breeding programs and the nutrient composition of seeds is a major target of molecular breeding programs. Hence, classification of these proteins is crucial for the development of superior varieties with improved nutritional quality. In this study we have applied machine learning algorithms for classification of seed storage proteins. We have presented an algorithm based on a nearest neighbor approach for classification of seed storage proteins and compared its performance with the decision tree J48, a multilayer perceptron (MLP) neural network and the support vector machine (SVM) implementation libSVM. The model based on our algorithm has been able to give higher classification accuracy in comparison to the other methods.
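As a rough illustration of a nearest neighbor approach to sequence classification, the sketch below labels a query with the class of the closest training sequence. The abstract does not describe the features or distance used, so amino acid composition and Euclidean distance are assumptions made here, and the sequences and class names are invented.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the 20-dimensional amino acid composition (fractions) of a sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]

def nearest_neighbor(query, training):
    """Label the query with the class of the closest training sequence."""
    q = composition(query)
    label, _ = min(
        ((lab, math.dist(q, composition(seq))) for seq, lab in training),
        key=lambda pair: pair[1],
    )
    return label

# Toy sequences and class names, invented for illustration only.
training = [
    ("QQQPQQPFPQQ", "class_A"),
    ("KKEELLKKAEE", "class_B"),
]
print(nearest_neighbor("PQQQPFQQ", training))
```

The query is rich in Q and P like the first training sequence, so it receives that sequence's label.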
Gold, Nicola D; Jackson, Richard M
2006-02-03
The rapid growth in protein structural data and the emergence of structural genomics projects have increased the need for automatic structure analysis and tools for function prediction. Small molecule recognition is critical to the function of many proteins; therefore, determination of ligand binding site similarity is important for understanding ligand interactions and may allow their functional classification. Here, we present a binding sites database (SitesBase) that, given a known protein-ligand binding site, allows rapid retrieval of other binding sites with similar structure independent of overall sequence or fold similarity. However, each match is also annotated with sequence similarity and fold information to aid interpretation of structure and functional similarity. Similarity in ligand binding sites can indicate common binding modes and recognition of similar molecules, allowing potential inference of function for an uncharacterised protein or providing additional evidence of common function where sequence or fold similarity is already known. Alternatively, the resource can provide valuable information for detailed studies of molecular recognition including structure-based ligand design and in understanding ligand cross-reactivity. Here, we show examples of atomic similarity between superfamily or more distant fold relatives as well as between seemingly unrelated proteins. Assignment of unclassified proteins to structural superfamilies is also undertaken and in most cases substantiates assignments made using sequence similarity. Correct assignment is also possible where sequence similarity fails to find significant matches, illustrating the potential use of binding site comparisons for newly determined proteins.
Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi
2014-01-01
Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Amongst most homology-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets.
Protein Kinase Classification with 2866 Hidden Markov Models and One Support Vector Machine
NASA Technical Reports Server (NTRS)
Weber, Ryan; New, Michael H.; Fonda, Mark (Technical Monitor)
2002-01-01
The main application considered in this paper is distinguishing true kinases from randomly permuted kinases that share the same length and amino acid distributions as the true kinases. Numerous methods already exist for this classification task, such as HMMs, motif-matchers, and sequence comparison algorithms. We build on some of these efforts by creating a vector from the output of thousands of structurally based HMMs, created offline with Pfam-A seed alignments using SAM-T99, which then must be combined into an overall classification for the protein. Then we use a Support Vector Machine for classifying this large ensemble Pfam-Vector, with polynomial and chi-squared kernels. In particular, the chi-squared kernel SVM performs better in some respects than the HMMs and the BLAST pairwise comparisons when predicting true from false kinases, but no one algorithm is best for all purposes or in all instances, so we consider the particular strengths and weaknesses of each.
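For readers unfamiliar with chi-squared kernels, the sketch below shows one common exponential form for non-negative feature vectors, k(x, y) = exp(-gamma * sum((x_i - y_i)^2 / (x_i + y_i))). The abstract does not state which variant or parameters the authors used, so the formula choice, the `gamma` parameter and the function name are assumptions.

```python
import math

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel for non-negative feature vectors.

    Assumed form: exp(-gamma * sum((a - b)^2 / (a + b))).
    """
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip coordinates where both entries are zero
            total += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * total)

# Identical vectors give similarity 1; disjoint ones decay towards 0.
print(chi2_kernel([1.0, 2.0], [1.0, 2.0]))
print(chi2_kernel([1.0, 0.0], [0.0, 1.0]))
```

A kernel function like this would be evaluated on every pair of HMM-score vectors to build the Gram matrix an SVM trains on, which presupposes that the scores have been mapped to non-negative values.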
A periodic table of coiled-coil protein structures.
Moutevelis, Efrosini; Woolfson, Derek N
2009-01-23
Coiled coils are protein structure domains with two or more alpha-helices packed together via interlacing of side chains known as knob-into-hole packing. We analysed and classified a large set of coiled-coil structures using a combination of automated and manual methods. This led to a systematic classification that we termed a "periodic table of coiled coils," which we have made available at http://coiledcoils.chm.bris.ac.uk/ccplus/search/periodic_table. In this table, coiled-coil assemblies are arranged in columns with increasing numbers of alpha-helices and in rows of increased complexity. The table provides a framework for understanding possibilities in and limits on coiled-coil structures and a basis for future prediction, engineering and design studies.
Insights into animal and plant lectins with antimicrobial activities.
Dias, Renata de Oliveira; Machado, Leandro Dos Santos; Migliolo, Ludovico; Franco, Octavio Luiz
2015-01-05
Lectins are multivalent proteins with the ability to recognize and bind diverse carbohydrate structures. The glyco-binding and diverse molecular structures observed in these protein classes make them a large and heterogeneous group with a wide range of biological activities in microorganisms, animals and plants. Lectins from plants and animals are commonly used in direct defense against pathogens and in immune regulation. This review focuses on sources of animal and plant lectins, describing their functional classification and tridimensional structures, relating these properties with biotechnological purposes, including antimicrobial activities. In summary, this work focuses on structural-functional elucidation of diverse lectin groups, shedding some light on host-pathogen interactions; it also examines their emergence as biotechnological tools through gene manipulation and development of new drugs.
Jaspard, Emmanuel; Macherel, David; Hunault, Gilles
2012-01-01
Late Embryogenesis Abundant Proteins (LEAPs) are ubiquitous proteins expected to play major roles in desiccation tolerance. Little is known about their structure-function relationships because of the scarcity of 3-D structures for LEAPs. The previous building of LEAPdb, a database dedicated to LEAPs from plants and other organisms, led to the classification of 710 LEAPs into 12 non-overlapping classes with distinct properties. Using this resource, numerous physico-chemical properties of LEAPs and amino acid usage by LEAPs have been computed and statistically analyzed, revealing distinctive features for each class. This unprecedented analysis allowed a rigorous characterization of the 12 LEAP classes, which differed also in multiple structural and physico-chemical features. Although most LEAPs can be predicted as intrinsically disordered proteins, the analysis indicates that LEAP class 7 (PF03168) and probably LEAP class 11 (PF04927) are natively folded proteins. This study thus provides a detailed description of the structural properties of this protein family, opening the path toward further LEAP structure-function analysis. Finally, since each LEAP class can be clearly characterized by a unique set of physico-chemical properties, this will allow development of software to predict proteins as LEAPs. PMID:22615859
Mapping flexible protein domains at subnanometer resolution with the atomic force microscope.
Müller, D J; Fotiadis, D; Engel, A
1998-06-23
The mapping of flexible protein domains with the atomic force microscope is reviewed. Examples discussed are the bacteriorhodopsin from Halobacterium salinarum, the head-tail-connector from phage phi29, and the hexagonally packed intermediate layer from Deinococcus radiodurans which all were recorded in physiological buffer solution. All three proteins undergo reversible structural changes that are reflected in standard deviation maps calculated from aligned topographs of individual protein complexes. Depending on the lateral resolution (up to 0.8 nm) flexible surface regions can ultimately be correlated with individual polypeptide loops. In addition, multivariate statistical classification revealed the major conformations of the protein surface.
A simple atomic-level hydrophobicity scale reveals protein interfacial structure.
Kapcha, Lauren H; Rossky, Peter J
2014-01-23
Many amino acid residue hydrophobicity scales have been created in an effort to better understand and rapidly characterize water-protein interactions based only on protein structure and sequence. There is surprisingly low consistency in the ranking of residue hydrophobicity between scales, and their ability to provide insightful characterization varies substantially across subject proteins. All current scales characterize hydrophobicity based on entire amino acid residue units. We introduce a simple binary but atomic-level hydrophobicity scale that allows for the classification of polar and non-polar moieties within single residues, including backbone atoms. This simple scale is first shown to capture the anticipated hydrophobic character for those whole residues that align in classification among most scales. Examination of a set of protein binding interfaces establishes good agreement between residue-based and atomic-level descriptions of hydrophobicity for five residues, while the remaining residues produce discrepancies. We then show that the atomistic scale properly classifies the hydrophobicity of functionally important regions where residue-based scales fail. To illustrate the utility of the new approach, we show that the atomic-level scale rationalizes the hydration of two hydrophobic pockets and the presence of a void in a third pocket within a single protein and that it appropriately classifies all of the functionally important hydrophilic sites within two otherwise hydrophobic pores. We suggest that an atomic level of detail is, in general, necessary for the reliable depiction of hydrophobicity for all protein surfaces. The present formulation can be implemented simply in a manner no more complex than current residue-based approaches.
2012-01-01
Background: The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. Results: Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional-specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related, a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day. Conclusions: This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups. PMID:22726767
NASA Astrophysics Data System (ADS)
Jain, Sankalp; Grandits, Melanie; Richter, Lars; Ecker, Gerhard F.
2017-06-01
The bile salt export pump (BSEP) actively transports conjugated monovalent bile acids from the hepatocytes into the bile. This facilitates the formation of micelles and promotes digestion and absorption of dietary fat. Inhibition of BSEP leads to decreased bile flow and accumulation of cytotoxic bile salts in the liver. A number of compounds have been identified to interact with BSEP, which results in drug-induced cholestasis or liver injury. Therefore, in silico approaches for flagging compounds as potential BSEP inhibitors would be of high value in the early stage of the drug discovery pipeline. Up to now, due to the lack of a high-resolution X-ray structure of BSEP, in silico based identification of BSEP inhibitors focused on ligand-based approaches. In this study, we provide a homology model for BSEP, developed using the corrected mouse P-glycoprotein structure (PDB ID: 4M1M). Subsequently, the model was used for docking-based classification of a set of 1212 compounds (405 BSEP inhibitors, 807 non-inhibitors). Using the scoring function ChemScore, a prediction accuracy of 81% on the training set and 73% on two external test sets could be obtained. In addition, the applicability domain of the models was assessed based on Euclidean distance. Further, analysis of the protein-ligand interaction fingerprints revealed certain functional group-amino acid residue interactions that could play a key role for ligand binding. Though ligand-based models, due to their high speed and accuracy, remain the method of choice for classification of BSEP inhibitors, structure-assisted docking models demonstrate reasonably good prediction accuracies while additionally providing information about putative protein-ligand interactions.
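The two evaluation ideas mentioned above, prediction accuracy and an applicability domain based on Euclidean distance, can be sketched in a few lines. The abstract does not give the exact domain rule, so the nearest-training-compound threshold used here is an assumption, and the function names, descriptor vectors and threshold values are hypothetical.

```python
import math

def accuracy(true_labels, predicted_labels):
    """Fraction of compounds whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def in_applicability_domain(query, training_descriptors, threshold):
    """Assumed rule: a compound is inside the domain if its Euclidean distance
    to the nearest training compound does not exceed the threshold."""
    nearest = min(math.dist(query, t) for t in training_descriptors)
    return nearest <= threshold

# Toy labels (1 = inhibitor, 0 = non-inhibitor) and descriptors for illustration.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(in_applicability_domain([0.0, 0.0], [[3.0, 4.0], [6.0, 8.0]], threshold=5.5))
```

Predictions for compounds that fall outside the domain would then be flagged as less reliable rather than discarded.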
Hidden relationships between metalloproteins unveiled by structural comparison of their metal sites
NASA Astrophysics Data System (ADS)
Valasatava, Yana; Andreini, Claudia; Rosato, Antonio
2015-03-01
Metalloproteins account for a substantial fraction of all proteins. They incorporate metal atoms, which are required for their structure and/or function. Here we describe a new computational protocol to systematically compare and classify metal-binding sites on the basis of their structural similarity. These sites are extracted from the MetalPDB database of minimal functional sites (MFSs) in metal-binding biological macromolecules. Structural similarity is measured by the scoring function of the available MetalS2 program. Hierarchical clustering was used to organize MFSs into clusters, for each of which a representative MFS was identified. The comparison of all representative MFSs provided a thorough structure-based classification of the sites analyzed. As examples, the application of the proposed computational protocol to all heme-binding proteins and zinc-binding proteins of known structure highlighted the existence of structural subtypes, validated known evolutionary links and shed new light on the occurrence of similar sites in systems at different evolutionary distances. The present approach thus makes available an innovative viewpoint on metalloproteins, where the functionally crucial metal sites effectively lead the discovery of structural and functional relationships in a largely protein-independent manner.
NMRDSP: an accurate prediction of protein shape strings from NMR chemical shifts and sequence data.
Mao, Wusong; Cong, Peisheng; Wang, Zhiheng; Lu, Longjian; Zhu, Zhongliang; Li, Tonghua
2013-01-01
Shape string is structural sequence and is an extremely important structure representation of protein backbone conformations. Nuclear magnetic resonance chemical shifts give a strong correlation with the local protein structure, and are exploited to predict protein structures in conjunction with computational approaches. Here we demonstrate a novel approach, NMRDSP, which can accurately predict the protein shape string based on nuclear magnetic resonance chemical shifts and structural profiles obtained from sequence data. The NMRDSP uses six chemical shifts (HA, H, N, CA, CB and C) and eight elements of structure profiles as features, a non-redundant set (1,003 entries) as the training set, and a conditional random field as a classification algorithm. For an independent testing set (203 entries), we achieved an accuracy of 75.8% for S8 (the eight states accuracy) and 87.8% for S3 (the three states accuracy). This is higher than only using chemical shifts or sequence data, and confirms that the chemical shift and the structure profile are significant features for shape string prediction and their combination prominently improves the accuracy of the predictor. We have constructed the NMRDSP web server and believe it could be employed to provide a solid platform to predict other protein structures and functions. The NMRDSP web server is freely available at http://cal.tongji.edu.cn/NMRDSP/index.jsp.
Transmembrane proteins in the Protein Data Bank: identification and classification.
Tusnády, Gábor E; Dosztányi, Zsuzsanna; Simon, István
2004-11-22
Integral membrane proteins play important roles in living cells. Although these proteins are estimated to constitute 25% of proteins at a genomic scale, the Protein Data Bank (PDB) contains only a few hundred membrane proteins due to the difficulties with experimental techniques. The presence of transmembrane proteins in the structure data bank, however, is quite invisible, as the annotation of these entries is rather poor. Even if a protein is identified as a transmembrane one, the possible location of the lipid bilayer is not indicated in the PDB because these proteins are crystallized without their natural lipid bilayer, and currently no method is publicly available to detect the possible membrane plane using the atomic coordinates of membrane proteins. Here, we present a new geometrical approach to distinguish between transmembrane and globular proteins using structural information only and to locate the most likely position of the lipid bilayer. An automated algorithm (TMDET) is given to determine the membrane planes relative to the position of atomic coordinates, together with a discrimination function which is able to separate transmembrane and globular proteins even in cases of low resolution or incomplete structures such as fragments or parts of large multi chain complexes. This method can be used for the proper annotation of protein structures containing transmembrane segments and paves the way to an up-to-date database containing the structure of all known transmembrane proteins and fragments (PDB_TM) which can be automatically updated. The algorithm is equally important for the purpose of constructing databases purely of globular proteins.
Micsonai, András; Wien, Frank; Bulyáki, Éva; Kun, Judit; Moussong, Éva; Lee, Young-Ho; Goto, Yuji; Réfrégiers, Matthieu; Kardos, József
2018-06-11
Circular dichroism (CD) spectroscopy is a widely used method to study the protein secondary structure. However, for decades, the general opinion was that the correct estimation of β-sheet content is challenging because of the large spectral and structural diversity of β-sheets. Recently, we showed that the orientation and twisting of β-sheets account for the observed spectral diversity, and developed a new method to estimate accurately the secondary structure (PNAS, 112, E3095). BeStSel web server provides the Beta Structure Selection method to analyze the CD spectra recorded by conventional or synchrotron radiation CD equipment. Both normalized and measured data can be uploaded to the server either as a single spectrum or series of spectra. The originality of BeStSel is that it carries out a detailed secondary structure analysis providing information on eight secondary structure components including parallel-β structure and antiparallel β-sheets with three different groups of twist. Based on these, it predicts the protein fold down to the topology/homology level of the CATH protein fold classification. The server also provides a module to analyze the structures deposited in the PDB for BeStSel secondary structure contents in relation to Dictionary of Secondary Structure of Proteins data. The BeStSel server is freely accessible at http://bestsel.elte.hu.
Protein classification based on text document classification techniques.
Cheng, Betty Yee Man; Carbonell, Jaime G; Klein-Seetharaman, Judith
2005-03-01
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively.
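Two quantities in this abstract are easy to make concrete: the n-gram counts used as features, and the reduction in residual error. The sketch below is illustrative only; the function names and the toy sequence are hypothetical, while the accuracy figures are those quoted in the abstract.

```python
from collections import Counter

def ngram_counts(sequence, n):
    """Count overlapping n-grams (short peptides of length n) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def residual_error_reduction(baseline_acc, new_acc):
    """Percentage reduction in error (100 - accuracy) relative to a baseline."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

print(ngram_counts("MKMKV", 2))                        # MK twice, KM and KV once
print(round(residual_error_reduction(88.4, 93.0), 1))  # level I: 39.7
print(round(residual_error_reduction(86.3, 92.4), 1))  # level II: 44.5
```

For level I, the SVM error of 11.6% drops to 7.0% with Naive Bayes, and (11.6 - 7.0) / 11.6 is about 39.7%, matching the figure reported above.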
Circuit topology of proteins and nucleic acids.
Mashaghi, Alireza; van Wijk, Roeland J; Tans, Sander J
2014-09-02
Folded biomolecules display a bewildering structural complexity and diversity. They have therefore been analyzed in terms of generic topological features. For instance, folded proteins may be knotted, have beta-strands arranged into a Greek-key motif, or display high contact order. In this perspective, we present a method to formally describe the topology of all folded linear chains and hence provide a general classification and analysis framework for a range of biomolecules. Moreover, by identifying the fundamental rules that intrachain contacts must obey, the method establishes the topological constraints of folded linear chains. We also briefly illustrate how this circuit topology notion can be applied to study the equivalence of folded chains, the engineering of artificial RNA structures and DNA origami, the topological structure of genomes, and the role of topology in protein folding.
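One way to picture a contact-based description of chain topology is to classify every pair of intrachain contacts by how their positions interleave along the chain. The sketch below does this; note that the relation names (series, parallel, cross) and the handling of shared sites are not given in the abstract and are assumptions based on how circuit topology is commonly presented, and the function name is hypothetical.

```python
def contact_relation(c1, c2):
    """Classify two intrachain contacts, each given as a pair of chain positions.

    Assumed relations for contacts (i, j) and (k, l) with i < k:
      series   : i < j < k < l  (one contact ends before the other begins)
      parallel : i < k < l < j  (one contact is nested inside the other)
      cross    : i < k < j < l  (the contacts interleave)
    """
    (i, j), (k, l) = sorted([tuple(sorted(c1)), tuple(sorted(c2))])
    if len({i, j, k, l}) < 4:
        return "shared site"  # contacts sharing a position are not treated here
    if j < k:
        return "series"
    if l < j:
        return "parallel"
    return "cross"

print(contact_relation((1, 5), (6, 10)))   # series
print(contact_relation((1, 10), (3, 7)))   # parallel
print(contact_relation((1, 6), (3, 10)))   # cross
```

Applying such a function to all contact pairs of a folded chain yields a relation matrix that can be compared between molecules regardless of their chemistry.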
Abriata, Luciano A; Kinch, Lisa N; Tamò, Giorgio E; Monastyrskyy, Bohdan; Kryshtafovych, Andriy; Dal Peraro, Matteo
2018-03-01
For assessment purposes, CASP targets are split into evaluation units. We herein present the official definition of CASP12 evaluation units (EUs) and their classification into difficulty categories. Each target can be evaluated as one EU (the whole target) or/and several EUs (separate structural domains or groups of structural domains). The specific scenario for a target split is determined by the domain organization of available templates, the difference in server performance on separate domains versus combination of the domains, and visual inspection. In the end, 71 targets were split into 96 EUs. Classification of the EUs into difficulty categories was done semi-automatically with the assistance of metrics provided by the Prediction Center. These metrics account for sequence and structural similarities of the EUs to potential structural templates from the Protein Data Bank, and for the baseline performance of automated server predictions. The metrics readily separate the 96 EUs into 38 EUs that should be straightforward for template-based modeling (TBM) and 39 that are expected to be hard for homology modeling and are thus left for free modeling (FM). The remaining 19 borderline evaluation units were dubbed FM/TBM, and were inspected case by case. The article also overviews structural and evolutionary features of selected targets relevant to our accompanying article presenting the assessment of FM and FM/TBM predictions, and overviews structural features of the hardest evaluation units from the FM category. We finally suggest improvements for the EU definition and classification procedures.
Probabilistic grammatical model for helix‐helix contact site classification
2013-01-01
Background: Hidden Markov Models power many state‐of‐the‐art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not convey directly information on medium‐ and long‐range residue‐residue interactions. This requires an expressive power of at least context‐free grammars. However, application of more powerful grammar formalisms to protein analysis has been surprisingly limited. Results: In this work, we present a probabilistic grammatical framework for problem‐specific protein languages and apply it to classification of transmembrane helix‐helix pairs configurations. The core of the model consists of a probabilistic context‐free grammar, automatically inferred by a genetic algorithm from only a generic set of expert‐based rules and positive training samples. The model was applied to produce sequence based descriptors of four classes of transmembrane helix‐helix contact site configurations. The highest performance of the classifiers reached an AUC ROC of 0.70. The analysis of grammar parse trees revealed the ability of representing structural features of helix‐helix contact sites. Conclusions: We demonstrated that our probabilistic context‐free framework for analysis of protein sequences outperforms the state of the art in the task of helix‐helix contact site classification. However, this is achieved without necessarily requiring modeling long range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human‐readable. Thus they could provide biologically meaningful information for molecular biologists. PMID:24350601
Lamin-like analogues in plants: the characterization of NMCP1 in Allium cepa
Moreno Díaz de la Espina, Susana
2013-01-01
The nucleoskeleton of plants contains a peripheral lamina (also called plamina) and, even though lamins are absent in plants, their roles are still fulfilled in plant nuclei. One of the most intriguing topics in plant biology concerns the identity of lamin protein analogues in plants. Good candidates to play lamin functions in plants are the members of the NMCP (nuclear matrix constituent protein) family, which exhibit the typical tripartite structure of lamins. This paper describes a bioinformatics analysis and classification of the NMCP family based on phylogenetic relationships, sequence similarity and the distribution of conserved regions in 76 homologues. In addition, NMCP1 in the monocot Allium cepa was characterized in terms of its sequence and structure, biochemical properties, subnuclear distribution and alterations in its expression throughout the root. The results demonstrate that these proteins exhibit many similarities to lamins (structural organization, conserved regions, subnuclear distribution, and solubility) and that they may fulfil the functions of lamins in plants. These findings significantly advance understanding of the structural proteins of the plant lamina and nucleoskeleton and provide a basis for further investigation of the protein networks forming these structures. PMID:23378381
Lamin-like analogues in plants: the characterization of NMCP1 in Allium cepa.
Ciska, Malgorzata; Masuda, Kiyoshi; Moreno Díaz de la Espina, Susana
2013-04-01
The nucleoskeleton of plants contains a peripheral lamina (also called plamina) and, even though lamins are absent in plants, their roles are still fulfilled in plant nuclei. One of the most intriguing topics in plant biology concerns the identity of lamin protein analogues in plants. Good candidates to play lamin functions in plants are the members of the NMCP (nuclear matrix constituent protein) family, which exhibit the typical tripartite structure of lamins. This paper describes a bioinformatics analysis and classification of the NMCP family based on phylogenetic relationships, sequence similarity and the distribution of conserved regions in 76 homologues. In addition, NMCP1 in the monocot Allium cepa was characterized by its sequence and structure, biochemical properties, and subnuclear distribution, and alterations in its expression throughout the root were identified. The results demonstrate that these proteins exhibit many similarities to lamins (structural organization, conserved regions, subnuclear distribution, and solubility) and that they may fulfil the functions of lamins in plants. These findings significantly advance understanding of the structural proteins of the plant lamina and nucleoskeleton and provide a basis for further investigation of the protein networks forming these structures.
Atomic analysis of protein-protein interfaces with known inhibitors: the 2P2I database.
Bourgeas, Raphaël; Basse, Marie-Jeanne; Morelli, Xavier; Roche, Philippe
2010-03-09
In the last decade, the inhibition of protein-protein interactions (PPIs) has emerged from both academic and private research as a new way to modulate the activity of proteins. Inhibitors of these original interactions are certainly the next generation of highly innovative drugs that will reach the market in the next decade. However, in silico design of such compounds still remains challenging. Here we describe this particular PPI chemical space through the presentation of 2P2I(DB), a hand-curated database dedicated to the structure of PPIs with known inhibitors. We have analyzed protein/protein and protein/inhibitor interfaces in terms of geometrical parameters, atom and residue properties, buried accessible surface area and other biophysical parameters. The interfaces found in 2P2I(DB) were then compared to those of representative datasets of heterodimeric complexes. We propose a new classification of PPIs with known inhibitors into two classes depending on the number of segments present at the interface and corresponding to either a single secondary structure element or to a more globular interacting domain. 2P2I(DB) complexes share global shape properties with standard transient heterodimer complexes, but their accessible surface areas are significantly smaller. No major conformational changes are seen between the different states of the proteins. The interfaces are more hydrophobic than general PPI's interfaces, with less charged residues and more non-polar atoms. Finally, fifty percent of the complexes in the 2P2I(DB) dataset possess more hydrogen bonds than typical protein-protein complexes. Potential areas of study for the future are proposed, which include a new classification system consisting of specific families and the identification of PPI targets with high druggability potential based on key descriptors of the interaction. 
2P2I database stores structural information about PPIs with known inhibitors and provides a useful tool for biologists to assess the potential druggability of their interfaces. The database can be accessed at http://2p2idb.cnrs-mrs.fr.
Sudha, Govindarajan; Srinivasan, Narayanaswamy
2016-09-01
A comprehensive analysis of the quaternary features of distantly related homo-oligomeric proteins is the focus of the current study. This study has been performed at the levels of quaternary state, symmetry, and quaternary structure. Quaternary state and quaternary structure refer to the number of subunits and spatial arrangements of subunits, respectively. Using a large dataset of available 3D structures of biologically relevant assemblies, we show that only 53% of the distantly related homo-oligomeric proteins have the same quaternary state. Considering these homologous homo-oligomers with the same quaternary state, conservation of quaternary structures is observed only in 38% of the pairs. In 36% of the pairs of distantly related homo-oligomers with different quaternary states, the larger assembly in a pair shows high structural similarity with the entire quaternary structure of the related protein with lower quaternary state; this is referred to as the "Russian doll effect." The differences in quaternary state and structure have been suggested to contribute to the functional diversity. Detailed investigations show that even though the gross functions of many distantly related homo-oligomers are the same, finer level differences in molecular functions are manifested by differences in quaternary states and structures. Comparison of structures of biological assemblies in distantly and closely related homo-oligomeric proteins throughout the study differentiates the effects of sequence divergence on the quaternary structures and function. Knowledge inferred from this study can provide insights for improved protein structure classification and function prediction of homo-oligomers. Proteins 2016; 84:1190-1202. © 2016 Wiley Periodicals, Inc.
Multiscale Persistent Functions for Biomolecular Structure Characterization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xia, Kelin; Li, Zhiming; Mu, Lin
In this paper, we introduce multiscale persistent functions for biomolecular structure characterization. The essential idea is to combine our multiscale rigidity functions (MRFs) with persistent homology analysis, so as to construct a series of multiscale persistent functions, particularly multiscale persistent entropies, for structure characterization. To clarify the fundamental idea of our method, the multiscale persistent entropy (MPE) model is discussed in great detail. Mathematically, unlike the previous persistent entropy (Chintakunta et al. in Pattern Recognit 48(2):391–401, 2015; Merelli et al. in Entropy 17(10):6872–6892, 2015; Rucco et al. in: Proceedings of ECCS 2014, Springer, pp 117–128, 2016), a special resolution parameter is incorporated into our model. Various scales can be achieved by tuning its value. Physically, our MPE can be used in conformational entropy evaluation. More specifically, it is found that our method incorporates in it a natural classification scheme. This is achieved through a density filtration of an MRF built from angular distributions. To further validate our model, a systematic comparison with the traditional entropy evaluation model is done. Additionally, it is found that our model is able to preserve the intrinsic topological features of biomolecular data much better than traditional approaches, particularly for resolutions in the intermediate range. Moreover, by comparing with traditional entropies from various grid sizes, bond angle-based methods and a persistent homology-based support vector machine method (Cang et al. in Mol Based Math Biol 3:140–162, 2015), we find that our MPE method gives the best results in terms of average true positive rate in a classic protein structure classification test. More interestingly, all-alpha and all-beta protein classes can be clearly separated from each other with zero error only in our model.
Finally, a special protein structure index (PSI) is proposed, for the first time, to describe the “regularity” of protein structures. Basically, a protein structure is deemed as regular if it has a consistent and orderly configuration. Our PSI model is tested on a database of 110 proteins; we find that structures with larger portions of loops and intrinsically disordered regions are always associated with larger PSI, meaning an irregular configuration, while proteins with larger portions of secondary structures, i.e., alpha-helix or beta-sheet, have smaller PSI. Essentially, PSI can be used to describe the “regularity” information in any system.
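As a point of reference for the abstract above, the earlier (single-scale) persistent entropy that it contrasts with can be sketched in a few lines. This is a minimal illustration, not the multiscale MPE model of the paper: it assumes a barcode given as finite (birth, death) pairs and computes the Shannon entropy of the normalized bar lengths. The function name `persistent_entropy` and the input layout are assumptions made for this example.

```python
import math


def persistent_entropy(bars):
    """Shannon entropy of normalized bar lengths in a persistence barcode.

    bars: iterable of (birth, death) pairs with death > birth (finite bars only).
    This is the plain persistent entropy, without any resolution parameter.
    """
    lengths = [death - birth for birth, death in bars]
    if not lengths or any(length <= 0 for length in lengths):
        raise ValueError("need at least one bar, all with positive length")
    total = sum(lengths)
    # p_i = l_i / L, E = -sum_i p_i * log(p_i)
    return -sum((l / total) * math.log(l / total) for l in lengths)
```

A barcode dominated by one long bar gives an entropy near zero, while many bars of similar length give a value near the logarithm of the number of bars.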
Tandem Repeats in Proteins: Prediction Algorithms and Biological Role.
Pellegrini, Marco
2015-01-01
Tandem repetitions in protein sequence and structure is a fascinating subject of research which has been a focus of study since the late 1990s. In this survey, we give an overview on the multi-faceted aspects of research on protein tandem repeats (PTR for short), including prediction algorithms, databases, early classification efforts, mechanisms of PTR formation and evolution, and synthetic PTR design. We also touch on the rather open issue of the relationship between PTR and flexibility (or disorder) in proteins. Detection of PTR either from protein sequence or structure data is challenging due to inherent high (biological) signal-to-noise ratio that is a key feature of this problem. As early in silico analytic tools have been key enablers for starting this field of study, we expect that current and future algorithmic and statistical breakthroughs will have a high impact on the investigations of the biological role of PTR.
Brown, Peter; Pullan, Wayne; Yang, Yuedong; Zhou, Yaoqi
2016-02-01
The three dimensional tertiary structure of a protein at near atomic level resolution provides insight alluding to its function and evolution. As protein structure decides its functionality, similarity in structure usually implies similarity in function. As such, structure alignment techniques are often useful in the classifications of protein function. Given the rapidly growing rate of new, experimentally determined structures being made available from repositories such as the Protein Data Bank, fast and accurate computational structure comparison tools are required. This paper presents SPalignNS, a non-sequential protein structure alignment tool using a novel asymmetrical greedy search technique. The performance of SPalignNS was evaluated against existing sequential and non-sequential structure alignment methods by performing trials with commonly used datasets. These benchmark datasets used to gauge alignment accuracy include (i) 9538 pairwise alignments implied by the HOMSTRAD database of homologous proteins; (ii) a subset of 64 difficult alignments from set (i) that have low structure similarity; (iii) 199 pairwise alignments of proteins with similar structure but different topology; and (iv) a subset of 20 pairwise alignments from the RIPC set. SPalignNS is shown to achieve greater alignment accuracy (lower or comparable root-mean squared distance with increased structure overlap coverage) for all datasets, and the highest agreement with reference alignments from the challenging dataset (iv) above, when compared with both sequentially constrained alignments and other non-sequential alignments. SPalignNS was implemented in C++. The source code, binary executable, and a web server version are freely available at: http://sparks-lab.org. © The Author 2015. Published by Oxford University Press.
Boareto, Marcelo; Yamagishi, Michel E B; Caticha, Nestor; Leite, Vitor B P
2012-10-01
In protein databases there is a substantial number of proteins structurally determined but without function annotation. Understanding the relationship between function and structure can be useful to predict function on a large scale. We have analyzed the similarities in global physicochemical parameters for a set of enzymes which were classified according to the four Enzyme Commission (EC) hierarchical levels. Using relevance theory we introduced a distance between proteins in the space of physicochemical characteristics. This was done by minimizing a cost function of the metric tensor built to reflect the EC classification system. Using an unsupervised clustering method on a set of 1025 enzymes, we obtained no relevant clustering formation compatible with EC classification. The distance distributions between enzymes from the same EC group and from different EC groups were compared by histograms. Such analysis was also performed using sequence alignment similarity as a distance. Our results suggest that global structure parameters are not sufficient to segregate enzymes according to EC hierarchy. This indicates that features essential for function are rather local than global. Consequently, methods for predicting function based on global attributes should not obtain high accuracy in main EC classes prediction without relying on similarities between enzymes from training and validation datasets. Furthermore, these results are consistent with a substantial number of studies suggesting that function evolves fundamentally by recruitment, i.e., a same protein motif or fold can be used to perform different enzymatic functions and a few specific amino acids (AAs) are actually responsible for enzyme activity. These essential amino acids should belong to active sites and an effective method for predicting function should be able to recognize them. Copyright © 2012 Elsevier Ltd. All rights reserved.
An updated evolutionary classification of CRISPR–Cas systems
Makarova, Kira S.; Wolf, Yuri I.; Alkhnbashi, Omer S.; Costa, Fabrizio; Shah, Shiraz A.; Saunders, Sita J.; Barrangou, Rodolphe; Brouns, Stan J. J.; Charpentier, Emmanuelle; Haft, Daniel H.; Horvath, Philippe; Moineau, Sylvain; Mojica, Francisco J. M.; Terns, Rebecca M.; Terns, Michael P.; White, Malcolm F.; Yakunin, Alexander F.; Garrett, Roger A.; van der Oost, John; Backofen, Rolf; Koonin, Eugene V.
2017-01-01
The evolution of CRISPR–cas loci, which encode adaptive immune systems in archaea and bacteria, involves rapid changes, in particular numerous rearrangements of the locus architecture and horizontal transfer of complete loci or individual modules. These dynamics complicate straightforward phylogenetic classification, but here we present an approach combining the analysis of signature protein families and features of the architecture of cas loci that unambiguously partitions most CRISPR–cas loci into distinct classes, types and subtypes. The new classification retains the overall structure of the previous version but is expanded to now encompass two classes, five types and 16 subtypes. The relative stability of the classification suggests that the most prevalent variants of CRISPR–Cas systems are already known. However, the existence of rare, currently unclassifiable variants implies that additional types and subtypes remain to be characterized. PMID:26411297
Protein Kinases in Mammary Gland Development and Carcinogenesis
1998-10-01
conserved features of primary structure and classification of family members. Methods in Enzymology, 200:38-79, 1991. 23. Nairn AC and Picciotto MR... invertase of S. cerevisiae. Molecular and Cellular Biology, 14:2958-2965, 1994. 29. Drewes G, Ebneth A, Preuss U, Mandelkow EM and Mandelkow E. MARK, a novel
Hayat, Maqsood; Tahir, Muhammad
2015-08-01
Membrane protein is a central component of the cell that manages intra and extracellular processes. Membrane proteins execute a diversity of functions that are vital for the survival of organisms. The topology of transmembrane proteins describes the number of transmembrane (TM) helix segments and its orientation. However, owing to the lack of its recognized structures, the identification of TM helix and its topology through experimental methods is laborious with low throughput. In order to identify TM helix segments reliably, accurately, and effectively from topogenic sequences, we propose the PSOFuzzySVM-TMH model. In this model, evolutionary based information position specific scoring matrix and discrete based information 6-letter exchange group are used to formulate transmembrane protein sequences. The noisy and extraneous attributes are eradicated using an optimization selection technique, particle swarm optimization, from both feature spaces. Finally, the selected feature spaces are combined in order to form ensemble feature space. Fuzzy-support vector Machine is utilized as a classification algorithm. Two benchmark datasets, including low and high resolution datasets, are used. At various levels, the performance of the PSOFuzzySVM-TMH model is assessed through 10-fold cross validation test. The empirical results reveal that the proposed framework PSOFuzzySVM-TMH outperforms in terms of classification performance in the examined datasets. It is ascertained that the proposed model might be a useful and high throughput tool for academia and research community for further structure and functional studies on transmembrane proteins.
Suzuki, K; Kirisako, T; Kamada, Y; Mizushima, N; Noda, T; Ohsumi, Y
2001-11-01
Macroautophagy is a bulk degradation process induced by starvation in eukaryotic cells. In yeast, 15 Apg proteins coordinate the formation of autophagosomes. Several key reactions performed by these proteins have been described, but a comprehensive understanding of the overall network is still lacking. Based on Apg protein localization, we have identified a novel structure that functions in autophagosome formation. This pre-autophagosomal structure, containing at least five Apg proteins, i.e. Apg1p, Apg2p, Apg5p, Aut7p/Apg8p and Apg16p, is localized in the vicinity of the vacuole. Analysis of apg mutants revealed that the formation of both a phosphatidylethanolamine-conjugated Aut7p and an Apg12p- Apg5p conjugate is essential for the localization of Aut7p to the pre-autophagosomal structure. Vps30p/Apg6p and Apg14p, components of an autophagy- specific phosphatidylinositol 3-kinase complex, Apg9p and Apg16p are all required for the localization of Apg5p and Aut7p to the structure. The Apg1p protein kinase complex functions in the late stage of autophagosome formation. Here, we present the classification of Apg proteins into three groups that reflect each step of autophagosome formation.
Suzuki, Kuninori; Kirisako, Takayoshi; Kamada, Yoshiaki; Mizushima, Noboru; Noda, Takeshi; Ohsumi, Yoshinori
2001-01-01
Macroautophagy is a bulk degradation process induced by starvation in eukaryotic cells. In yeast, 15 Apg proteins coordinate the formation of autophagosomes. Several key reactions performed by these proteins have been described, but a comprehensive understanding of the overall network is still lacking. Based on Apg protein localization, we have identified a novel structure that functions in autophagosome formation. This pre-autophagosomal structure, containing at least five Apg proteins, i.e. Apg1p, Apg2p, Apg5p, Aut7p/Apg8p and Apg16p, is localized in the vicinity of the vacuole. Analysis of apg mutants revealed that the formation of both a phosphatidylethanolamine-conjugated Aut7p and an Apg12p– Apg5p conjugate is essential for the localization of Aut7p to the pre-autophagosomal structure. Vps30p/Apg6p and Apg14p, components of an autophagy- specific phosphatidylinositol 3-kinase complex, Apg9p and Apg16p are all required for the localization of Apg5p and Aut7p to the structure. The Apg1p protein kinase complex functions in the late stage of autophagosome formation. Here, we present the classification of Apg proteins into three groups that reflect each step of autophagosome formation. PMID:11689437
Measuring and comparing structural fluctuation patterns in large protein datasets.
Fuglebakk, Edvin; Echave, Julián; Reuter, Nathalie
2012-10-01
The function of a protein depends not only on its structure but also on its dynamics. This is at the basis of a large body of experimental and theoretical work on protein dynamics. Further insight into the dynamics-function relationship can be gained by studying the evolutionary divergence of protein motions. To investigate this, we need appropriate comparative dynamics methods. The most used dynamical similarity score is the correlation between the root mean square fluctuations (RMSF) of aligned residues. Despite its usefulness, RMSF is in general less evolutionarily conserved than the native structure. A fundamental issue is whether RMSF is not as conserved as structure because dynamics is less conserved or because RMSF is not the best property to use to study its conservation. We performed a systematic assessment of several scores that quantify the (dis)similarity between protein fluctuation patterns. We show that the best scores perform as well as or better than structural dissimilarity, as assessed by their consistency with the SCOP classification. We conclude that to uncover the full extent of the evolutionary conservation of protein fluctuation patterns, it is important to measure the directions of fluctuations and their correlations between sites. Supplementary data are available at Bioinformatics Online.
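The RMSF-based similarity score mentioned in the abstract above can be illustrated with a short sketch. It assumes the two RMSF profiles have already been restricted to aligned residue pairs, so that position i in one list corresponds to position i in the other, and it uses the Pearson correlation as the correlation measure; the abstract does not specify which correlation the authors used, so that choice and the name `rmsf_correlation` are assumptions of this example.

```python
import math


def rmsf_correlation(rmsf_a, rmsf_b):
    """Pearson correlation between two RMSF profiles over aligned residues.

    rmsf_a, rmsf_b: equal-length sequences of per-residue RMSF values,
    ordered so that index i refers to the same aligned position in both.
    """
    if len(rmsf_a) != len(rmsf_b) or len(rmsf_a) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(rmsf_a)
    mean_a = sum(rmsf_a) / n
    mean_b = sum(rmsf_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rmsf_a, rmsf_b))
    var_a = sum((a - mean_a) ** 2 for a in rmsf_a)
    var_b = sum((b - mean_b) ** 2 for b in rmsf_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)
```

Because this score uses only fluctuation magnitudes, it carries no information about the directions of motion or their correlations between sites, which is the limitation the abstract points to.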
WEBnm@ v2.0: Web server and services for comparing protein flexibility.
Tiwari, Sandhya P; Fuglebakk, Edvin; Hollup, Siv M; Skjærven, Lars; Cragnolini, Tristan; Grindhaug, Svenn H; Tekle, Kidane M; Reuter, Nathalie
2014-12-30
Normal mode analysis (NMA) using elastic network models is a reliable and cost-effective computational method to characterise protein flexibility and by extension, their dynamics. Further insight into the dynamics-function relationship can be gained by comparing protein motions between protein homologs and functional classifications. This can be achieved by comparing normal modes obtained from sets of evolutionary related proteins. We have developed an automated tool for comparative NMA of a set of pre-aligned protein structures. The user can submit a sequence alignment in the FASTA format and the corresponding coordinate files in the Protein Data Bank (PDB) format. The computed normalised squared atomic fluctuations and atomic deformation energies of the submitted structures can be easily compared on graphs provided by the web user interface. The web server provides pairwise comparison of the dynamics of all proteins included in the submitted set using two measures: the Root Mean Squared Inner Product and the Bhattacharyya Coefficient. The Comparative Analysis has been implemented on our web server for NMA, WEBnm@, which also provides recently upgraded functionality for NMA of single protein structures. This includes new visualisations of protein motion, visualisation of inter-residue correlations and the analysis of conformational change using the overlap analysis. In addition, programmatic access to WEBnm@ is now available through a SOAP-based web service. Webnm@ is available at http://apps.cbu.uib.no/webnma . WEBnm@ v2.0 is an online tool offering unique capability for comparative NMA on multiple protein structures. Along with a convenient web interface, powerful computing resources, and several methods for mode analyses, WEBnm@ facilitates the assessment of protein flexibility within protein families and superfamilies. These analyses can give a good view of how the structures move and how the flexibility is conserved over the different structures.
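Of the two comparison measures named in the abstract above, the Root Mean Squared Inner Product (RMSIP) is simple enough to sketch directly. The example below uses the commonly cited definition, the square root of the summed squared dot products between two sets of n modes divided by n. It assumes both mode sets are orthonormal and expressed over the same aligned coordinates; the function name `rmsip` and the list-of-lists layout are illustrative choices, not the WEBnm@ interface.

```python
import math


def rmsip(modes_a, modes_b):
    """Root Mean Squared Inner Product between two sets of normal modes.

    modes_a, modes_b: lists of n mode vectors (each a sequence of floats of
    equal length), assumed orthonormal and defined on aligned coordinates.
    Returns a value between 0 (orthogonal subspaces) and 1 (same subspace).
    """
    n = len(modes_a)
    if n == 0 or len(modes_b) != n:
        raise ValueError("both sets must contain the same non-zero number of modes")
    total = 0.0
    for u in modes_a:
        for v in modes_b:
            dot = sum(x * y for x, y in zip(u, v))
            total += dot * dot
    return math.sqrt(total / n)
```

In practice the lowest-frequency modes (often the first ten) of each structure would be passed in, after restricting them to the residues aligned in the submitted sequence alignment.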
Zheng, Heping; Shabalin, Ivan G.; Handing, Katarzyna B.; Bujnicki, Janusz M.; Minor, Wladek
2015-01-01
The ubiquitous presence of magnesium ions in RNA has long been recognized as a key factor governing RNA folding, and is crucial for many diverse functions of RNA molecules. In this work, Mg2+-binding architectures in RNA were systematically studied using a database of RNA crystal structures from the Protein Data Bank (PDB). Due to the abundance of poorly modeled or incorrectly identified Mg2+ ions, the set of all sites was comprehensively validated and filtered to identify a benchmark dataset of 15 334 ‘reliable’ RNA-bound Mg2+ sites. The normalized frequencies by which specific RNA atoms coordinate Mg2+ were derived for both the inner and outer coordination spheres. A hierarchical classification system of Mg2+ sites in RNA structures was designed and applied to the benchmark dataset, yielding a set of 41 types of inner-sphere and 95 types of outer-sphere coordinating patterns. This classification system has also been applied to describe six previously reported Mg2+-binding motifs and detect them in new RNA structures. Investigation of the most populous site types resulted in the identification of seven novel Mg2+-binding motifs, and all RNA structures in the PDB were screened for the presence of these motifs. PMID:25800744
Quantification of the Spatial Organization of the Nuclear Lamina as a Tool for Cell Classification
Righolt, Christiaan H.; Zatreanu, Diana A.; Raz, Vered
2013-01-01
The nuclear lamina is the structural scaffold of the nuclear envelope that plays multiple regulatory roles in chromatin organization and gene expression as well as a structural role in nuclear stability. The lamina proteins, also referred to as lamins, determine nuclear lamina organization and define the nuclear shape and the structural integrity of the cell nucleus. In addition, lamins are connected with both nuclear and cytoplasmic structures forming a dynamic cellular structure whose shape changes upon external and internal signals. When bound to the nuclear lamina, the lamins are mobile, have an impact on the nuclear envelop structure, and may induce changes in their regulatory functions. Changes in the nuclear lamina shape cause changes in cellular functions. A quantitative description of these structural changes could provide an unbiased description of changes in cellular function. In this review, we describe how changes in the nuclear lamina can be measured from three-dimensional images of lamins at the nuclear envelope, and we discuss how structural changes of the nuclear lamina can be used for cell classification. PMID:27335676
Quantification of the Spatial Organization of the Nuclear Lamina as a Tool for Cell Classification.
Righolt, Christiaan H; Zatreanu, Diana A; Raz, Vered
2013-01-01
The nuclear lamina is the structural scaffold of the nuclear envelope that plays multiple regulatory roles in chromatin organization and gene expression as well as a structural role in nuclear stability. The lamina proteins, also referred to as lamins, determine nuclear lamina organization and define the nuclear shape and the structural integrity of the cell nucleus. In addition, lamins are connected with both nuclear and cytoplasmic structures forming a dynamic cellular structure whose shape changes upon external and internal signals. When bound to the nuclear lamina, the lamins are mobile, have an impact on the nuclear envelop structure, and may induce changes in their regulatory functions. Changes in the nuclear lamina shape cause changes in cellular functions. A quantitative description of these structural changes could provide an unbiased description of changes in cellular function. In this review, we describe how changes in the nuclear lamina can be measured from three-dimensional images of lamins at the nuclear envelope, and we discuss how structural changes of the nuclear lamina can be used for cell classification.
Complete fold annotation of the human proteome using a novel structural feature space
Middleton, Sarah A.; Illuminati, Joseph; Kim, Junhyong
2017-04-13
Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding and recognizing novel folds is difficult such that the majority of proteins have not been annotated for fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Finally, our method can be applied to de novo fold prediction of entire proteomes and identify candidate novel fold families.
Specificity and non-specificity in RNA–protein interactions
Jankowsky, Eckhard; Harris, Michael E.
2016-01-01
Gene expression is regulated by complex networks of interactions between RNAs and proteins. Proteins that interact with RNA have been traditionally viewed as either specific or non-specific; specific proteins interact preferentially with defined RNA sequence or structure motifs, whereas non-specific proteins interact with RNA sites devoid of such characteristics. Recent studies indicate that the binary “specific vs. non-specific” classification is insufficient to describe the full spectrum of RNA–protein interactions. Here, we review new methods that enable quantitative measurements of protein binding to large numbers of RNA variants, and the concepts aimed as describing resulting binding spectra: affinity distributions, comprehensive binding models and free energy landscapes. We discuss how these new methodologies and associated concepts enable work towards inclusive, quantitative models for specific and non-specific RNA–protein interactions. PMID:26285679
Zebra: a web server for bioinformatic analysis of diverse protein families.
Suplatov, Dmitry; Kirilin, Evgeny; Takhaveev, Vakil; Svedas, Vytas
2014-01-01
During evolution of proteins from a common ancestor, one functional property can be preserved while others can vary leading to functional diversity. A systematic study of the corresponding adaptive mutations provides a key to one of the most challenging problems of modern structural biology - understanding the impact of amino acid substitutions on protein function. The subfamily-specific positions (SSPs) are conserved within functional subfamilies but are different between them and, therefore, seem to be responsible for functional diversity in protein superfamilies. Consequently, a corresponding method to perform the bioinformatic analysis of sequence and structural data has to be implemented in the common laboratory practice to study the structure-function relationship in proteins and develop novel protein engineering strategies. This paper describes Zebra web server - a powerful remote platform that implements a novel bioinformatic analysis algorithm to study diverse protein families. It is the first application that provides specificity determinants at different levels of functional classification, therefore addressing complex functional diversity of large superfamilies. Statistical analysis is implemented to automatically select a set of highly significant SSPs to be used as hotspots for directed evolution or rational design experiments and analyzed studying the structure-function relationship. Zebra results are provided in two ways - (1) as a single all-in-one parsable text file and (2) as PyMol sessions with structural representation of SSPs. Zebra web server is available at http://biokinet.belozersky.msu.ru/zebra .
Chen, Yifei; Sun, Yuxing; Han, Bing-Qing
2015-01-01
Protein interaction article classification is a text classification task in the biological domain to determine which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used for reducing the dimensionality of features to speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measure of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. Hence, first we design a similarity measure between the context information to take word cooccurrences and phrase chunks around the features into account. Then we introduce the similarity of context information to the importance measure of the features to substitute the document and term frequency. Hence we propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.
Iterative non-sequential protein structural alignment.
Salem, Saeed; Zaki, Mohammed J; Bystroff, Christopher
2009-06-01
Structural similarity between proteins gives us insights into their evolutionary relationships when there is low sequence similarity. In this paper, we present a novel approach called SNAP for non-sequential pair-wise structural alignment. Starting from an initial alignment, our approach iterates over a two-step process consisting of a superposition step and an alignment step, until convergence. We propose a novel greedy algorithm to construct both sequential and non-sequential alignments. The quality of SNAP alignments was assessed by comparing against the manually curated reference alignments in the challenging SISY and RIPC datasets. Moreover, when applied to a dataset of 4410 protein pairs selected from the CATH database, SNAP produced longer alignments with lower rmsd than several state-of-the-art alignment methods. Classification of folds using SNAP alignments was both highly sensitive and highly selective. The SNAP software along with the datasets is available online at http://www.cs.rpi.edu/~zaki/software/SNAP.
Kikuchi, Y; Tamiya, N
1987-01-01
The proteins in the hinge ligaments of molluscan bivalves were subjected to chemotaxonomic studies according to their amino acid compositions. The hinge-ligament protein is a new class of structure proteins, and this is the first attempt to introduce chemical taxonomy into the systematics of bivalves. The hinge-ligament proteins from morphologically close species, namely mactra (superfamily Mactracea) or scallop (family Pectinidae) species, showed high intraspecific homology in their compositions. On the other hand, inconsistent results were obtained with two types of ligament proteins in pearl oyster species (genus Pinctada). The results of our chemotaxonomic analyses were sometimes in good agreement with the morphological classifications and sometimes inconsistent, implying a complicated phylogenetic relationship among the species. PMID:3593265
Classification of ligand molecules in PDB with fast heuristic graph match algorithm COMPLIG.
Saito, Mihoko; Takemura, Naomi; Shirai, Tsuyoshi
2012-12-14
A fast heuristic graph-matching algorithm, COMPLIG, was devised to classify the small-molecule ligands in the Protein Data Bank (PDB), which are currently not properly classified on structure basis. By concurrently classifying proteins and ligands, we determined the most appropriate parameter for categorizing ligands to be more than 60% identity of atoms and bonds between molecules, and we classified 11,585 types of ligands into 1946 clusters. Although the large clusters were composed of nucleotides or amino acids, a significant presence of drug compounds was also observed. Application of the system to classify the natural ligand status of human proteins in the current database suggested that, at most, 37% of the experimental structures of human proteins were in complex with natural ligands. However, protein homology- and/or ligand similarity-based modeling was implied to provide models of natural interactions for an additional 28% of the total, which might be used to increase the knowledge of intrinsic protein-metabolite interactions. Copyright © 2012 Elsevier Ltd. All rights reserved.
Gregoretti, Francesco; Cesarini, Elisa; Lanzuolo, Chiara; Oliva, Gennaro; Antonelli, Laura
2016-01-01
The large amount of data generated in biological experiments that rely on advanced microscopy can be handled only with automated image analysis. Most analyses require a reliable cell image segmentation eventually capable of detecting subcellular structures. We present an automatic segmentation method to detect Polycomb group (PcG) protein areas isolated from nuclei regions in high-resolution fluorescent cell image stacks. It combines two segmentation algorithms that use an active contour model and a classification technique serving as a tool to better understand the subcellular three-dimensional distribution of PcG proteins in live cell image sequences. We obtained accurate results throughout several cell image datasets, coming from different cell types and corresponding to different fluorescent labels, without requiring elaborate adjustments to each dataset.
USDA-ARS's Scientific Manuscript database
Triacylglycerols (TAG) are the major molecules of energy storage in eukaryotes. TAG are packed in subcellular structures called oil bodies or lipid droplets. Oleosins (OLE) are the major proteins in plant oil bodies. Multiple isoforms of OLE are present in plants such as tung tree (Vernicia fordii),...
Caumes, Géraldine; Borrel, Alexandre; Abi Hussein, Hiba; Camproux, Anne-Claude; Regad, Leslie
2017-09-01
Small molecules interact with their protein target on surface cavities known as binding pockets. Pocket-based approaches are very useful in all of the phases of drug design. Their first step is estimating the binding pocket based on protein structure. The available pocket-estimation methods produce different pockets for the same target. The aim of this work is to investigate the effects of different pocket-estimation methods on the results of pocket-based approaches. We focused on the effect of three pocket-estimation methods on a pocket-ligand (PL) classification. This pocket-based approach is useful for understanding the correspondence between the pocket and ligand spaces and to develop pharmacological profiling models. We found pocket-estimation methods yield different binding pockets in terms of boundaries and properties. These differences are responsible for the variation in the PL classification results that can have an impact on the detected correspondence between pocket and ligand profiles. Thus, we highlighted the importance of the pocket-estimation method choice in pocket-based approaches. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
Li, Jieyue; Xiong, Liang; Schneider, Jeff; Murphy, Robert F
2012-06-15
Knowledge of the subcellular location of a protein is crucial for understanding its functions. The subcellular pattern of a protein is typically represented as the set of cellular components in which it is located, and an important task is to determine this set from microscope images. In this article, we address this classification problem using confocal immunofluorescence images from the Human Protein Atlas (HPA) project. The HPA contains images of cells stained for many proteins; each is also stained for three reference components, but there are many other components that are invisible. Given one such cell, the task is to classify the pattern type of the stained protein. We first randomly select local image regions within the cells, and then extract various carefully designed features from these regions. This region-based approach enables us to explicitly study the relationship between proteins and different cell components, as well as the interactions between these components. To achieve these two goals, we propose two discriminative models that extend logistic regression with structured latent variables. The first model allows the same protein pattern class to be expressed differently according to the underlying components in different regions. The second model further captures the spatial dependencies between the components within the same cell so that we can better infer these components. To learn these models, we propose a fast approximate algorithm for inference, and then use gradient-based methods to maximize the data likelihood. In the experiments, we show that the proposed models help improve the classification accuracies on synthetic data and real cellular images. The best overall accuracy we report in this article for classifying 942 proteins into 13 classes of patterns is about 84.6%, which to our knowledge is the best so far. In addition, the dependencies learned are consistent with prior knowledge of cell organization. 
http://murphylab.web.cmu.edu/software/.
Investigating homology between proteins using energetic profiles.
Wrabl, James O; Hilser, Vincent J
2010-03-26
Accumulated experimental observations demonstrate that protein stability is often preserved upon conservative point mutation. In contrast, less is known about the effects of large sequence or structure changes on the stability of a particular fold. Almost completely unknown is the degree to which stability of different regions of a protein is generally preserved throughout evolution. In this work, these questions are addressed through thermodynamic analysis of a large representative sample of protein fold space based on remote, yet accepted, homology. More than 3,000 proteins were computationally analyzed using the structural-thermodynamic algorithm COREX/BEST. Estimated position-specific stability (i.e., local Gibbs free energy of folding) and its component enthalpy and entropy were quantitatively compared between all proteins in the sample according to all-vs.-all pairwise structural alignment. It was discovered that the local stabilities of homologous pairs were significantly more correlated than those of non-homologous pairs, indicating that local stability was indeed generally conserved throughout evolution. However, the position-specific enthalpy and entropy underlying stability were less correlated, suggesting that the overall regional stability of a protein was more important than the thermodynamic mechanism utilized to achieve that stability. Finally, two different types of statistically exceptional evolutionary structure-thermodynamic relationships were noted. First, many homologous proteins contained regions of similar thermodynamics despite localized structure change, suggesting a thermodynamic mechanism enabling evolutionary fold change. Second, some homologous proteins with extremely similar structures nonetheless exhibited different local stabilities, a phenomenon previously observed experimentally in this laboratory. 
These two observations, in conjunction with the principal conclusion that homologous proteins generally conserved local stability, may provide guidance for a future thermodynamically informed classification of protein homology.
Linear regression models for solvent accessibility prediction in proteins.
Wagner, Michael; Adamczak, Rafał; Porollo, Aleksey; Meller, Jarosław
2005-04-01
The relative solvent accessibility (RSA) of an amino acid residue in a protein structure is a real number that represents the solvent exposed surface area of this residue in relative terms. The problem of predicting the RSA from the primary amino acid sequence can therefore be cast as a regression problem. Nevertheless, RSA prediction has so far typically been cast as a classification problem. Consequently, various machine learning techniques have been used within the classification framework to predict whether a given amino acid exceeds some (arbitrary) RSA threshold and would thus be predicted to be "exposed," as opposed to "buried." We have recently developed novel methods for RSA prediction using nonlinear regression techniques which provide accurate estimates of the real-valued RSA and outperform classification-based approaches with respect to commonly used two-class projections. However, while their performance seems to provide a significant improvement over previously published approaches, these Neural Network (NN) based methods are computationally expensive to train and involve several thousand parameters. In this work, we develop alternative regression models for RSA prediction which are computationally much less expensive, involve orders-of-magnitude fewer parameters, and are still competitive in terms of prediction quality. In particular, we investigate several regression models for RSA prediction using linear L1-support vector regression (SVR) approaches as well as standard linear least squares (LS) regression. Using rigorously derived validation sets of protein structures and extensive cross-validation analysis, we compare the performance of the SVR with that of LS regression and NN-based methods. In particular, we show that the flexibility of the SVR (as encoded by metaparameters such as the error insensitivity and the error penalization terms) can be very beneficial to optimize the prediction accuracy for buried residues. 
We conclude that the simple and computationally much more efficient linear SVR performs comparably to nonlinear models and thus can be used in order to facilitate further attempts to design more accurate RSA prediction methods, with applications to fold recognition and de novo protein structure prediction methods.
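The contrast drawn above between treating RSA as a regression target and as a two-class problem can be sketched as follows. This is only an illustration under simplifying assumptions: it fits ordinary least squares on a single made-up feature rather than the linear SVR or the multi-feature models of the paper, and the feature values, the 0.25 threshold and all function names are hypothetical.

```python
def fit_least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_rsa(x, slope, intercept):
    """Real-valued RSA estimate, clipped to the valid range [0, 1]."""
    return min(1.0, max(0.0, slope * x + intercept))


def two_class_projection(rsa, threshold=0.25):
    """Project a real-valued RSA onto 'exposed' / 'buried' (threshold is arbitrary)."""
    return "exposed" if rsa > threshold else "buried"


# Hypothetical training data: one sequence-derived feature per residue
# and the observed RSA of that residue.
features = [0.0, 1.0, 2.0, 3.0]
rsa_values = [0.1, 0.3, 0.5, 0.7]

slope, intercept = fit_least_squares(features, rsa_values)
for x in (0.5, 2.5):
    rsa = predict_rsa(x, slope, intercept)
    print(x, round(rsa, 3), two_class_projection(rsa))
```

The regression keeps the real-valued estimate, so any threshold can be applied afterwards; a classifier trained on one threshold would have to be retrained for another.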
Molecular classification of liver cirrhosis in a rat model by proteomics and bioinformatics.
Xu, Xiu-Qin; Leow, Chon K; Lu, Xin; Zhang, Xuegong; Liu, Jun S; Wong, Wing-Hung; Asperger, Arndt; Deininger, Sören; Eastwood Leung, Hon-Chiu
2004-10-01
Liver cirrhosis is a worldwide health problem. Reliable, noninvasive methods for early detection of liver cirrhosis are not available. Using a three-step approach, we classified sera from rats with liver cirrhosis following different treatment insults. The approach consisted of: (i) protein profiling using surface-enhanced laser desorption/ionization (SELDI) technology; (ii) selection of a statistically significant serum biomarker set using machine learning algorithms; and (iii) identification of selected serum biomarkers by peptide sequencing. We generated serum protein profiles from three groups of rats: (i) normal (n=8), (ii) thioacetamide-induced liver cirrhosis (n=22), and (iii) bile duct ligation-induced liver fibrosis (n=5) using a weak cation exchanger surface. Profiling data were further analyzed by a recursive support vector machine algorithm to select a panel of statistically significant biomarkers for class prediction. Sensitivity and specificity of classification using the selected protein marker set were higher than 92%. A consistently down-regulated 3495 Da protein in cirrhosis samples was one of the selected significant biomarkers. This 3495 Da protein was purified on-chip and trypsin digested. Further structural characterization of this biomarker candidate was done by using cross-platform matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) peptide mass fingerprinting (PMF) and matrix-assisted laser desorption/ionization time of flight/time of flight (MALDI-TOF/TOF) tandem mass spectrometry (MS/MS). Combined data from PMF and MS/MS spectra of two tryptic peptides suggested that this 3495 Da protein shared homology to a histidine-rich glycoprotein. These results demonstrated a novel approach to discovery of new biomarkers for early detection of liver cirrhosis and classification of liver diseases.
Pathophysiology of keratinization
Deo, Priya Nimish; Deshmukh, Revati
2018-01-01
Cytoskeleton of a cell is made up of microfilaments, microtubules and intermediate filaments. Keratins are diverse proteins. These intermediate filaments maintain the structural integrity of the keratinocytes. The word keratin covers these intermediate filament-forming proteins within the keratinocytes. They are expressed in a specific pattern and according to the stage of cellular differentiation. They always occur in pairs. Mutations in the genes which regulate the expression of keratin proteins are associated with a number of disorders which show defects in both skin and mucosa. In addition, there are a number of disorders which are seen because of abnormal keratinization. These keratins and keratin-associated proteins have become important markers in diagnostic pathology. This review article discusses the classification, structure, functions, the stains used for the demonstration of keratin and associated pathology. The review describes the physiology of keratinization, pathology behind abnormal keratin formation and various keratin disorders. PMID:29731562
DOE Office of Scientific and Technical Information (OSTI.GOV)
Middleton, Sarah A.; Illuminati, Joseph; Kim, Junhyong
Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding and recognizing novel folds is difficult such that the majority of proteins have not been annotated for fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Finally, our method can be applied to de novo fold prediction of entire proteomes and identify candidate novel fold families.
URS DataBase: universe of RNA structures and their motifs.
Baulin, Eugene; Yacovlev, Victor; Khachko, Denis; Spirin, Sergei; Roytberg, Mikhail
2016-01-01
The Universe of RNA Structures DataBase (URSDB) stores information obtained from all RNA-containing PDB entries (2935 entries in October 2015). The content of the database is updated regularly. The database consists of 51 tables containing indexed data on various elements of the RNA structures. The database provides a web interface allowing the user to select a subset of structures with desired features and to obtain various statistical data for a selected subset of structures or for all structures. In particular, one can easily obtain statistics on geometric parameters of base pairs, on structural motifs (stems, loops, etc.) or on different types of pseudoknots. The user can also view and get information on an individual structure or its selected parts, e.g. RNA-protein hydrogen bonds. URSDB employs a new original definition of loops in RNA structures. That definition fits both pseudoknot-free and pseudoknotted secondary structures and coincides with the classical definition in the case of pseudoknot-free structures. To our knowledge, URSDB is the first database supporting searches based on topological classification of pseudoknots and on extended loop classification. Database URL: http://server3.lpm.org.ru/urs/. © The Author(s) 2016. Published by Oxford University Press.
Identification of structural domains in proteins by a graph heuristic.
Wernisch, L; Hunting, M; Wodak, S J
1999-05-15
A novel automatic procedure for identifying domains from protein atomic coordinates is presented. The procedure, termed STRUDL (STRUctural Domain Limits), does not take into account information on secondary structures and handles any number of domains made up of contiguous or non-contiguous chain segments. The core algorithm uses the Kernighan-Lin graph heuristic to partition the protein into residue sets which display minimum interactions between them. These interactions are deduced from the weighted Voronoi diagram. The generated partitions are accepted or rejected on the basis of optimized criteria, representing basic expected physical properties of structural domains. The graph heuristic approach is shown to be very effective: it closely approximates the exact solution provided by a branch and bound algorithm for a number of test proteins. In addition, the overall performance of STRUDL is assessed on a set of 787 representative proteins from the Protein Data Bank by comparison to domain definitions in the CATH protein classification. The domains assigned by STRUDL agree with the CATH assignments in at least 81% of the tested proteins. This result is comparable to that obtained previously using PUU (Holm and Sander, Proteins 1994;9:256-268), the only other available algorithm designed to identify domains with any number of non-contiguous chain segments. A detailed discussion of the structures for which our assignments differ from those in CATH brings to light some clear inconsistencies between the concept of structural domains based on minimizing inter-domain interactions and that of delimiting structural motifs that represent acceptable folding topologies or architectures. Considering both concepts as complementary and combining them in a layered approach might be the way forward.
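To make the partitioning step concrete, the following is a minimal sketch of a Kernighan-Lin bisection applied to a small residue contact graph. It is not the STRUDL implementation: the contact weights here are invented integers rather than values derived from a weighted Voronoi diagram, the split is forced into two equal halves, and the acceptance criteria for partitions are omitted. All names are assumptions made for illustration.

```python
def cut_weight(weights, part_a, part_b):
    """Total interaction weight crossing the partition."""
    return sum(weights.get(frozenset((a, b)), 0) for a in part_a for b in part_b)


def kernighan_lin(nodes, weights, max_passes=10):
    """Bisect `nodes` into two equal halves with a small cut weight.

    `weights` maps frozenset({u, v}) to a non-negative interaction weight.
    """
    def w(u, v):
        return weights.get(frozenset((u, v)), 0)

    half = len(nodes) // 2
    part_a, part_b = set(nodes[:half]), set(nodes[half:])
    for _ in range(max_passes):
        # D[v] = external cost minus internal cost of node v
        d = {}
        for v in nodes:
            own, other = (part_a, part_b) if v in part_a else (part_b, part_a)
            d[v] = sum(w(v, u) for u in other) - sum(w(v, u) for u in own if u != v)
        free_a, free_b = set(part_a), set(part_b)
        swaps, gains = [], []
        while free_a and free_b:
            # pick the unlocked pair whose exchange reduces the cut the most
            gain, a, b = max(
                ((d[x] + d[y] - 2 * w(x, y), x, y) for x in free_a for y in free_b),
                key=lambda t: t[0],
            )
            swaps.append((a, b))
            gains.append(gain)
            free_a.remove(a)
            free_b.remove(b)
            for x in free_a:
                d[x] += 2 * w(x, a) - 2 * w(x, b)
            for y in free_b:
                d[y] += 2 * w(y, b) - 2 * w(y, a)
        # keep only the prefix of tentative swaps with the best cumulative gain
        best_k, best_total, total = 0, 0, 0
        for k, g in enumerate(gains, start=1):
            total += g
            if total > best_total:
                best_k, best_total = k, total
        if best_k == 0:
            break  # no improving prefix: local optimum reached
        for a, b in swaps[:best_k]:
            part_a.remove(a)
            part_b.add(a)
            part_b.remove(b)
            part_a.add(b)
    return part_a, part_b


# Hypothetical contact graph: residues 0-2 and 3-5 form two tight clusters
# joined by a single weak contact between residues 2 and 3.
contacts = [(0, 1, 5), (0, 2, 5), (1, 2, 5), (3, 4, 5), (3, 5, 5), (4, 5, 5), (2, 3, 1)]
weights = {frozenset((u, v)): wt for u, v, wt in contacts}

# The node order deliberately starts from a poor, interleaved partition.
domain_a, domain_b = kernighan_lin([0, 3, 1, 4, 2, 5], weights)
print(sorted(domain_a), sorted(domain_b), cut_weight(weights, domain_a, domain_b))
```

Starting from the interleaved split, a single exchange of residues 2 and 3 recovers the two clusters, and the second pass finds no further improving swap. Keeping only the best prefix of tentative swaps is what lets the heuristic step through temporarily worse exchanges within a pass.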
Structure-functional prediction and analysis of cancer mutation effects in protein kinases.
Dixit, Anshuman; Verkhivker, Gennady M
2014-01-01
A central goal of cancer research is to discover and characterize the functional effects of mutated genes that contribute to tumorigenesis. In this study, we provide a detailed structural classification and analysis of functional dynamics for members of protein kinase families that are known to harbor cancer mutations. We also present a systematic computational analysis that combines sequence and structure-based prediction models to characterize the effect of cancer mutations in protein kinases. We focus on the differential effects of activating point mutations that increase protein kinase activity and kinase-inactivating mutations that decrease activity. Mapping of cancer mutations onto the conformational mobility profiles of known crystal structures demonstrated that activating mutations could reduce a steric barrier for the movement from the basal "low" activity state to the "active" state. According to our analysis, the mechanism of activating mutations reflects a combined effect of partial destabilization of the kinase in its inactive state and a concomitant stabilization of its active-like form, which is likely to drive tumorigenesis at some level. Ultimately, the analysis of the evolutionary and structural features of the major cancer-causing mutational hotspot in kinases can also aid in the correlation of kinase mutation effects with clinical outcomes.
Sequence composition and environment effects on residue fluctuations in protein structures
NASA Astrophysics Data System (ADS)
Ruvinsky, Anatoly M.; Vakser, Ilya A.
2010-10-01
Structure fluctuations in proteins affect a broad range of cell phenomena, including stability of proteins and their fragments, allosteric transitions, and energy transfer. This study presents a statistical-thermodynamic analysis of the relationship between the sequence composition and the distribution of residue fluctuations in protein-protein complexes. A one-node-per-residue elastic network model accounting for the nonhomogeneous protein mass distribution and the interatomic interactions through the renormalized inter-residue potential is developed. Two factors, a protein mass distribution and a residue environment, were found to determine the scale of residue fluctuations. Surface residues undergo larger fluctuations than core residues, in agreement with experimental observations. Ranking residues over the normalized scale of fluctuations yields a distinct classification of amino acids into three groups: (i) highly fluctuating: Gly, Ala, Ser, Pro, and Asp; (ii) moderately fluctuating: Thr, Asn, Gln, Lys, Glu, Arg, Val, and Cys; and (iii) weakly fluctuating: Ile, Leu, Met, Phe, Tyr, Trp, and His. The structural instability in proteins possibly relates to the high content of the highly fluctuating residues and a deficiency of the weakly fluctuating residues in irregular secondary structure elements (loops), chameleon sequences, and disordered proteins. Strong correlation between residue fluctuations and the sequence composition of protein loops supports this hypothesis. Comparing fluctuations of binding site residues (interface residues) with other surface residues shows that, on average, the interface is more rigid than the rest of the protein surface and Gly, Ala, Ser, Cys, Leu, and Trp have a propensity to form more stable docking patches on the interface. The findings have broad implications for understanding mechanisms of protein association and stability of protein structures.
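The three-group classification listed above can be turned into a small composition check. The sketch below only encodes the group membership stated in the abstract and reports what fraction of a sequence falls into each group; it does not reproduce the elastic network model, and the function name and the example peptide are hypothetical.

```python
# Group membership as stated in the abstract (one-letter amino acid codes).
FLUCTUATION_GROUPS = {
    "high": set("GASPD"),         # Gly, Ala, Ser, Pro, Asp
    "moderate": set("TNQKERVC"),  # Thr, Asn, Gln, Lys, Glu, Arg, Val, Cys
    "weak": set("ILMFYWH"),       # Ile, Leu, Met, Phe, Tyr, Trp, His
}


def fluctuation_composition(sequence):
    """Fraction of residues in each fluctuation group for a one-letter sequence."""
    sequence = sequence.upper()
    counts = {group: 0 for group in FLUCTUATION_GROUPS}
    for residue in sequence:
        for group, members in FLUCTUATION_GROUPS.items():
            if residue in members:
                counts[group] += 1
                break
    total = len(sequence)
    return {group: count / total for group, count in counts.items()}


# Hypothetical eight-residue segment used only to exercise the function.
composition = fluctuation_composition("GASPILMF")
print(composition)
```

A segment rich in the first group and poor in the third would, following the hypothesis in the abstract, be a candidate for a flexible loop or disordered region; the fractions alone are of course not a prediction.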
Protein classification using modified n-grams and skip-grams.
Islam, S M Ashiqul; Heil, Benjamin J; Kearney, Christopher Michel; Baker, Erich J
2018-05-01
Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of attributes used to classify the proteins. Lack of this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG). A meta-comparison of cross-validation accuracy with twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy of m-NGSG when compared to contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and increase the accessibility of protein characteristic prediction to a broader range of scientists. m-NGSG is freely available at Bitbucket: https://bitbucket.org/sm_islam/mngsg/src. A web server is available at watson.ecs.baylor.edu/ngsg. Contact: erich_baker@baylor.edu. Supplementary data are available at Bioinformatics online.
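As a rough illustration of the kind of features involved, the sketch below extracts plain n-grams and simple skip-grams from a protein sequence and counts them. It does not implement the modifications that define m-NGSG, which the abstract does not detail; the skip-gram definition used here (two residues separated by exactly k skipped positions), the function names and the example sequence are all assumptions.

```python
from collections import Counter


def ngrams(sequence, n):
    """All contiguous substrings of length n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def skipgrams(sequence, k):
    """Residue pairs separated by exactly k skipped positions, e.g. 'M-V' for k=1."""
    gap = "-" * k
    return [
        f"{sequence[i]}{gap}{sequence[i + k + 1]}"
        for i in range(len(sequence) - k - 1)
    ]


def sequence_features(sequence, max_n=2, max_skip=1):
    """Count n-grams up to max_n and skip-grams up to max_skip in one Counter."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(sequence, n))
    for k in range(1, max_skip + 1):
        features.update(skipgrams(sequence, k))
    return features


# Hypothetical five-residue peptide.
features = sequence_features("MKVLK")
print(ngrams("MKVLK", 2))
print(skipgrams("MKVLK", 1))
print(features.most_common(3))
```

Each Counter can be converted into a fixed-length vector over a shared vocabulary and passed to any supervised classifier, which is the general role such features play in sequence-based classification.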
Kaur, Gurmeet; Subramanian, Srikrishna
2016-08-26
Treble clef (TC) zinc fingers constitute a large fold-group of structural zinc-binding protein domains that mediate numerous cellular functions. We have analysed the sequence, structure, and function relationships among all TCs in the Protein Data Bank. This led to the identification of novel TCs, such as lsr2, YggX and TFIIIC τ 60 kDa subunit, and prediction of a nuclease-like function for the DUF1364 family. The structural malleability of TCs is evident from the many examples with variations to the core structural elements of the fold. We observe domains wherein the structural core of the TC fold is circularly permuted, and also some examples where the overall fold resembles both the TC motif and another unrelated fold. All extant TC families do not share a monophyletic origin, as several TC proteins are known to have been present in the last universal common ancestor and the last eukaryotic common ancestor. We identify several TCs where the zinc-chelating site and residues are not merely responsible for structure stabilization but also perform other functions, such as being redox active in C1B domain of protein kinase C, a nucleophilic acceptor in Ada and catalytic in organomercurial lyase, MerB.
Bordner, Andrew J; Gorin, Andrey A
2008-05-12
Protein-protein interactions are ubiquitous and essential for all cellular processes. High-resolution X-ray crystallographic structures of protein complexes can reveal the details of their function and provide a basis for many computational and experimental approaches. Differentiation between biological and non-biological contacts and reconstruction of the intact complex is a challenging computational problem. A successful solution can provide additional insights into the fundamental principles of biological recognition and reduce errors in many algorithms and databases utilizing interaction information extracted from the Protein Data Bank (PDB). We have developed a method for identifying protein complexes in the PDB X-ray structures by a four-step procedure: (1) comprehensively collecting all protein-protein interfaces; (2) clustering similar protein-protein interfaces together; (3) estimating the probability that each cluster is relevant based on a diverse set of properties; and (4) combining these scores for each PDB entry in order to predict the complex structure. The resulting clusters of biologically relevant interfaces provide a reliable catalog of evolutionarily conserved protein-protein interactions. These interfaces, as well as the predicted protein complexes, are available from the Protein Interface Server (PInS) website (see Availability and requirements section). Our method demonstrates an almost two-fold reduction of the annotation error rate as evaluated on a large benchmark set of complexes validated from the literature. We also estimate relative contributions of each interface property to the accurate discrimination of biologically relevant interfaces and discuss possible directions for further improving the prediction method.
Ribosome-Inactivating and Related Proteins
Schrot, Joachim; Weng, Alexander; Melzig, Matthias F.
2015-01-01
Ribosome-inactivating proteins (RIPs) are toxins that act as N-glycosidases (EC 3.2.2.22). They are mainly produced by plants and classified as type 1 RIPs and type 2 RIPs. There are also RIPs and RIP-related proteins that cannot be grouped into the classical type 1 and type 2 RIPs because of their different sizes, structures or functions. In addition, there is still no uniform nomenclature or classification for RIPs. In this review, we give the current status of all known plant RIPs and suggest how to unify those RIPs and RIP-related proteins that cannot be classified as type 1 or type 2 RIPs. PMID:26008228
Soliton concepts and protein structure
NASA Astrophysics Data System (ADS)
Krokhotin, Andrei; Niemi, Antti J.; Peng, Xubiao
2012-03-01
Structural classification shows that the number of different protein folds is surprisingly small. It also appears that proteins are built in a modular fashion from a relatively small number of components. Here we propose that the modular building blocks are made of the dark soliton solution of a generalized discrete nonlinear Schrödinger equation. We find that practically all protein loops can be obtained simply by scaling the size and by joining together a number of copies of the soliton, one after another. The soliton has only two loop-specific parameters, and we compute their statistical distribution in the Protein Data Bank (PDB). We explicitly construct a collection of 200 sets of parameters, each determining a soliton profile that describes a different short loop. The ensuing profiles cover practically all those proteins in PDB that have a resolution which is better than 2.0 Å, with a precision such that the average root-mean-square distance between the loop and its soliton is less than the experimental B-factor fluctuation distance. We also present two examples that describe how the loop library can be employed both to model and to analyze folded proteins.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tirado-Lee, Leidamarie; Lee, Allen; Rees, Douglas C.
2014-10-02
molA (HI1472) from H. influenzae encodes a periplasmic binding protein (PBP) that delivers substrate to the ABC transporter MolB2C2 (formerly HI1470/71). The structures of MolA with molybdate and tungstate in the binding pocket were solved to 1.6 and 1.7 Å resolution, respectively. The MolA-binding protein binds molybdate and tungstate, but not other oxyanions such as sulfate and phosphate, making it the first class III molybdate-binding protein structurally solved. The ~100 μM binding affinity for tungstate and molybdate is significantly lower than observed for the class II ModA molybdate-binding proteins that have nanomolar to low micromolar affinity for molybdate. The presence of two molybdate loci in H. influenzae suggests multiple transport systems for one substrate, with molABC constituting a low-affinity molybdate locus.
Zbilut, Joseph P.; Colosimo, Alfredo; Conti, Filippo; Colafranceschi, Mauro; Manetti, Cesare; Valerio, MariaCristina; Webber, Charles L.; Giuliani, Alessandro
2003-01-01
The problem of protein folding vs. aggregation was investigated in acylphosphatase and the amyloid protein Aβ(1–40) by means of nonlinear signal analysis of their chain hydrophobicity. Numerical descriptors of recurrence patterns provided the basis for statistical evaluation of folding/aggregation distinctive features. Static and dynamic approaches were used to elucidate conditions coincident with folding vs. aggregation using comparisons with known protein secondary structure classifications, site-directed mutagenesis studies of acylphosphatase, and molecular dynamics simulations of amyloid protein, Aβ(1–40). The results suggest that a feature derived from principal component space characterized by the smoothness of singular, deterministic hydrophobicity patches plays a significant role in the conditions governing protein aggregation. PMID:14645049
Soft Computing Methods for Disulfide Connectivity Prediction.
Márquez-Chamorro, Alfonso E; Aguilar-Ruiz, Jesús S
2015-01-01
The problem of protein structure prediction (PSP) is one of the main challenges in structural bioinformatics. To tackle this problem, PSP can be divided into several subproblems. One of these subproblems is the prediction of disulfide bonds. The disulfide connectivity prediction problem consists in identifying which nonadjacent cysteines would be cross-linked from all possible candidates. Determining the disulfide bond connectivity between the cysteines of a protein is desirable as a preliminary step of 3D PSP, as the protein conformational search space is highly reduced. The most representative soft computing approaches of the last decade for the disulfide bond connectivity prediction problem are summarized in this paper. Certain aspects, such as the different methodologies based on soft computing approaches (artificial neural network or support vector machine) or features of the algorithms, are used for the classification of these methods.
Integrated proteomic and transcriptomic analysis of the Aedes aegypti eggshell
2014-01-01
Background Mosquito eggshells show remarkable diversity in physical properties and structure consistent with adaptations to the wide variety of environments exploited by these insects. We applied proteomic, transcriptomic, and hybridization in situ techniques to identify gene products and pathways that participate in the assembly of the Aedes aegypti eggshell. Aedes aegypti population density is low during cold and dry seasons and increases immediately after rainfall. The survival of embryos through unfavorable periods is a key factor in the persistence of their populations. The work described here supports integrated vector control approaches that target eggshell formation and result in Ae. aegypti drought-intolerant phenotypes for public health initiatives directed to reduce mosquito-borne diseases. Results A total of 130 proteins were identified from the combined mass spectrometric analyses of eggshell preparations. Conclusions Classification of proteins according to their known and putative functions revealed the complexity of the eggshell structure. Three novel Ae. aegypti vitelline membrane proteins were discovered. Odorant-binding and cysteine-rich proteins that may be structural components of the eggshell were identified. Enzymes with peroxidase, laccase and phenoloxidase activities also were identified, and their likely involvements in cross-linking reactions that stabilize the eggshell structure are discussed. PMID:24707823
Introduction to bioinformatics.
Can, Tolga
2014-01-01
Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science, mathematics, and statistics. Data intensive, large-scale biological problems are addressed from a computational point of view. The most common problems are modeling biological processes at the molecular level and making inferences from collected data. A bioinformatics solution usually involves the following steps: Collect statistics from biological data. Build a computational model. Solve a computational modeling problem. Test and evaluate a computational algorithm. This chapter gives a brief introduction to bioinformatics by first providing an introduction to biological terminology and then discussing some classical bioinformatics problems organized by the types of data sources. Sequence analysis is the analysis of DNA and protein sequences for clues regarding function and includes subproblems such as identification of homologs, multiple sequence alignment, searching sequence patterns, and evolutionary analyses. Protein structures are three-dimensional data and the associated problems are structure prediction (secondary and tertiary), analysis of protein structures for clues regarding function, and structural alignment. Gene expression data is usually represented as matrices and analysis of microarray data mostly involves statistics analysis, classification, and clustering approaches. Biological networks such as gene regulatory networks, metabolic pathways, and protein-protein interaction networks are usually modeled as graphs and graph theoretic approaches are used to solve associated problems such as construction and analysis of large-scale networks.
A Model Comparison for Characterizing Protein Motions from Structure
NASA Astrophysics Data System (ADS)
David, Charles; Jacobs, Donald
2011-10-01
A comparative study is made using three computational models that characterize native state dynamics starting from known protein structures taken from four distinct SCOP classifications. A geometrical simulation is performed, and the results are compared to the elastic network model and molecular dynamics. The essential dynamics is quantified by a direct analysis of a mode subspace constructed from ANM and a principal component analysis on both the FRODA and MD trajectories using root mean square inner product and principal angles. Relative subspace sizes and overlaps are visualized using the projection of displacement vectors on the model modes. Additionally, a mode subspace is constructed from PCA on an exemplar set of X-ray crystal structures in order to determine similarity with respect to the generated ensembles. Quantitative analysis reveals there is significant overlap across the three model subspaces and the model independent subspace. These results indicate that structure is the key determinant for native state dynamics.
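The two subspace measures named in this abstract, root mean square inner product (RMSIP) and principal angles, can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the authors' code: the function names are hypothetical, and the toy vectors stand in for ANM, FRODA, or MD modes, which are not reproduced here.

```python
import numpy as np

def orthonormal_basis(vectors):
    """Return an orthonormal basis (as columns) for the span of the input columns."""
    q, _ = np.linalg.qr(vectors)
    return q

def rmsip(u, v):
    """Root mean square inner product between two k-dimensional subspaces.

    u, v: arrays of shape (n, k) whose columns are orthonormal mode vectors.
    Returns 1.0 for identical subspaces and 0.0 for orthogonal ones.
    """
    k = u.shape[1]
    overlaps = u.T @ v  # k x k matrix of pairwise inner products
    return float(np.sqrt(np.sum(overlaps ** 2) / k))

def principal_angles(u, v):
    """Principal angles in radians between the column spaces of u and v."""
    singular_values = np.linalg.svd(u.T @ v, compute_uv=False)
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(singular_values, -1.0, 1.0))

# Toy data (assumption): random vectors in place of real mode sets.
rng = np.random.default_rng(0)
modes_a = orthonormal_basis(rng.normal(size=(30, 5)))
modes_b = orthonormal_basis(rng.normal(size=(30, 5)))
print(rmsip(modes_a, modes_b))
print(np.degrees(principal_angles(modes_a, modes_b)))
```

Both measures depend only on the subspaces and not on the particular basis chosen, which is why they suit comparisons between mode sets produced by different models.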
HAMAP in 2013, new developments in the protein family classification and annotation system
Pedruzzi, Ivo; Rivoire, Catherine; Auchincloss, Andrea H.; Coudert, Elisabeth; Keller, Guillaume; de Castro, Edouard; Baratin, Delphine; Cuche, Béatrice A.; Bougueleret, Lydie; Poux, Sylvain; Redaschi, Nicole; Xenarios, Ioannis; Bridge, Alan
2013-01-01
HAMAP (High-quality Automated and Manual Annotation of Proteins—available at http://hamap.expasy.org/) is a system for the classification and annotation of protein sequences. It consists of a collection of manually curated family profiles for protein classification, and associated annotation rules that specify annotations that apply to family members. HAMAP was originally developed to support the manual curation of UniProtKB/Swiss-Prot records describing microbial proteins. Here we describe new developments in HAMAP, including the extension of HAMAP to eukaryotic proteins, the use of HAMAP in the automated annotation of UniProtKB/TrEMBL, providing high-quality annotation for millions of protein sequences, and the future integration of HAMAP into a unified system for UniProtKB annotation, UniRule. HAMAP is continuously updated by expert curators with new family profiles and annotation rules as new protein families are characterized. The collection of HAMAP family classification profiles and annotation rules can be browsed and viewed on the HAMAP website, which also provides an interface to scan user sequences against HAMAP profiles. PMID:23193261
Improving Protein Fold Recognition by Deep Learning Networks.
Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin
2015-12-04
For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl's benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed comparable results at the family and superfamily levels compared to ensemble DN-Fold. Finally, we extended the binary classification problem of fold recognition to a real-value regression task, which also showed promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.
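The Top 1 and Top 5 recognition rates quoted above are ranking metrics. A minimal sketch of how such a rate is commonly computed is given below, assuming each query comes with a ranked list of templates and a set of correct templates. The function name and the toy identifiers are hypothetical and are not taken from DN-Fold or the Lindahl benchmark.

```python
def top_k_rate(ranked_templates, true_matches, k):
    """Fraction of queries with at least one correct template in the top k.

    ranked_templates: dict mapping query id -> list of template ids, best first.
    true_matches: dict mapping query id -> set of correct template ids.
    """
    hits = 0
    for query, ranking in ranked_templates.items():
        if any(template in true_matches[query] for template in ranking[:k]):
            hits += 1
    return hits / len(ranked_templates)

# Toy data (assumption): three queries, three ranked templates each.
rankings = {
    "q1": ["t3", "t1", "t2"],
    "q2": ["t2", "t4", "t1"],
    "q3": ["t5", "t6", "t7"],
}
truth = {"q1": {"t1"}, "q2": {"t2"}, "q3": {"t9"}}

print(top_k_rate(rankings, truth, k=1))
print(top_k_rate(rankings, truth, k=3))
```

Because a larger k can only add hits, the Top 5 rate is never lower than the Top 1 rate, which matches the pattern of the figures reported in the abstract.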
Zhang, Tong-Liang; Ding, Yong-Sheng; Chou, Kuo-Chen
2008-01-07
Compared with the conventional amino acid (AA) composition, the pseudo-amino acid (PseAA) composition as originally introduced for protein subcellular location prediction can incorporate much more information of a protein sequence, so as to remarkably enhance the power of using a discrete model to predict various attributes of a protein. In this study, based on the concept of PseAA composition, the approximate entropy and hydrophobicity pattern of a protein sequence are used to characterize the PseAA components. Also, the immune genetic algorithm (IGA) is applied to search the optimal weight factors in generating the PseAA composition. Thus, for a given protein sequence sample, a 27-D (dimensional) PseAA composition is generated as its descriptor. The fuzzy K nearest neighbors (FKNN) classifier is adopted as the prediction engine. The results thus obtained in predicting protein structural classification are quite encouraging, indicating that the current approach may also be used to improve the prediction quality of other protein attributes, or at least can play a complementary role to the existing methods in the relevant areas. Our algorithm is written in Matlab and is available by contacting the corresponding author.
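The fuzzy K nearest neighbors step can be illustrated with a short sketch. It follows the commonly used inverse-distance weighting form of fuzzy KNN with crisp training labels, which is an assumption here since the abstract does not specify the variant. The function name, the fuzzifier value m, and the 2-D toy descriptors (in place of the 27-D PseAA vectors) are all hypothetical.

```python
import numpy as np

def fuzzy_knn_memberships(train_x, train_y, query, n_classes, k=3, m=2.0):
    """Class memberships of `query` from its k nearest training samples.

    Each neighbor votes for its own class with weight 1 / d^(2 / (m - 1)),
    where d is its Euclidean distance to the query and m > 1 is the fuzzifier.
    The returned memberships sum to 1.
    """
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]
    # A small epsilon avoids division by zero when the query equals a sample.
    weights = 1.0 / (distances[nearest] ** (2.0 / (m - 1.0)) + 1e-12)
    memberships = np.zeros(n_classes)
    for index, weight in zip(nearest, weights):
        memberships[train_y[index]] += weight
    return memberships / weights.sum()

# Toy data (assumption): two well-separated classes in two dimensions.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
train_y = np.array([0, 0, 1, 1])
query = np.array([0.2, 0.1])

memberships = fuzzy_knn_memberships(train_x, train_y, query, n_classes=2)
print(memberships, int(np.argmax(memberships)))
```

Unlike a plain majority vote, the graded memberships show how confident the assignment is, which is useful when a protein sits between structural classes.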
Keef, Thomas; Wardman, Jessica P.; Ranson, Neil A.; Stockley, Peter G.; Twarock, Reidun
2013-01-01
Understanding the fundamental principles of virus architecture is one of the most important challenges in biology and medicine. Crick and Watson were the first to propose that viruses exhibit symmetry in the organization of their protein containers for reasons of genetic economy. Based on this, Caspar and Klug introduced quasi-equivalence theory to predict the relative locations of the coat proteins within these containers and classified virus structure in terms of T-numbers. Here it is shown that quasi-equivalence is part of a wider set of structural constraints on virus structure. These constraints can be formulated using an extension of the underlying symmetry group and this is demonstrated with a number of case studies. This new concept in virus biology provides for the first time predictive information on the structural constraints on coat protein and genome topography, and reveals a previously unrecognized structural interdependence of the shapes and sizes of different viral components. It opens up the possibility of distinguishing the structures of different viruses with the same T-number, suggesting a refined viral structure classification scheme. It can moreover be used as a basis for models of virus function, e.g. to characterize the start and end configurations of a structural transition important for infection. PMID:23403965
Sun, Chia-Tsen; Chiang, Austin W T; Hwang, Ming-Jing
2017-10-27
Proteome-scale bioinformatics research is increasingly conducted as the number of completely sequenced genomes increases, but analysis of protein domains (PDs) usually relies on similarity in their amino acid sequences and/or three-dimensional structures. Here, we present results from a bi-clustering analysis on presence/absence data for 6,580 unique PDs in 2,134 species with a sequenced genome, thus covering a complete set of proteins, for the three superkingdoms of life, Bacteria, Archaea, and Eukarya. Our analysis revealed eight distinctive PD clusters, which, following an analysis of enrichment of Gene Ontology functions and CATH classification of protein structures, were shown to exhibit structural and functional properties that are taxa-characteristic. For example, the largest cluster is ubiquitous in all three superkingdoms, constituting a set of 1,472 persistent domains created early in evolution and retained in living organisms and characterized by basic cellular functions and ancient structural architectures, while an Archaea and Eukarya bi-superkingdom cluster suggests its PDs may have existed in the ancestor of the two superkingdoms, and others are single superkingdom- or taxa (e.g. Fungi)-specific. These results increase our appreciation of PD diversity and our knowledge of how PDs are used in species, with implications for species evolution.
Bietz, Stefan; Inhester, Therese; Lauck, Florian; Sommer, Kai; von Behren, Mathias M; Fährrolfes, Rainer; Flachsenberg, Florian; Meyder, Agnes; Nittinger, Eva; Otto, Thomas; Hilbig, Matthias; Schomburg, Karen T; Volkamer, Andrea; Rarey, Matthias
2017-11-10
Nowadays, computational approaches are an integral part of life science research. Problems related to interpretation of experimental results, data analysis, or visualization tasks highly benefit from the achievements of the digital era. Simulation methods facilitate predictions of physicochemical properties and can assist in understanding macromolecular phenomena. Here, we will give an overview of the methods developed in our group that aim at supporting researchers from all life science areas. Based on state-of-the-art approaches from structural bioinformatics and cheminformatics, we provide software covering a wide range of research questions. Our all-in-one web service platform ProteinsPlus (http://proteins.plus) offers solutions for pocket and druggability prediction, hydrogen placement, structure quality assessment, ensemble generation, protein-protein interaction classification, and 2D-interaction visualization. Additionally, we provide a software package that contains tools targeting cheminformatics problems like file format conversion, molecule data set processing, SMARTS editing, fragment space enumeration, and ligand-based virtual screening. Furthermore, it also includes structural bioinformatics solutions for inverse screening, binding site alignment, and searching interaction patterns across structure libraries. The software package is available at http://software.zbh.uni-hamburg.de. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.
Structure-Functional Prediction and Analysis of Cancer Mutation Effects in Protein Kinases
Dixit, Anshuman; Verkhivker, Gennady M.
2014-01-01
A central goal of cancer research is to discover and characterize the functional effects of mutated genes that contribute to tumorigenesis. In this study, we provide a detailed structural classification and analysis of functional dynamics for members of protein kinase families that are known to harbor cancer mutations. We also present a systematic computational analysis that combines sequence and structure-based prediction models to characterize the effect of cancer mutations in protein kinases. We focus on the differential effects of activating point mutations that increase protein kinase activity and kinase-inactivating mutations that decrease activity. Mapping of cancer mutations onto the conformational mobility profiles of known crystal structures demonstrated that activating mutations could reduce a steric barrier for the movement from the basal “low” activity state to the “active” state. According to our analysis, the mechanism of activating mutations reflects a combined effect of partial destabilization of the kinase in its inactive state and a concomitant stabilization of its active-like form, which is likely to drive tumorigenesis at some level. Ultimately, the analysis of the evolutionary and structural features of the major cancer-causing mutational hotspot in kinases can also aid in the correlation of kinase mutation effects with clinical outcomes. PMID:24817905
PDBe: Protein Data Bank in Europe
Velankar, Sameer; Alhroub, Younes; Alili, Anaëlle; Best, Christoph; Boutselakis, Harry C.; Caboche, Ségolène; Conroy, Matthew J.; Dana, Jose M.; van Ginkel, Glen; Golovin, Adel; Gore, Swanand P.; Gutmanas, Aleksandras; Haslam, Pauline; Hirshberg, Miriam; John, Melford; Lagerstedt, Ingvar; Mir, Saqib; Newman, Laurence E.; Oldfield, Tom J.; Penkett, Chris J.; Pineda-Castillo, Jorge; Rinaldi, Luana; Sahni, Gaurav; Sawka, Grégoire; Sen, Sanchayita; Slowley, Robert; Sousa da Silva, Alan Wilter; Suarez-Uruena, Antonio; Swaminathan, G. Jawahar; Symmons, Martyn F.; Vranken, Wim F.; Wainwright, Michael; Kleywegt, Gerard J.
2011-01-01
The Protein Data Bank in Europe (PDBe; pdbe.org) is actively involved in managing the international archive of biomacromolecular structure data as one of the partners in the Worldwide Protein Data Bank (wwPDB; wwpdb.org). PDBe also develops new tools to make structural data more widely and more easily available to the biomedical community. PDBe has developed a browser to access and analyze the structural archive using classification systems that are familiar to chemists and biologists. The PDBe web pages that describe individual PDB entries have been enhanced through the introduction of plain-English summary pages and iconic representations of the contents of an entry (PDBprints). In addition, the information available for structures determined by means of NMR spectroscopy has been expanded. Finally, the entire web site has been redesigned to make it substantially easier to use for expert and novice users alike. PDBe works closely with other teams at the European Bioinformatics Institute (EBI) and in the international scientific community to develop new resources with value-added information. The SIFTS initiative is an example of such a collaboration—it provides extensive mapping data between proteins whose structures are available from the PDB and a host of other biomedical databases. SIFTS is widely used by major bioinformatics resources. PMID:21045060
Liu, Ping-Li; Du, Liang; Huang, Yuan; Gao, Shu-Min; Yu, Meng
2017-02-07
Leucine-rich repeat receptor-like protein kinases (LRR-RLKs) are the largest group of receptor-like kinases in plants and play crucial roles in development and stress responses. The evolutionary relationships among LRR-RLK genes have been investigated in flowering plants; however, no comprehensive studies have been performed for these genes in more ancestral groups. The subfamily classification of LRR-RLK genes in plants, the evolutionary history and driving force for the evolution of each LRR-RLK subfamily remain to be understood. We identified 119 LRR-RLK genes in the Physcomitrella patens moss genome, 67 LRR-RLK genes in the Selaginella moellendorffii lycophyte genome, and no LRR-RLK genes in five green algae genomes. Furthermore, these LRR-RLK sequences, along with previously reported LRR-RLK sequences from Arabidopsis thaliana and Oryza sativa, were subjected to evolutionary analyses. Phylogenetic analyses revealed that plant LRR-RLKs belong to 19 subfamilies, eighteen of which were established in early land plants, and one of which evolved in flowering plants. More importantly, we found that the basic structures of LRR-RLK genes for most subfamilies are established in early land plants and conserved within subfamilies and across different plant lineages, but divergent among subfamilies. In addition, most members of the same subfamily had common protein motif compositions, whereas members of different subfamilies showed variations in protein motif compositions. The unique gene structure and protein motif compositions of each subfamily differentiate the subfamily classifications and, more importantly, provide evidence for functional divergence among LRR-RLK subfamilies. Maximum likelihood analyses showed that some sites within four subfamilies were under positive selection. Much of the diversity of plant LRR-RLK genes was established in early land plants. Positive selection contributed to the evolution of a few LRR-RLK subfamilies.
A framework for classification of prokaryotic protein kinases.
Tyagi, Nidhi; Anamika, Krishanpal; Srinivasan, Narayanaswamy
2010-05-26
An overwhelming majority of the Serine/Threonine protein kinases identified by gleaning archaeal and eubacterial genomes could not be classified into any of the well-known Hanks and Hunter subfamilies of protein kinases. This is because the Hanks and Hunter classification scheme was developed from eukaryotic protein kinases, which are highly divergent from their prokaryotic homologues. A large dataset of prokaryotic Serine/Threonine protein kinases recognized from genomes of prokaryotes has been used to develop a classification framework for prokaryotic Ser/Thr protein kinases. We have used traditional sequence alignment and phylogenetic approaches and clustered the prokaryotic kinases, which represent 72 subfamilies with at least 4 members in each. Such a clustering enables classification of prokaryotic Ser/Thr kinases and it can be used as a framework to classify newly identified prokaryotic Ser/Thr kinases. After a series of searches in a comprehensive sequence database we recognized that 38 subfamilies of prokaryotic protein kinases are associated with a specific taxonomic level. For example, 4, 6 and 3 subfamilies have been identified that are currently specific to the phyla proteobacteria, cyanobacteria and actinobacteria, respectively. Similarly, subfamilies which are specific to an order, sub-order, class, family and genus have also been identified. In addition to these, we also identify organism-diverse subfamilies. Members of these clusters are from organisms of different taxonomic levels, such as archaea, bacteria, eukaryotes and viruses. Interestingly, the occurrence of several taxonomic level specific subfamilies of prokaryotic kinases contrasts with the classification of eukaryotic protein kinases, in which most of the popular subfamilies occur diversely in several eukaryotes. Many prokaryotic Ser/Thr kinases exhibit a wide variety of modular organization, which indicates a degree of complexity and protein-protein interactions in the signaling pathways in these microbes.
Graph pyramids for protein function prediction
Sandhan, Tushar; Yoo, Youngjun; Choi, Jin; Kim, Sun
2015-01-01
Background Uncovering the hidden organizational characteristics and regularities among biological sequences is the key issue for detailed understanding of an underlying biological phenomenon. Thus pattern recognition from nucleic acid sequences is an important affair for protein function prediction. As proteins from the same family exhibit similar characteristics, homology based approaches predict protein functions via protein classification. But conventional classification approaches mostly rely on the global features by considering only strong protein similarity matches. This leads to significant loss of prediction accuracy. Methods Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers the local as well as the global features, by examining the interactions among 'weakly interacting proteins' in the PPS network and by using hierarchical graph analysis via the graph pyramid. Different underlying properties of the protein families are uncovered by operating the proposed graph based features at various pyramid levels. Results Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using graph pyramid helps to improve computational efficiency as well as the protein classification accuracy. Quantitatively, among 14,086 test sequences, on average the proposed method misclassified only 21.1 sequences whereas the baseline BLAST score based global feature matching method misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. Thus it has achieved more than 96% protein classification accuracy using only 20% per class training data. PMID:26044522
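The misclassification counts quoted in this abstract can be turned into error rates with a few lines of arithmetic. The sketch below uses only the three figures given in the abstract (14,086 test sequences, 21.1 and 362.9 average misclassifications). The variable names are illustrative, and the separate "more than 96%" figure refers to a different training setting that is not recomputed here.

```python
n_test = 14086  # test sequences, from the abstract
errors = {"graph pyramid": 21.1, "BLAST baseline": 362.9}  # average misclassified

# Error rate for each method as a fraction of the test set.
rates = {name: count / n_test for name, count in errors.items()}
# How many times more errors the baseline makes than the proposed method.
ratio = errors["BLAST baseline"] / errors["graph pyramid"]

for name, rate in rates.items():
    print(f"{name}: {100 * rate:.2f}% error")
print(f"baseline makes about {ratio:.1f} times as many errors")
```

On these figures the proposed method errs on roughly 0.15% of test sequences and the baseline on roughly 2.58%, a reduction of about seventeen-fold.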
Wlodarski, Tomasz; Kutner, Jan; Towpik, Joanna; Knizewski, Lukasz; Rychlewski, Leszek; Kudlicki, Andrzej; Rowicka, Maga; Dziembowski, Andrzej; Ginalski, Krzysztof
2011-01-01
Methylation is one of the most common chemical modifications of biologically active molecules and it occurs in all life forms. Its functional role is very diverse and involves many essential cellular processes, such as signal transduction, transcriptional control, biosynthesis, and metabolism. Here, we provide further insight into the enzymatic methylation in S. cerevisiae by conducting a comprehensive structural and functional survey of all the methyltransferases encoded in its genome. Using distant homology detection and fold recognition, we found that the S. cerevisiae methyltransferome comprises 86 MTases (53 well-known and 33 putative with unknown substrate specificity). Structural classification of their catalytic domains shows that these enzymes may adopt nine different folds, the most common being the Rossmann-like. We also analyzed the domain architecture of these proteins and identified several new domain contexts. Interestingly, we found that the majority of MTase genes are periodically expressed during yeast metabolic cycle. This finding, together with calculated isoelectric point, fold assignment and cellular localization, was used to develop a novel approach for predicting substrate specificity. Using this approach, we predicted the general substrates for 24 of 33 putative MTases and confirmed these predictions experimentally in both cases tested. Finally, we show that, in S. cerevisiae, methylation is carried out by 34 RNA MTases, 32 protein MTases, eight small molecule MTases, three lipid MTases, and nine MTases with still unknown substrate specificity.
Dual host specificity of phage SP6 is facilitated by tailspike rotation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tu, Jiagang
Bacteriophage SP6 exhibits dual-host adsorption specificity. The SP6 tailspikes are recognized as important in host range determination but the mechanisms underlying dual host specificity are unknown. Cryo-electron tomography and sub-tomogram classification were used to analyze the SP6 virion with a particular focus on the interaction of tailspikes with host membranes. The SP6 tail is surrounded by six V-shaped structures that interconnect in forming a hand-over-hand hexameric garland. Each V-shaped structure consists of two trimeric tailspike proteins: gp46 and gp47, connected through the adaptor protein gp37. SP6 infection of Salmonella enterica serovars Typhimurium and Newport results in distinguishable changes in tailspike orientation, providing the first direct demonstration of how tailspikes can confer dual host adsorption specificity. SP6 also infects S. Typhimurium strains lacking O antigen; in these infections tailspikes have no apparent specific role and the phage tail must therefore interact with a distinct host receptor to allow infection. Highlights: • Cryo-electron tomography reveals the structural basis for dual host specificity. • Sub-tomogram classification reveals distinct orientations of the tailspikes during infection of different hosts. • Tailspike-adaptor modules rotate as they bind different O antigens. • In the absence of any O antigen, tailspikes bind weakly and without specificity to LPS. • Interaction of the phage tail with LPS is essential for infection.
NASA Astrophysics Data System (ADS)
Jaenisch, Holger M.; Handley, James W.
2010-04-01
Malware are analogs of viruses. Viruses are comprised of large numbers of polypeptide proteins. The shape and function of the protein strands determines the functionality of the segment, similar to a subroutine in malware. The full combination of subroutines is the malware organism, in analogous fashion as a collection of polypeptides forms protein structures that are information bearing. We propose to apply the methods of Bioinformatics to analyze malware to provide a rich feature set for creating a unique and novel detection and classification scheme that is originally applied to Bioinformatics amino acid sequencing. Our proposed methods enable real time in situ (in contrast to in vivo) detection applications.
Kinase Pathway Database: An Integrated Protein-Kinase and NLP-Based Protein-Interaction Resource
Koike, Asako; Kobayashi, Yoshiyuki; Takagi, Toshihisa
2003-01-01
Protein kinases play a crucial role in the regulation of cellular functions. Various kinds of information about these molecules are important for understanding signaling pathways and organism characteristics. We have developed the Kinase Pathway Database, an integrated database involving major completely sequenced eukaryotes. It contains the classification of protein kinases and their functional conservation, ortholog tables among species, protein–protein, protein–gene, and protein–compound interaction data, domain information, and structural information. It also provides an automatic pathway graphic image interface. The protein, gene, and compound interactions are automatically extracted from abstracts for all genes and proteins by natural-language processing (NLP). The method of automatic extraction uses phrase patterns and the GENA protein, gene, and compound name dictionary, which was developed by our group. With this database, pathways are easily compared among species using data with more than 47,000 protein interactions and protein kinase ortholog tables. The database is available for querying and browsing at http://kinasedb.ontology.ims.u-tokyo.ac.jp/. PMID:12799355
Feature generation and representations for protein-protein interaction classification.
Lan, Man; Tan, Chew Lim; Su, Jian
2009-10-01
Automatically detecting articles relevant to protein-protein interaction (PPI) is a crucial step for large-scale biological database curation. Previous work adopted POS tagging, shallow parsing and sentence splitting techniques, but these achieved worse performance than the simple bag-of-words representation. In this paper, we generated and investigated multiple types of feature representations in order to further improve the performance of the PPI text classification task. Besides the traditional domain-independent bag-of-words approach and the term weighting methods, we also explored other domain-dependent features, i.e. protein-protein interaction trigger keywords, protein named entities and the advanced ways of incorporating Natural Language Processing (NLP) output. The integration of these multiple features has been evaluated on the BioCreAtIvE II corpus. The experimental results showed that both the advanced way of using NLP output and the integration of bag-of-words and NLP output improved the performance of text classification. Specifically, in comparison with the best performance achieved in the BioCreAtIvE II IAS, the feature-level and classifier-level integration of multiple features improved the performance of classification by 2.71% and 3.95%, respectively.
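As a rough illustration of the feature types named in this abstract, the sketch below builds a term-weighted bag-of-words vector and a trigger-keyword count for a few toy sentences. It is not the authors' pipeline: the tokenizer, the TF-IDF weighting variant, the keyword list and the placeholder protein names are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical trigger keywords; the abstract does not list the ones used.
TRIGGER_KEYWORDS = {"interacts", "binds", "phosphorylates"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in term_freq.items()}
        )
    return vectors


def trigger_count(text):
    """Domain-dependent feature: number of trigger keywords in the text."""
    return sum(1 for t in tokenize(text) if t in TRIGGER_KEYWORDS)


docs = [
    "ProteinA binds ProteinB",
    "The cell cycle is regulated",
    "ProteinB interacts with ProteinC",
]
vectors = tfidf_vectors(docs)
triggers = [trigger_count(d) for d in docs]
print(vectors[0])
print(triggers)
```

Feature-level integration, in this simplified picture, would amount to appending the trigger count (and any named-entity features) to the bag-of-words vector before training a single classifier.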
Goessweiner-Mohr, Nikolaus; Grumet, Lukas; Arends, Karsten; Pavkov-Keller, Tea; Gruber, Christian C.; Gruber, Karl; Birner-Gruenberger, Ruth; Kropec-Huebner, Andrea; Huebner, Johannes; Grohmann, Elisabeth; Keller, Walter
2013-01-01
Conjugative plasmid transfer is the most important means of spreading antibiotic resistance and virulence genes among bacteria and therefore presents a serious threat to human health. The process requires direct cell-cell contact made possible by a multiprotein complex that spans cellular membranes and serves as a channel for macromolecular secretion. Thus far, well studied conjugative type IV secretion systems (T4SS) are of Gram-negative (G−) origin. Although many medically relevant pathogens (e.g., enterococci, staphylococci, and streptococci) are Gram-positive (G+), their conjugation systems have received little attention. This study provides structural information for the transfer protein TraM of the G+ broad host range Enterococcus conjugative plasmid pIP501. Immunolocalization demonstrated that the protein localizes to the cell wall. We then used opsonophagocytosis as a novel tool to verify that TraM was exposed on the cell surface. In these assays, antibodies generated to TraM recruited macrophages and enabled killing of pIP501-harboring Enterococcus faecalis cells. The crystal structure of the C-terminal, surface-exposed domain of TraM was determined to 2.5 Å resolution. The structure, molecular dynamics, and cross-linking studies indicated that a TraM trimer acts as the biological unit. Despite the absence of sequence-based similarity, TraM unexpectedly displayed a fold similar to the T4SS VirB8 proteins from Agrobacterium tumefaciens and Brucella suis (G−) and to the transfer protein TcpC from Clostridium perfringens plasmid pCW3 (G+). Based on the alignments of secondary structure elements of VirB8-like proteins from mobile genetic elements and chromosomally encoded T4SS from G+ and G− bacteria, we propose a new classification scheme of VirB8-like proteins. PMID:23188825
Pasquier, C; Promponas, V J; Hamodrakas, S J
2001-08-15
A cascading system of hierarchical, artificial neural networks (named PRED-CLASS) is presented for the generalized classification of proteins into four distinct classes (transmembrane, fibrous, globular, and mixed) from information solely encoded in their amino acid sequences. The architecture of the individual component networks is kept very simple, reducing the number of free parameters (network synaptic weights) for faster training, improved generalization, and the avoidance of data overfitting. Capturing information from as few as 50 protein sequences spread among the four target classes (6 transmembrane, 10 fibrous, 13 globular, and 17 mixed), PRED-CLASS was able to obtain 371 correct predictions out of a set of 387 proteins (success rate approximately 96%) unambiguously assigned into one of the target classes. The application of PRED-CLASS to several test sets and complete proteomes of several organisms demonstrates that such a method could serve as a valuable tool in the annotation of genomic open reading frames with no functional assignment or as a preliminary step in fold recognition and ab initio structure prediction methods. Detailed results obtained for various data sets and completed genomes, along with a web server running the PRED-CLASS algorithm, can be accessed over the World Wide Web at http://o2.biol.uoa.gr/PRED-CLASS.
Classification of Domain Movements in Proteins Using Dynamic Contact Graphs
Taylor, Daniel; Cawley, Gavin; Hayward, Steven
2013-01-01
A new method for the classification of domain movements in proteins is described and applied to 1822 pairs of structures from the Protein Data Bank that represent a domain movement in two-domain proteins. The method is based on changes in contacts between residues from the two domains in moving from one conformation to the other. We argue that there are five types of elemental contact changes and that these relate to five model domain movements called: “free”, “open-closed”, “anchored”, “sliding-twist”, and “see-saw.” A directed graph is introduced called the “Dynamic Contact Graph” which represents the contact changes in a domain movement. In many cases a graph, or part of a graph, provides a clear visual metaphor for the movement it represents and is a motif that can be easily recognised. The Dynamic Contact Graphs are often comprised of disconnected subgraphs indicating independent regions which may play different roles in the domain movement. The Dynamic Contact Graph for each domain movement is decomposed into elemental Dynamic Contact Graphs, those that represent elemental contact changes, allowing us to count the number of instances of each type of elemental contact change in the domain movement. This naturally leads to sixteen classes into which the 1822 domain movements are classified. PMID:24260562
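The basic ingredient of this method, the change in inter-domain residue contacts between two conformations, can be illustrated with a small sketch. The code below is a simplification under stated assumptions: each residue is reduced to one representative coordinate, the 4.0 Å cutoff is an arbitrary illustrative value, and contact pairs are only sorted into maintained, broken and formed sets. It does not reproduce the paper's five elemental contact changes or the Dynamic Contact Graph itself.

```python
from itertools import product
from math import dist

CUTOFF = 4.0  # Å; illustrative value, not taken from the abstract


def interdomain_contacts(domain_a, domain_b, cutoff=CUTOFF):
    """Return the set of (residue_a, residue_b) pairs within the cutoff.

    Each domain is a dict mapping a residue id to one (x, y, z) coordinate.
    """
    return {
        (res_a, res_b)
        for (res_a, xyz_a), (res_b, xyz_b) in product(
            domain_a.items(), domain_b.items()
        )
        if dist(xyz_a, xyz_b) <= cutoff
    }


def contact_changes(contacts_1, contacts_2):
    """Compare the contact sets of conformation 1 and conformation 2."""
    return {
        "maintained": contacts_1 & contacts_2,
        "broken": contacts_1 - contacts_2,
        "formed": contacts_2 - contacts_1,
    }


# Toy two-domain protein: domain A is fixed, residue B2 of domain B moves in.
domain_a = {"A1": (0.0, 0.0, 0.0), "A2": (5.0, 0.0, 0.0)}
domain_b_conf1 = {"B1": (3.0, 0.0, 0.0), "B2": (20.0, 0.0, 0.0)}
domain_b_conf2 = {"B1": (3.0, 0.0, 0.0), "B2": (8.0, 0.0, 0.0)}

changes = contact_changes(
    interdomain_contacts(domain_a, domain_b_conf1),
    interdomain_contacts(domain_a, domain_b_conf2),
)
print(changes)
```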
Kovacs, Gabor G
2016-02-02
Neurodegenerative diseases (NDDs) are characterized by selective dysfunction and loss of neurons associated with pathologically altered proteins that deposit in the human brain but also in peripheral organs. These proteins and their biochemical modifications can be potentially targeted for therapy or used as biomarkers. Despite a plethora of modifications demonstrated for different neurodegeneration-related proteins, such as amyloid-β, prion protein, tau, α-synuclein, TAR DNA-binding protein 43 (TDP-43), or fused in sarcoma protein (FUS), molecular classification of NDDs relies on detailed morphological evaluation of protein deposits, their distribution in the brain, and their correlation to clinical symptoms together with specific genetic alterations. A further facet of the neuropathology-based classification is the fact that many protein deposits show a hierarchical involvement of brain regions. This has been shown for Alzheimer and Parkinson disease and some forms of tauopathies and TDP-43 proteinopathies. The present paper aims to summarize current molecular classification of NDDs, focusing on the most relevant biochemical and morphological aspects. Since the combination of proteinopathies is frequent, definition of novel clusters of patients with NDDs needs to be considered in the era of precision medicine. Optimally, neuropathological categorizing of NDDs should be translated into in vivo detectable biomarkers to support better prediction of prognosis and stratification of patients for therapy trials.
Kovacs, Gabor G.
2016-01-01
Neurodegenerative diseases (NDDs) are characterized by selective dysfunction and loss of neurons associated with pathologically altered proteins that deposit in the human brain but also in peripheral organs. These proteins and their biochemical modifications can be potentially targeted for therapy or used as biomarkers. Despite a plethora of modifications demonstrated for different neurodegeneration-related proteins, such as amyloid-β, prion protein, tau, α-synuclein, TAR DNA-binding protein 43 (TDP-43), or fused in sarcoma protein (FUS), molecular classification of NDDs relies on detailed morphological evaluation of protein deposits, their distribution in the brain, and their correlation to clinical symptoms together with specific genetic alterations. A further facet of the neuropathology-based classification is the fact that many protein deposits show a hierarchical involvement of brain regions. This has been shown for Alzheimer and Parkinson disease and some forms of tauopathies and TDP-43 proteinopathies. The present paper aims to summarize current molecular classification of NDDs, focusing on the most relevant biochemical and morphological aspects. Since the combination of proteinopathies is frequent, definition of novel clusters of patients with NDDs needs to be considered in the era of precision medicine. Optimally, neuropathological categorizing of NDDs should be translated into in vivo detectable biomarkers to support better prediction of prognosis and stratification of patients for therapy trials. PMID:26848654
Suplatov, Dmitry; Kirilin, Eugeny; Arbatsky, Mikhail; Takhaveev, Vakil; Svedas, Vytas
2014-07-01
The new web-server pocketZebra implements the power of bioinformatics and geometry-based structural approaches to identify and rank subfamily-specific binding sites in proteins by functional significance, and select particular positions in the structure that determine selective accommodation of ligands. A new scoring function has been developed to annotate binding sites by the presence of the subfamily-specific positions in diverse protein families. pocketZebra web-server has multiple input modes to meet the needs of users with different experience in bioinformatics. The server provides on-site visualization of the results as well as off-line version of the output in annotated text format and as PyMol sessions ready for structural analysis. pocketZebra can be used to study structure-function relationship and regulation in large protein superfamilies, classify functionally important binding sites and annotate proteins with unknown function. The server can be used to engineer ligand-binding sites and allosteric regulation of enzymes, or implemented in a drug discovery process to search for potential molecular targets and novel selective inhibitors/effectors. The server, documentation and examples are freely available at http://biokinet.belozersky.msu.ru/pocketzebra and there are no login requirements. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
GPCRdb: an information system for G protein-coupled receptors
Isberg, Vignir; Mordalski, Stefan; Munk, Christian; Rataj, Krzysztof; Harpsøe, Kasper; Hauser, Alexander S.; Vroling, Bas; Bojarski, Andrzej J.; Vriend, Gert; Gloriam, David E.
2016-01-01
Recent developments in G protein-coupled receptor (GPCR) structural biology and pharmacology have greatly enhanced our knowledge of receptor structure-function relations, and have helped improve the scientific foundation for drug design studies. The GPCR database, GPCRdb, serves a dual role in disseminating and enabling new scientific developments by providing reference data, analysis tools and interactive diagrams. This paper highlights new features in the fifth major GPCRdb release: (i) GPCR crystal structure browsing, superposition and display of ligand interactions; (ii) direct deposition by users of point mutations and their effects on ligand binding; (iii) refined snake and helix box residue diagram looks; and (iv) phylogenetic trees with receptor classification colour schemes. Under the hood, the entire GPCRdb front- and back-ends have been re-coded within one infrastructure, ensuring a smooth browsing experience and development. GPCRdb is available at http://www.gpcrdb.org/ and its open source code at https://bitbucket.org/gpcr/protwis. PMID:26582914
Deep Learning and Its Applications in Biomedicine.
Cao, Chensi; Liu, Feng; Tan, Hai; Song, Deshou; Shu, Wenjie; Li, Weizhong; Zhou, Yiming; Bo, Xiaochen; Xie, Zhi
2018-02-01
Advances in biological and medical technologies have been providing us with explosive volumes of biological and physiological data, such as medical images, electroencephalography, genomic and protein sequences. Learning from these data facilitates the understanding of human health and disease. Developed from artificial neural networks, deep learning-based algorithms show great promise in extracting features and learning patterns from complex data. The aim of this paper is to provide an overview of deep learning techniques and some of the state-of-the-art applications in the biomedical field. We first introduce the development of artificial neural networks and deep learning. We then describe two main components of deep learning, i.e., deep learning architectures and model optimization. Subsequently, some examples are demonstrated for deep learning applications, including medical image classification, genomic sequence analysis, as well as protein structure classification and prediction. Finally, we offer our perspectives for the future directions in the field of deep learning. Copyright © 2018. Production and hosting by Elsevier B.V.
Wavelet images and Chou's pseudo amino acid composition for protein classification.
Nanni, Loris; Brahnam, Sheryl; Lumini, Alessandra
2012-08-01
The last decade has seen an explosion in the collection of protein data. To actualize the potential offered by this wealth of data, it is important to develop machine systems capable of classifying and extracting features from proteins. Reliable machine systems for protein classification offer many benefits, including the promise of finding novel drugs and vaccines. In developing our system, we analyze and compare several feature extraction methods used in protein classification that are based on the calculation of texture descriptors starting from a wavelet representation of the protein. We then feed these texture-based representations of the protein into an Adaboost ensemble of neural networks or a support vector machine classifier. In addition, we perform experiments that combine our feature extraction methods with a standard method that is based on Chou's pseudo amino acid composition. Using several datasets, we show that our best approach outperforms standard methods. The Matlab code of the proposed protein descriptors is available at http://bias.csr.unibo.it/nanni/wave.rar .
Computational-based structural, functional and phylogenetic analysis of Enterobacter phytases.
Pramanik, Krishnendu; Kundu, Shreyasi; Banerjee, Sandipan; Ghosh, Pallab Kumar; Maiti, Tushar Kanti
2018-06-01
Myo-inositol hexakisphosphate phosphohydrolases (i.e., phytases) are important enzymes responsible for the solubilization of insoluble phosphates. In the present study, Enterobacter phytases have been characterized by different phylogenetic, structural and functional parameters using some standard bio-computational tools. Results showed that the majority of the Enterobacter phytases are acidic in nature, as most of the isoelectric points were under 7.0. The aliphatic indices predicted for the selected proteins were below 40, indicating their thermostable nature. The average molecular weight of the proteins was 48 kDa. The lower GRAVY values of these proteins implied that they have better interactions with water. Secondary structure prediction revealed that alpha-helical content was the highest among the structural forms (sheets, coils, etc.). Moreover, the predicted 3D structure of Enterobacter phytases revealed that the proteins consisted of four monomeric polypeptide chains, i.e., they were tetrameric proteins. The predicted tertiary model of E. aerogenes (A0A0M3HCJ2) was deposited in the Protein Model Database (Acc. No.: PM0080561) for further utilization after a thorough quality check with the QMEAN and SAVES servers. Functional analysis supported their classification as histidine acid phosphatases. Besides, multiple sequence alignment revealed that "DG-DP-LG" comprised the most highly conserved residues within the Enterobacter phytases. Thus, the present study will be useful in selecting a suitable phytase-producing microbe for use in the animal food industry as a food additive.
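Two of the physicochemical parameters mentioned in this abstract, GRAVY and the aliphatic index, are simple functions of amino acid composition. The sketch below computes both from a one-letter sequence using the Kyte-Doolittle hydropathy scale and Ikai's aliphatic index formula; the function names and the toy sequences are illustrative, and the abstract does not state which tool the authors used.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence):
    """Ikai's aliphatic index: A% + 2.9 * V% + 3.9 * (I% + L%), in mole percent."""
    seq = sequence.upper()

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


# Toy sequences, not phytase fragments: one purely aliphatic, one purely charged.
for toy in ("AVIL", "DDKK"):
    print(toy, round(gravy(toy), 3), round(aliphatic_index(toy), 1))
```

A negative GRAVY value indicates an overall hydrophilic sequence, which is the sense in which the abstract links lower GRAVY to better interaction with water.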
Schmidt, Nathan W.; Grigoryan, Gevorg
2017-01-01
Coiled‐coils are essential components of many protein complexes. First discovered in structural proteins such as keratins, they have since been found to figure largely in the assembly and dynamics required for diverse functions, including membrane fusion, signal transduction and motors. Coiled‐coils have a characteristic repeating seven‐residue geometric and sequence motif, which is sometimes interrupted by the insertion of one or more residues. Such insertions are often highly conserved and critical to interdomain communication in signaling proteins such as bacterial histidine kinases. Here we develop the “accommodation index” as a parameter that allows automatic detection and classification of insertions based on the three dimensional structure of a protein. This method allows precise identification of the type of insertion and the “accommodation length” over which the insertion is structurally accommodated. A simple theory is presented that predicts the structural perturbations of 1-, 3-, and 4-residue insertions as a function of the length over which the insertion is accommodated. Analysis of experimental structures is in good agreement with theory, and shows that short accommodation lengths give rise to greater perturbation of helix packing angles, changes in local helical phase, and increased structural asymmetry relative to long accommodation lengths. Cytoplasmic domains of histidine kinases in different signaling states display large changes in their accommodation lengths, which can now be seen to underlie diverse structural transitions including symmetry/asymmetry and local variations in helical phase that accompany signal transduction. PMID:27977891
Suplatov, Dmitry; Kirilin, Eugeny; Arbatsky, Mikhail; Takhaveev, Vakil; Švedas, Vytas
2014-01-01
The new web-server pocketZebra implements the power of bioinformatics and geometry-based structural approaches to identify and rank subfamily-specific binding sites in proteins by functional significance, and select particular positions in the structure that determine selective accommodation of ligands. A new scoring function has been developed to annotate binding sites by the presence of the subfamily-specific positions in diverse protein families. pocketZebra web-server has multiple input modes to meet the needs of users with different experience in bioinformatics. The server provides on-site visualization of the results as well as off-line version of the output in annotated text format and as PyMol sessions ready for structural analysis. pocketZebra can be used to study structure–function relationship and regulation in large protein superfamilies, classify functionally important binding sites and annotate proteins with unknown function. The server can be used to engineer ligand-binding sites and allosteric regulation of enzymes, or implemented in a drug discovery process to search for potential molecular targets and novel selective inhibitors/effectors. The server, documentation and examples are freely available at http://biokinet.belozersky.msu.ru/pocketzebra and there are no login requirements. PMID:24852248
Optimizing high performance computing workflow for protein functional annotation.
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-09-10
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
Optimizing high performance computing workflow for protein functional annotation
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-01-01
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296
Improving Protein Fold Recognition by Deep Learning Networks
NASA Astrophysics Data System (ADS)
Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin
2015-12-01
For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl’s benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed comparable results at the family and superfamily levels compared to ensembled DN-Fold. Finally, we extended the binary classification problem of fold recognition to a real-value regression task, which also showed promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.
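The Top 1 and Top 5 figures in this abstract are recognition rates over ranked template lists. A minimal sketch of how such a rate can be computed is given below, assuming each query comes with templates sorted by predicted score and a set of templates known to share its family, superfamily or fold; the data structures and identifiers are hypothetical, and the benchmark's exact counting rules are not given in the abstract.

```python
def top_k_rate(ranked_templates, true_matches, k):
    """Fraction of queries with at least one true match among the top k templates.

    ranked_templates: dict query -> list of template ids, best score first.
    true_matches: dict query -> set of template ids that are correct matches.
    """
    hits = sum(
        1
        for query, ranked in ranked_templates.items()
        if any(template in true_matches[query] for template in ranked[:k])
    )
    return hits / len(ranked_templates)


# Toy ranking for three queries (ids are placeholders).
ranked = {
    "q1": ["t1", "t2", "t3"],
    "q2": ["t4", "t5", "t6"],
    "q3": ["t7", "t8"],
}
truth = {"q1": {"t1"}, "q2": {"t5"}, "q3": {"t9"}}

top1 = top_k_rate(ranked, truth, k=1)
top5 = top_k_rate(ranked, truth, k=5)
print(f"Top 1: {top1:.3f}  Top 5: {top5:.3f}")
```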
Raczko, Anna M; Bujnicki, Janusz M; Pawlowski, Marcin; Godlewska, Renata; Lewandowska, Magdalena; Jagusztyn-Krynicka, Elzbieta K
2005-01-01
In Gram-negative bacterial cells, disulfide bond formation occurs in the oxidative environment of the periplasm and is catalysed by Dsb (disulfide bond) proteins found in the periplasm and in the inner membrane. In this report the identification of a new subfamily of disulfide oxidoreductases encoded by a gene denoted dsbI, and functional characterization of DsbI proteins from Campylobacter jejuni and Helicobacter pylori, as well as DsbB from C. jejuni, are described. The N-terminal domain of DsbI is related to DsbB proteins and comprises five predicted transmembrane segments, while the C-terminal domain is predicted to locate to the periplasm and to fold into a beta-propeller structure. The dsbI gene is co-transcribed with a small ORF designated dba (dsbI-accessory). Based on a series of deletion and complementation experiments it is proposed that DsbB can complement the lack of DsbI but not the converse. In the presence of DsbB, the activity of DsbI was undetectable, hence it probably acts only on a subset of possible substrates of DsbB. To reconstruct the principal events in the evolution of DsbB and DsbI proteins, sequences of all their homologues identifiable in databases were analysed. In the course of this study, previously undetected variations on the common thiol-oxidoreductase theme were identified, such as development of an additional transmembrane helix and loss or migration of the second pair of Cys residues between two distinct periplasmic loops. In conjunction with the experimental characterization of two members of the DsbI lineage, this analysis has resulted in the first comprehensive classification of the DsbB/DsbI family based on structural, functional and evolutionary criteria.
RRCRank: a fusion method using rank strategy for residue-residue contact prediction.
Jing, Xiaoyang; Dong, Qiwen; Lu, Ruqian
2017-09-02
In structural biology, protein residue-residue contacts play a crucial role in protein structure prediction. Some researchers have found that predicted residue-residue contacts can effectively constrain the conformational search space, which is significant for de novo protein structure prediction. Over the last few decades, researchers have developed various methods to predict residue-residue contacts, and fusion methods in particular have achieved significant performance in recent years. In this work, a novel fusion method based on a rank strategy is proposed to predict contacts. Unlike the traditional regression or classification strategies, the contact prediction task is regarded as a ranking task. Two kinds of features are extracted from correlated mutations methods and ensemble machine-learning classifiers, and the proposed method then uses the learning-to-rank algorithm to predict the contact probability of each residue pair. First, we perform two benchmark tests for the proposed fusion method (RRCRank) on the CASP11 and CASP12 datasets, respectively. The test results show that the RRCRank method outperforms other well-developed methods, especially for medium and short range contacts. Second, in order to verify the superiority of the ranking strategy, we predict contacts by using the traditional regression and classification strategies based on the same features as the ranking strategy. Compared with these two traditional strategies, the proposed ranking strategy shows better performance for three contact types, in particular for long range contacts. Third, the proposed RRCRank has been compared with several state-of-the-art methods in CASP11 and CASP12. The results show that RRCRank achieves comparable prediction precision and is better than three methods in most assessment metrics.
The learning-to-rank algorithm is introduced to develop a novel rank-based method for the residue-residue contact prediction of proteins, which achieves state-of-the-art performance based on the extensive assessment.
Bhattacharyya, Moitrayee; Vishveshwara, Saraswathi
2009-01-01
Background: The genome of a wide variety of prokaryotes contains the luxS gene homologue, which encodes the protein S-ribosylhomocysteinelyase (LuxS). This protein is responsible for the production of the quorum sensing molecule AI-2 and has been implicated in a variety of functions such as flagellar motility, metabolic regulation, toxin production and even pathogenicity. A high structural similarity is present in the LuxS structures determined from a few species. In this study, we have modelled the structures from several other species and have investigated their dimer interfaces. We have attempted to correlate the interface features of LuxS with the phenotypic nature of the organisms. Results: The protein structure networks (PSN) are constructed and graph theoretical analysis is performed on the structures obtained from X-ray crystallography and on the modelled ones. The interfaces, which are known to contain the active site, are characterized from the PSNs of these homodimeric proteins. The key features presented by the protein interfaces are investigated for the classification of the proteins in relation to their function. From our analysis, structural interface motifs are identified for each class in our dataset, which showed distinctly different patterns at the interface of LuxS for the probiotics and some extremophiles. Our analysis also reveals potential sites of mutation and geometric patterns at the interface that were not evident from conventional sequence alignment studies. Conclusion: The structure network approach employed in this study for the analysis of dimeric interfaces in LuxS has brought out certain structural details at the side-chain interaction level, which were elusive from the conventional structure comparison methods. The results from this study provide a better understanding of the relation between the luxS gene and its functional role in the prokaryotes.
This study also makes it possible to explore the potential direction towards the design of inhibitors of LuxS and thus towards a wide range of antimicrobials. PMID:19243584
Necci, Marco; Piovesan, Damiano; Tosatto, Silvio C E
2016-12-01
Intrinsic disorder (ID) in proteins has been extensively described for the last decade; a large-scale classification of ID in proteins is mostly missing. Here, we provide an extensive analysis of ID in the protein universe on the UniProt database derived from sequence-based predictions in MobiDB. Almost half the sequences contain an ID region of at least five residues. About 9% of proteins have a long ID region of over 20 residues which are more abundant in Eukaryotic organisms and most frequently cover less than 20% of the sequence. A small subset of about 67,000 (out of over 80 million) proteins is fully disordered and mostly found in Viruses. Most proteins have only one ID, with short ID evenly distributed along the sequence and long ID overrepresented in the center. The charged residue composition of Das and Pappu was used to classify ID proteins by structural propensities and corresponding functional enrichment. Swollen Coils seem to be used mainly as structural components and in biosynthesis in both Prokaryotes and Eukaryotes. In Bacteria, they are confined in the nucleoid and in Viruses provide DNA binding function. Coils & Hairpins seem to be specialized in ribosome binding and methylation activities. Globules & Tadpoles bind antigens in Eukaryotes but are involved in killing other organisms and cytolysis in Bacteria. The Undefined class is used by Bacteria to bind toxic substances and mediate transport and movement between and within organisms in Viruses. Fully disordered proteins behave similarly, but are enriched for glycine residues and extracellular structures. © 2016 The Protein Society.
Kandaswamy, Krishna Kumar; Pugalenthi, Ganesan; Möller, Steffen; Hartmann, Enno; Kalies, Kai-Uwe; Suganthan, P N; Martinetz, Thomas
2010-12-01
Apoptosis is an essential process for controlling tissue homeostasis by regulating a physiological balance between cell proliferation and cell death. The subcellular locations of proteins performing the cell death are determined by mostly independent cellular mechanisms. The regular bioinformatics tools to predict the subcellular locations of such apoptotic proteins often fail. This work proposes a model for the sorting of proteins that are involved in apoptosis, allowing us to predict both their subcellular locations and the molecular properties that contribute to them. We report a novel hybrid Genetic Algorithm (GA)/Support Vector Machine (SVM) approach to predict apoptotic protein sequences using 119 sequence derived properties like frequency of amino acid groups, secondary structure, and physicochemical properties. GA is used for selecting a near-optimal subset of informative features that is most relevant for the classification. Jackknife cross-validation is applied to test the predictive capability of the proposed method on 317 apoptosis proteins. Our method achieved 85.80% accuracy using all 119 features and 89.91% accuracy for 25 features selected by GA. Our models were evaluated on a test dataset of 98 apoptosis proteins and obtained an overall accuracy of 90.34%. The results show that the proposed approach is promising; it is able to select small subsets of features and still improve the classification accuracy. Our model can contribute to the understanding of programmed cell death and drug discovery. The software and dataset are available at http://www.inb.uni-luebeck.de/tools-demos/apoptosis/GASVM.
Predicting β-turns and their types using predicted backbone dihedral angles and secondary structures
2010-01-01
Background: β-turns are secondary structure elements usually classified as coil. Their prediction is important, because of their role in protein folding and their frequent occurrence in protein chains. Results: We have developed a novel method that predicts β-turns and their types using information from multiple sequence alignments, predicted secondary structures and, for the first time, predicted dihedral angles. Our method uses support vector machines, a supervised classification technique, and is trained and tested on three established datasets of 426, 547 and 823 protein chains. We achieve a Matthews correlation coefficient of up to 0.49, when predicting the location of β-turns, the highest reported value to date. Moreover, the additional dihedral information improves the prediction of β-turn types I, II, IV, VIII and "non-specific", achieving correlation coefficients up to 0.39, 0.33, 0.27, 0.14 and 0.38, respectively. Our results are more accurate than other methods. Conclusions: We have created an accurate predictor of β-turns and their types. Our method, called DEBT, is available online at http://comp.chem.nottingham.ac.uk/debt/. PMID:20673368
Kountouris, Petros; Hirst, Jonathan D
2010-07-31
Beta-turns are secondary structure elements usually classified as coil. Their prediction is important, because of their role in protein folding and their frequent occurrence in protein chains. We have developed a novel method that predicts beta-turns and their types using information from multiple sequence alignments, predicted secondary structures and, for the first time, predicted dihedral angles. Our method uses support vector machines, a supervised classification technique, and is trained and tested on three established datasets of 426, 547 and 823 protein chains. We achieve a Matthews correlation coefficient of up to 0.49, when predicting the location of beta-turns, the highest reported value to date. Moreover, the additional dihedral information improves the prediction of beta-turn types I, II, IV, VIII and "non-specific", achieving correlation coefficients up to 0.39, 0.33, 0.27, 0.14 and 0.38, respectively. Our results are more accurate than other methods. We have created an accurate predictor of beta-turns and their types. Our method, called DEBT, is available online at http://comp.chem.nottingham.ac.uk/debt/.
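The Matthews correlation coefficient (MCC) used to report these results has a standard definition in terms of the binary confusion matrix: MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below computes it for per-residue turn/non-turn labels. The function names and the toy labels are hypothetical, and returning 0 for a zero denominator is a common convention assumed here, not something stated in the abstract.

```python
import math


def confusion_counts(predicted, observed):
    """Count TP, TN, FP, FN for two equal-length sequences of 0/1 labels."""
    tp = sum(1 for p, o in zip(predicted, observed) if p == 1 and o == 1)
    tn = sum(1 for p, o in zip(predicted, observed) if p == 0 and o == 0)
    fp = sum(1 for p, o in zip(predicted, observed) if p == 1 and o == 0)
    fn = sum(1 for p, o in zip(predicted, observed) if p == 0 and o == 1)
    return tp, tn, fp, fn


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy per-residue labels (hypothetical): 1 = residue in a beta-turn, 0 = not.
predicted = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
observed = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]

print(matthews_cc(*confusion_counts(predicted, observed)))
```

Unlike plain accuracy, the MCC stays informative when turn and non-turn residues are unbalanced, which is presumably why it is the headline metric here.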
A high level interface to SCOP and ASTRAL implemented in python.
Casbon, James A; Crooks, Gavin E; Saqi, Mansoor A S
2006-01-10
Benchmarking algorithms in structural bioinformatics often involves the construction of datasets of proteins with given sequence and structural properties. The SCOP database is a manually curated structural classification which groups together proteins on the basis of structural similarity. The ASTRAL compendium provides non redundant subsets of SCOP domains on the basis of sequence similarity such that no two domains in a given subset share more than a defined degree of sequence similarity. Taken together these two resources provide a 'ground truth' for assessing structural bioinformatics algorithms. We present a small and easy to use API written in python to enable construction of datasets from these resources. We have designed a set of python modules to provide an abstraction of the SCOP and ASTRAL databases. The modules are designed to work as part of the Biopython distribution. Python users can now manipulate and use the SCOP hierarchy from within python programs, and use ASTRAL to return sequences of domains in SCOP, as well as clustered representations of SCOP from ASTRAL. The modules make the analysis and generation of datasets for use in structural genomics easier and more principled.
Esfahani, Mohammad Shahrokh; Dougherty, Edward R
2015-01-01
Phenotype classification via genomic data is hampered by small sample sizes that negatively impact classifier design. Utilization of prior biological knowledge in conjunction with training data can improve both classifier design and error estimation via the construction of the optimal Bayesian classifier. In the genomic setting, gene/protein signaling pathways provide a key source of biological knowledge. Although these pathways are neither complete, nor regulatory, with no timing associated with them, they are capable of constraining the set of possible models representing the underlying interaction between molecules. The aim of this paper is to provide a framework and the mathematical tools to transform signaling pathways to prior probabilities governing uncertainty classes of feature-label distributions used in classifier design. Structural motifs extracted from the signaling pathways are mapped to a set of constraints on a prior probability on a Multinomial distribution. Because the Dirichlet distribution is the conjugate prior for the Multinomial distribution, we propose optimization paradigms to estimate the parameters of a Dirichlet distribution in the Bayesian setting. The performance of the proposed methods is tested on two widely studied pathways: mammalian cell cycle and a p53 pathway model.
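Conjugacy is what makes a Dirichlet prior convenient here: after observing Multinomial counts, the posterior is again a Dirichlet whose parameters are the prior parameters plus the counts. The sketch below shows only this textbook update. It does not reproduce the paper's optimization of the prior from pathway constraints, and the prior values and counts are hypothetical.

```python
def dirichlet_posterior(prior_alpha, counts):
    """Posterior Dirichlet parameters after observing Multinomial counts."""
    return [a + n for a, n in zip(prior_alpha, counts)]


def dirichlet_mean(alpha):
    """Expected bin probabilities under a Dirichlet with parameters alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical prior over three discrete feature states, e.g. one that
# encodes a pathway-derived belief that the first state is more likely.
prior = [2.0, 1.0, 1.0]
counts = [3, 0, 1]  # hypothetical small-sample observations

posterior = dirichlet_posterior(prior, counts)
print(posterior)
print(dirichlet_mean(posterior))
```

With few samples the prior parameters dominate the posterior mean, which is the small-sample regime the abstract is concerned with.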
Roskoski, Robert
2016-01-01
Because dysregulation and mutations of protein kinases play causal roles in human disease, this family of enzymes has become one of the most important drug targets over the past two decades. The X-ray crystal structures of 21 of the 27 FDA-approved small molecule inhibitors bound to their target protein kinases are depicted in this paper. The structure of the enzyme-bound antagonist complex is used in the classification of these inhibitors. Type I inhibitors bind to the active protein kinase conformation (DFG-Asp in, αC-helix in). Type I½ inhibitors bind to a DFG-Asp in inactive conformation while Type II inhibitors bind to a DFG-Asp out inactive conformation. Type I, I½, and type II inhibitors occupy part of the adenine binding pocket and form hydrogen bonds with the hinge region connecting the small and large lobes of the enzyme. Type III inhibitors bind next to the ATP-binding pocket and type IV inhibitors do not bind to the ATP or peptide substrate binding sites. Type III and IV inhibitors are allosteric in nature. Type V inhibitors bind to two different regions of the protein kinase domain and are therefore bivalent inhibitors. The type I-V inhibitors are reversible. In contrast, type VI inhibitors bind covalently to their target enzyme. Type I, I½, and II inhibitors are divided into A and B subtypes. The type A inhibitors bind in the front cleft, the back cleft, and near the gatekeeper residue, all of which occur within the region separating the small and large lobes of the protein kinase. The type B inhibitors bind in the front cleft and gate area but do not extend into the back cleft. An analysis of the limited available data indicates that type A inhibitors have a long residence time (minutes to hours) while the type B inhibitors have a short residence time (seconds to minutes). The catalytic spine includes residues from the small and large lobes and interacts with the adenine ring of ATP. 
Nearly all of the approved protein kinase inhibitors occupy the adenine-binding pocket; thus it is not surprising that these inhibitors interact with nearby catalytic spine (CS) residues. Moreover, a significant number of approved drugs also interact with regulatory spine (RS) residues. Copyright © 2015 Elsevier Ltd. All rights reserved.
webPIPSA: a web server for the comparison of protein interaction properties
Richter, Stefan; Wenzel, Anne; Stein, Matthias; Gabdoulline, Razif R.; Wade, Rebecca C.
2008-01-01
Protein molecular interaction fields are key determinants of protein functionality. PIPSA (Protein Interaction Property Similarity Analysis) is a procedure to compare and analyze protein molecular interaction fields, such as the electrostatic potential. PIPSA may assist in protein functional assignment, classification of proteins, the comparison of binding properties and the estimation of enzyme kinetic parameters. webPIPSA is a web server that enables the use of PIPSA to compare and analyze protein electrostatic potentials. While PIPSA can be run with downloadable software (see http://projects.eml.org/mcm/software/pipsa), webPIPSA extends and simplifies a PIPSA run. This allows non-expert users to perform PIPSA for their protein datasets. With input protein coordinates, the superposition of protein structures, as well as the computation and analysis of electrostatic potentials, is automated. The results are provided as electrostatic similarity matrices from an all-pairwise comparison of the proteins which can be subjected to clustering and visualized as epograms (tree-like diagrams showing electrostatic potential differences) or heat maps. webPIPSA is freely available at: http://pipsa.eml.org. PMID:18420653
Ma, Yue; Tuskan, Gerald A.
2018-01-01
The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths (L) and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here) from the protein distribution densities in the LD space defined by ln(L) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct from each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level. PMID:29686995
Insecticidal activity of plant lectins and potential application in crop protection.
Macedo, Maria Lígia R; Oliveira, Caio F R; Oliveira, Carolina T
2015-01-27
Lectins constitute a complex group of proteins found in different organisms. These proteins constitute an important field for research, as their structural diversity and affinity for several carbohydrates makes them suitable for numerous biological applications. This review addresses the classification and insecticidal activities of plant lectins, providing an overview of the applicability of these proteins in crop protection. The likely target sites in insect tissues, the mode of action of these proteins, as well as the use of lectins as biotechnological tools for pest control are also described. The use of initial bioassays employing artificial diets has led to the most recent advances in this field, such as plant breeding and the construction of fusion proteins, using lectins for targeting the delivery of toxins and to potentiate expected insecticide effects. Based on the data presented, we emphasize the contribution that plant lectins may make as tools for the development of integrated insect pest control strategies.
Dreyer, Chantal; Afchain, Pauline; Trouilloud, Isabelle; André, Thierry
2016-01-01
This review reports three recently published molecular classifications of the three main gastrointestinal cancers: gastric, pancreatic and colorectal adenocarcinoma. In colorectal adenocarcinoma, 6 independent classifications were combined to yield 4 molecular sub-groups, Consensus Molecular Subtypes (CMS 1-4), linked to various clinical, molecular and survival data: CMS1 (14%: MSI with immune activation); CMS2 (37%: canonical with epithelial differentiation and activation of the WNT/MYC pathway); CMS3 (13%: metabolic with epithelial differentiation and RAS mutation); CMS4 (23%: mesenchymal with activation of TGFβ pathway and angiogenesis with stromal invasion). In gastric adenocarcinoma, 4 groups were established: subtype "EBV" (9%, high frequency of PIK3CA mutations, hypermethylation and amplification of JAK2, PD-L1 and PD-L2), subtype "MSI" (22%, high rate of mutation), subtype "genomically stable tumor" (20%, diffuse histology type and mutations of RAS and genes encoding integrins and adhesion proteins including CDH1) and subtype "tumors with chromosomal instability" (50%, intestinal type, aneuploidy and receptor tyrosine kinase amplification). In pancreatic adenocarcinomas, a classification in four sub-groups has been proposed: stable subtype (20%, aneuploidy), locally rearranged subtype (30%, focal event on one or two chromosomes), scattered subtype (36%, <200 structural variation events), and unstable subtype (14%, >200 structural variation events, defects in DNA maintenance). Although not yet used in patient care, these classifications open the way to "à la carte" treatment depending on molecular biology. Copyright © 2016 Société Française du Cancer. Published by Elsevier Masson SAS. All rights reserved.
Integration of QUARK and I-TASSER for Ab Initio Protein Structure Prediction in CASP11.
Zhang, Wenxuan; Yang, Jianyi; He, Baoji; Walker, Sara Elizabeth; Zhang, Hongjiu; Govindarajoo, Brandon; Virtanen, Jouko; Xue, Zhidong; Shen, Hong-Bin; Zhang, Yang
2016-09-01
We tested two pipelines developed for template-free protein structure prediction in the CASP11 experiment. First, the QUARK pipeline constructs structure models by reassembling fragments of continuously distributed lengths excised from unrelated proteins. Five free-modeling (FM) targets have the model successfully constructed by QUARK with a TM-score above 0.4, including the first model of T0837-D1, which has a TM-score = 0.736 and RMSD = 2.9 Å to the native. Detailed analysis showed that the success is partly attributed to the high-resolution contact map prediction derived from fragment-based distance-profiles, which are mainly located between regular secondary structure elements and loops/turns and help guide the orientation of secondary structure assembly. In the Zhang-Server pipeline, weakly scoring threading templates are re-ordered by the structural similarity to the ab initio folding models, which are then reassembled by I-TASSER based structure assembly simulations; 60% more domains with length up to 204 residues, compared to the QUARK pipeline, were successfully modeled by the I-TASSER pipeline with a TM-score above 0.4. The robustness of the I-TASSER pipeline can stem from the composite fragment-assembly simulations that combine structures from both ab initio folding and threading template refinements. Despite the promising cases, challenges still exist in long-range beta-strand folding, domain parsing, and the uncertainty of secondary structure prediction; the latter of which was found to affect nearly all aspects of FM structure predictions, from fragment identification, target classification, structure assembly, to final model selection. Significant efforts are needed to solve these problems before real progress on FM could be made. Proteins 2016; 84(Suppl 1):76-86. © 2015 Wiley Periodicals, Inc.
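The TM-score thresholds quoted above can be made concrete with the standard definition of the score, which is not spelled out in the abstract: TM = (1/L) * sum over aligned residue pairs of 1 / (1 + (d_i/d0)^2), with d0 = 1.24 * (L - 15)^(1/3) - 1.8 and L the target length. The sketch below evaluates this sum for a fixed set of aligned-pair distances. The real score is the maximum over superpositions, which is not attempted here, and the 0.5 Å floor on d0 for short targets is an assumption of this sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 in angstroms (0.5 floor is assumed)."""
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_from_distances(distances, target_length):
    """TM-score term sum for given aligned-pair distances, normalized by target length.

    Residues of the target that are not aligned contribute nothing, so partial
    coverage lowers the score even when the aligned part fits perfectly.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical 100-residue target.
print(tm_d0(100))
print(tm_score_from_distances([0.0] * 100, 100))  # perfect model, full coverage
print(tm_score_from_distances([0.0] * 50, 100))   # perfect fit on half the target
```

Because the normalization uses the target length rather than the number of aligned residues, a score above 0.4 cannot be reached by matching only a small fragment well.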
Hohenstein, Kurt; Griesmacher, Andrea; Weigel, Günter; Golderer, Georg; Ott, Helmut Werner
2011-06-01
Blue native electrophoresis (BNE) was applied to analyze the von Willebrand factor (vWF) multimers in their native state and to present a methodology to perform blue native electrophoresis on human plasma proteins, which has not been done before. The major difference between this method and the commonly used SDS-agarose gel electrophoresis is the lack of satellite bands in the high-resolution native gel. To further analyze this phenomenon, a second dimension was performed under denaturing conditions. Thereby, we obtained a pattern in which each protein sub-unit from the first dimension dissociates into three distinct sub-bands. These bands confirm the triplet structure, which consists of an intermediate band and two satellite bands. By introducing the second dimension, our novel method separates the triplet structure into a higher resolution than the commonly used SDS-agarose gel electrophoresis does. This helps considerably in the classification of ambiguous von Willebrand's disease subtypes. In addition, our method has the additional advantage of being able to resolve the triplet structure of platelet vWF multimers, which has not been identified previously through conventional SDS-agarose electrophoresis multimer analysis. This potential enables us to compare the triplet structure from platelet and plasmatic vWF, and may help to find out whether structural abnormalities concern the vWF molecule in the platelet itself, or whether they are due to the physiological processing of vWF shed into circulation. Owing to its resolution and sensitivity, this native separation technique offers a promising tool for the analysis and detection of von Willebrand disorder, and for the classification of von Willebrand's disease subtypes. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Necci, Marco; Piovesan, Damiano
2016-01-01
Intrinsic disorder (ID) in proteins has been extensively described for the last decade; a large‐scale classification of ID in proteins is mostly missing. Here, we provide an extensive analysis of ID in the protein universe on the UniProt database derived from sequence‐based predictions in MobiDB. Almost half the sequences contain an ID region of at least five residues. About 9% of proteins have a long ID region of over 20 residues which are more abundant in Eukaryotic organisms and most frequently cover less than 20% of the sequence. A small subset of about 67,000 (out of over 80 million) proteins is fully disordered and mostly found in Viruses. Most proteins have only one ID, with short ID evenly distributed along the sequence and long ID overrepresented in the center. The charged residue composition of Das and Pappu was used to classify ID proteins by structural propensities and corresponding functional enrichment. Swollen Coils seem to be used mainly as structural components and in biosynthesis in both Prokaryotes and Eukaryotes. In Bacteria, they are confined in the nucleoid and in Viruses provide DNA binding function. Coils & Hairpins seem to be specialized in ribosome binding and methylation activities. Globules & Tadpoles bind antigens in Eukaryotes but are involved in killing other organisms and cytolysis in Bacteria. The Undefined class is used by Bacteria to bind toxic substances and mediate transport and movement between and within organisms in Viruses. Fully disordered proteins behave similarly, but are enriched for glycine residues and extracellular structures. PMID:27636733
Exhaustive comparison and classification of ligand-binding surfaces in proteins
Murakami, Yoichi; Kinoshita, Kengo; Kinjo, Akira R; Nakamura, Haruki
2013-01-01
Many proteins function by interacting with other small molecules (ligands). Identification of ligand-binding sites (LBS) in proteins can therefore help to infer their molecular functions. A comprehensive comparison among local structures of LBSs was previously performed, in order to understand their relationships and to classify their structural motifs. However, similar exhaustive comparison among local surfaces of LBSs (patches) has never been performed, due to computational complexity. To enhance our understanding of LBSs, it is worth performing such comparisons among patches and classifying them based on similarities of their surface configurations and electrostatic potentials. In this study, we first developed a rapid method to compare two patches. We then clustered patches corresponding to the same PDB chemical component identifier for a ligand, and selected a representative patch from each cluster. We subsequently compared the representative patches exhaustively and clustered them using a similarity score, PatSim. Finally, the resultant PatSim scores were compared with similarities of atomic structures of the LBSs and those of the ligand-binding protein sequences and functions. Consequently, we classified the patches into ∼2000 well-characterized clusters. We found that about 63% of these clusters are used in identical protein folds, although about 25% of the clusters are conserved in distantly related proteins and even in proteins with cross-fold similarity. Furthermore, we showed that patches with higher PatSim score have the potential to be involved in similar biological processes. PMID:23934772
Protein flexibility: coordinate uncertainties and interpretation of structural differences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rashin, Alexander A., E-mail: alexander-rashin@hotmail.com; LH Baker Center for Bioinformatics and Department of Biochemistry, Biophysics and Molecular Biology, 112 Office and Lab Building, Iowa State University, Ames, IA 50011-3020; Rashin, Abraham H. L.
2009-11-01
Criteria for the interpretability of coordinate differences and a new method for identifying rigid-body motions and nonrigid deformations in protein conformational changes are developed and applied to functionally induced and crystallization-induced conformational changes. Valid interpretations of conformational movements in protein structures determined by X-ray crystallography require that the movement magnitudes exceed their uncertainty threshold. Here, it is shown that such thresholds can be obtained from the distance difference matrices (DDMs) of 1014 pairs of independently determined structures of bovine ribonuclease A and sperm whale myoglobin, with no explanations provided for reportedly minor coordinate differences. The smallest magnitudes of reportedly functional motions are just above these thresholds. Uncertainty thresholds can provide objective criteria that distinguish between true conformational changes and apparent ‘noise’, showing that some previous interpretations of protein coordinate changes attributed to external conditions or mutations may be doubtful or erroneous. The use of uncertainty thresholds, DDMs, the newly introduced CDDMs (contact distance difference matrices) and a novel simple rotation algorithm allows a more meaningful classification and description of protein motions, distinguishing between various rigid-fragment motions and nonrigid conformational deformations. It is also shown that half of 75 pairs of identical molecules, each from the same asymmetric crystallographic cell, exhibit coordinate differences that range from just outside the coordinate uncertainty threshold to the full magnitude of large functional movements. Thus, crystallization might often induce protein conformational changes that are comparable to those related to or induced by the protein function.
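A distance difference matrix compares two structures of the same molecule without superposing them: entry (i, j) is the change in the distance between atoms i and j. The sketch below builds a DDM from two coordinate lists and flags pairs whose change exceeds a threshold. The function names, the toy coordinates, and the 0.5 Å threshold are hypothetical; the paper derives its thresholds from 1014 structure pairs, and that derivation is not reproduced here.

```python
import math


def distance_matrix(coords):
    """All pairwise Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def distance_difference_matrix(coords_a, coords_b):
    """DDM: distance in structure B minus distance in structure A, per atom pair."""
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(coords_a)
    return [[db[i][j] - da[i][j] for j in range(n)] for i in range(n)]


def significant_pairs(ddm, threshold):
    """Atom pairs (i < j) whose distance change exceeds the uncertainty threshold."""
    n = len(ddm)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(ddm[i][j]) > threshold]


# Toy structures (hypothetical): the third atom moves, the first two do not.
structure_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 8.0, 0.0)]

ddm = distance_difference_matrix(structure_a, structure_b)
print(significant_pairs(ddm, threshold=0.5))
```

Because only internal distances enter, the result is unchanged by any rigid rotation or translation of either structure, so no superposition choice can bias the comparison.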
Agarwal, Pradeep K; Gupta, Kapil; Lopato, Sergiy; Agarwal, Parinita
2017-04-01
Dehydration responsive element binding (DREB) factors or CRT element binding factors (CBFs) are members of the AP2/ERF family, which comprises a large number of stress-responsive regulatory genes. This review traverses almost two decades of research, from the discovery of DREB/CBF factors to their optimization for application in plant biotechnology. In this review, we describe (i) the discovery, classification, structure, and evolution of DREB genes and proteins; (ii) induction of DREB genes by abiotic stresses and involvement of their products in stress responses; (iii) protein structure and DNA binding selectivity of different groups of DREB proteins; (iv) post-transcriptional and post-translational mechanisms of DREB transcription factor (TF) regulation; and (v) physical and/or functional interaction of DREB TFs with other proteins during plant stress responses. We also discuss existing issues in applications of DREB TFs for engineering of enhanced stress tolerance and improved performance under stress of transgenic crop plants. © The Author 2017. Published by Oxford University Press on behalf of the Society for Experimental Biology. All rights reserved. For permissions, please email: journals.permissions@oup.com.
EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation
Amidi, Afshine; Amidi, Shervine; Vlachakis, Dimitrios; Megalooikonomou, Vasileios; Paragios, Nikos; Zacharaki, Evangelia I
2018-01-01
During the past decade, with the significant progress of computational power as well as ever-rising data availability, deep learning techniques became increasingly popular due to their excellent performance on computer vision problems. The size of the Protein Data Bank (PDB) has increased more than 15-fold since 1999, which enabled the expansion of models that aim at predicting enzymatic function via their amino acid composition. Amino acid sequence, however, is less conserved in nature than protein structure and therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D convolutional neural networks classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The two-layer architecture was investigated on a large dataset of 63,558 enzymes from the PDB and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet. PMID:29740518
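A binary, voxel-based representation of protein shape of the kind mentioned above might be sketched as follows. This is a generic occupancy-grid construction under stated assumptions (centring on the centroid, isotropic scaling to fit the grid, a hypothetical `grid_size` parameter); the abstract does not describe EnzyNet's actual preprocessing, so none of this should be read as the authors' pipeline.

```python
import math

def voxelize(coords, grid_size=32):
    """Map 3D points to a binary occupancy grid with grid_size cells per axis."""
    if not coords:
        raise ValueError("no coordinates given")
    n = len(coords)
    # Centre the points on their centroid.
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    centred = [[p[k] - centroid[k] for k in range(3)] for p in coords]
    # Scale so that the farthest point lands on the edge of the grid.
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centred) or 1.0
    half = (grid_size - 1) / 2.0
    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for p in centred:
        i, j, k = (int(round(c / radius * half + half)) for c in p)
        grid[i][j][k] = 1  # binary occupancy: 1 if any atom falls in the cell
    return grid

# Toy example: three collinear points on a 5 x 5 x 5 grid.
toy_grid = voxelize([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)], grid_size=5)
occupied = sum(cell for plane in toy_grid for row in plane for cell in row)
```

A grid like this can be fed to a 3D convolutional network as a single-channel volume; additional channels would be needed to carry per-atom biochemical properties.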
Automated Gene Ontology annotation for anonymous sequence data.
Hennig, Steffen; Groth, Detlef; Lehrach, Hans
2003-07-01
Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for the description of genes and their products in any organism. Annotation by GO terms is performed in most of the current genome projects, which besides generality has the advantage of being very convenient for computer based classification methods. However, direct use of GO in small sequencing projects is not easy, especially for species not commonly represented in public databases. We present a software package (GOblet), which performs annotation based on GO terms for anonymous cDNA or protein sequences. It uses the species independent GO structure and vocabulary together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The sensitivity and the reference protein sets can be selected by the user. GOblet runs automatically and is available as a public service on our web server. The paper also addresses the reliability of automated GO annotations by using a reference set of more than 6000 human proteins. The GOblet server is accessible at http://goblet.molgen.mpg.de.
Alva, Vikram; Remmert, Michael; Biegert, Andreas; Lupas, Andrei N; Söding, Johannes
2010-01-01
Many protein classification systems capture homologous relationships by grouping domains into families and superfamilies on the basis of sequence similarity. Superfamilies with similar 3D structures are further grouped into folds. In the absence of discernable sequence similarity, these structural similarities were long thought to have originated independently, by convergent evolution. However, the growth of databases and advances in sequence comparison methods have led to the discovery of many distant evolutionary relationships that transcend the boundaries of superfamilies and folds. To investigate the contributions of convergent versus divergent evolution in the origin of protein folds, we clustered representative domains of known structure by their sequence similarity, treating them as point masses in a virtual 2D space which attract or repel each other depending on their pairwise sequence similarities. As expected, families in the same superfamily form tight clusters. But often, superfamilies of the same fold are linked with each other, suggesting that the entire fold evolved from an ancient prototype. Strikingly, some links connect superfamilies with different folds. They arise from modular peptide fragments of between 20 and 40 residues that co-occur in the connected folds in disparate structural contexts. These may be descendants of an ancestral pool of peptide modules that evolved as cofactors in the RNA world and from which the first folded proteins arose by amplification and recombination. Our galaxy of folds summarizes, in a single image, most known and many yet undescribed homologous relationships between protein superfamilies, providing new insights into the evolution of protein domains.
Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke
2008-05-01
Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. 
The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods. PMID:18452616
Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning
Bacik, John-Paul; Wrenbeck, Emily E.; Michalczyk, Ryszard; Whitehead, Timothy A.
2017-01-01
Proteins are marginally stable, and an understanding of the sequence determinants for improved protein solubility is highly desired. For enzymes, it is well known that many mutations that increase protein solubility decrease catalytic activity. These competing effects frustrate efforts to design and engineer stable, active enzymes without laborious high-throughput activity screens. To address the trade-off between enzyme solubility and activity, we performed deep mutational scanning using two different screens/selections that purport to gauge protein solubility for two full-length enzymes. We assayed a TEM-1 beta-lactamase variant and levoglucosan kinase (LGK) using yeast surface display (YSD) screening and a twin-arginine translocation pathway selection. We then compared these scans with published experimental fitness landscapes. Results from the YSD screen could explain 37% of the variance in the fitness landscapes for one enzyme. Five percent to 10% of all single missense mutations improve solubility, matching theoretical predictions of global protein stability. For a given solubility-enhancing mutation, the probability that it would retain wild-type fitness was correlated with evolutionary conservation and distance to active site, and anticorrelated with contact number. Hybrid classification models were developed that could predict solubility-enhancing mutations that maintain wild-type fitness with an accuracy of 90%. The downside of using such classification models is the removal of rare mutations that improve both fitness and solubility. To reveal the biophysical basis of enhanced protein solubility and function, we determined the crystallographic structure of one such LGK mutant. Beyond fundamental insights into trade-offs between stability and activity, these results have potential biotechnological applications. PMID:28196882
Pérot, Stéphanie; Regad, Leslie; Reynès, Christelle; Spérandio, Olivier; Miteva, Maria A; Villoutreix, Bruno O; Camproux, Anne-Claude
2013-01-01
Pockets are today at the cornerstones of modern drug discovery projects and at the crossroad of several research fields, from structural biology to mathematical modeling. Being able to predict if a small molecule could bind to one or more protein targets or if a protein could bind to some given ligands is very useful for drug discovery endeavors, anticipation of binding to off- and anti-targets. To date, several studies explore such questions from chemogenomic approach to reverse docking methods. Most of these studies have been performed either from the viewpoint of ligands or targets. However it seems valuable to use information from both ligands and target binding pockets. Hence, we present a multivariate approach relating ligand properties with protein pocket properties from the analysis of known ligand-protein interactions. We explored and optimized the pocket-ligand pair space by combining pocket and ligand descriptors using Principal Component Analysis and developed a classification engine on this paired space, revealing five main clusters of pocket-ligand pairs sharing specific and similar structural or physico-chemical properties. These pocket-ligand pair clusters highlight correspondences between pocket and ligand topological and physico-chemical properties and capture relevant information with respect to protein-ligand interactions. Based on these pocket-ligand correspondences, a protocol of prediction of clusters sharing similarity in terms of recognition characteristics is developed for a given pocket-ligand complex and gives high performances. It is then extended to cluster prediction for a given pocket in order to acquire knowledge about its expected ligand profile or to cluster prediction for a given ligand in order to acquire knowledge about its expected pocket profile. 
This prediction approach shows promising results and could contribute to predict some ligand properties critical for binding to a given pocket, and conversely, some key pocket properties for ligand binding. PMID:23840299
Koua, Dominique; Kuhn-Nentwig, Lucia
2017-01-01
Spider venoms are rich cocktails of bioactive peptides, proteins, and enzymes that are being intensively investigated over the years. In order to provide a better comprehension of that richness, we propose a three-level family classification system for spider venom components. This classification is supported by an exhaustive set of 219 new profile hidden Markov models (HMMs) able to attribute a given peptide to its precise peptide type, family, and group. The proposed classification has the advantages of being totally independent from variable spider taxonomic names and can easily evolve. In addition to the new classifiers, we introduce and demonstrate the efficiency of hmmcompete, a new standalone tool that monitors HMM-based family classification and, after post-processing the result, reports the best classifier when multiple models produce significant scores towards given peptide queries. The combined used of hmmcompete and the new spider venom component-specific classifiers demonstrated 96% sensitivity to properly classify all known spider toxins from the UniProtKB database. These tools are timely regarding the important classification needs caused by the increasing number of peptides and proteins generated by transcriptomic projects. PMID:28786958
Yang, Fan; Xu, Ying-Ying; Shen, Hong-Bin
2014-01-01
Human protein subcellular location prediction can provide critical knowledge for understanding a protein's function. Since significant progress has been made on digital microscopy, automated image-based protein subcellular location classification is urgently needed. In this paper, we aim to investigate more representative image features that can be effectively used for dealing with the multilabel subcellular image samples. We prepared a large multilabel immunohistochemistry (IHC) image benchmark from the Human Protein Atlas database and tested the performance of different local texture features, including completed local binary pattern, local tetra pattern, and the standard local binary pattern feature. According to our experimental results from binary relevance multilabel machine learning models, the completed local binary pattern, and local tetra pattern are more discriminative for describing IHC images when compared to the traditional local binary pattern descriptor. The combination of these two novel local pattern features and the conventional global texture features is also studied. The enhanced performance of final binary relevance classification model trained on the combined feature space demonstrates that different features are complementary to each other and thus capable of improving the accuracy of classification.
Guo, Hao-Bo; Ma, Yue; Tuskan, Gerald A.; ...
2018-01-01
The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here) from the protein distribution densities in the LD space defined by ln(L) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct with each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level.
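The fingerprint construction described above can be approximated by a normalised two-dimensional histogram. The sketch below assumes each protein is given as a (length, disorder fraction) pair with the disorder content expressed on a 0 to 1 scale; the bin counts, the ln(L) range, and the Euclidean distance between fingerprints are illustrative assumptions rather than the published procedure.

```python
import math

def fingerprint(proteins, lnl_range=(3.0, 9.0), n_lnl_bins=6, n_id_bins=5):
    """Normalised 2D histogram over (ln(length), intrinsic disorder fraction)."""
    if not proteins:
        raise ValueError("empty protein set")
    lo, hi = lnl_range
    counts = [[0] * n_id_bins for _ in range(n_lnl_bins)]
    for length, disorder in proteins:
        x = (math.log(length) - lo) / (hi - lo)
        # Clamp out-of-range values into the first or last bin.
        i = min(max(int(x * n_lnl_bins), 0), n_lnl_bins - 1)
        j = min(max(int(disorder * n_id_bins), 0), n_id_bins - 1)
        counts[i][j] += 1
    total = len(proteins)
    return [[c / total for c in row] for row in counts]

def fingerprint_distance(fp_a, fp_b):
    """Euclidean distance between two fingerprints of identical shape."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(fp_a, fp_b)
                         for a, b in zip(row_a, row_b)))

# Toy protein sets (made-up values, for illustration only).
proteome_a = [(100, 0.1), (300, 0.2), (1000, 0.5)]
proteome_b = [(100, 0.9), (120, 0.8)]
fp_a = fingerprint(proteome_a)
fp_b = fingerprint(proteome_b)
d_ab = fingerprint_distance(fp_a, fp_b)
```

A matrix of such pairwise distances could then be passed to any standard distance-based tree-building method, which is consistent with the abstract's point that no sequence alignment is needed.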
Isaac, Arnold Emerson; Sinha, Sitabhra
2015-10-01
The representation of proteins as networks of interacting amino acids, referred to as protein contact networks (PCN), and their subsequent analyses using graph theoretic tools, can provide novel insights into the key functional roles of specific groups of residues. We have characterized the networks corresponding to the native states of 66 proteins (belonging to different families) in terms of their core-periphery organization. The resulting hierarchical classification of the amino acid constituents of a protein arranges the residues into successive layers - having higher core order - with increasing connection density, ranging from a sparsely linked periphery to a densely intra-connected core (distinct from the earlier concept of protein core defined in terms of the three-dimensional geometry of the native state, which has least solvent accessibility). Our results show that residues in the inner cores are more conserved than those at the periphery. Underlining the functional importance of the network core, we see that the receptor sites for known ligand molecules of most proteins occur in the innermost core. Furthermore, the association of residues with structural pockets and cavities in binding or active sites increases with the core order. From mutation sensitivity analysis, we show that the probability of deleterious or intolerant mutations also increases with the core order. We also show that stabilization centre residues are in the innermost cores, suggesting that the network core is critically important in maintaining the structural stability of the protein. A publicly available Web resource for performing core-periphery analysis of any protein whose native state is known has been made available by us at http://www.imsc.res.in/~sitabhra/proteinKcore/index.html.
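The layering of residues by core order described above corresponds to the standard k-core decomposition of a graph, which can be sketched with a simple peeling procedure. The adjacency dictionary and residue labels below are hypothetical; how contacts between residues are defined (for example, which distance cutoff is used) is not stated in the abstract and is left to the caller.

```python
def core_numbers(adjacency):
    """Core order of every node, found by repeatedly peeling the lowest-degree node."""
    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Remove the node with the smallest remaining degree.
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])  # core order never decreases during peeling
        core[node] = k
        remaining.remove(node)
        for neighbour in adjacency[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core

# Toy contact network: a triangle of residues A, B, C plus a pendant residue D.
contacts = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
cores = core_numbers(contacts)
```

Here the triangle forms the innermost (order 2) core and the pendant residue sits in the periphery (order 1). This version is quadratic in the number of nodes, which is adequate for single proteins; a bucket-based implementation would be preferable for large graphs.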
Classification and Characterization of Therapeutic Antibody Aggregates
Joubert, Marisa K.; Luo, Quanzhou; Nashed-Samuel, Yasser; Wypych, Jette; Narhi, Linda O.
2011-01-01
A host of diverse stress techniques was applied to a monoclonal antibody (IgG2) to yield protein particles with varying attributes and morphologies. Aggregated solutions were evaluated for percent aggregation, particle counts, size distribution, morphology, changes in secondary and tertiary structure, surface hydrophobicity, metal content, and reversibility. Chemical modifications were also identified in a separate report (Luo, Q., Joubert, M. K., Stevenson, R., Narhi, L. O., and Wypych, J. (2011) J. Biol. Chem. 286, 25134–25144). Aggregates were categorized into seven discrete classes, based on the traits described. Several additional molecules (from the IgG1 and IgG2 subtypes as well as intravenous IgG) were stressed and found to be defined with the same classification system. The mechanism of protein aggregation and the type of aggregate formed depends on the nature of the stress applied. Different IgG molecules appear to aggregate by a similar mechanism under the same applied stress. Aggregates created by harsh mechanical stress showed the largest number of subvisible particles, and the class generated by thermal stress displayed the largest number of visible particles. Most classes showed a disruption of the higher order structure, with the degree of disorder depending on the stress process. Particles in all classes (except thermal stress) were at least partially reversible upon dilution in pH 5 buffer. High copper content was detected in isolated metal-catalyzed aggregates, a stress previously shown to produce immunogenic aggregates. In conclusion, protein aggregates can be a very heterogeneous population, whose qualities are the result of the type of stress that was experienced. PMID:21454532
Ceramic nanocarriers: versatile nanosystem for protein and peptide delivery.
Singh, Deependra; Dubey, Pooja; Pradhan, Madhulika; Singh, Manju Rawat
2013-02-01
Proteins and peptides have been established to be the potential drug candidate for various human diseases. But, delivery of these therapeutic protein and peptides is still a challenge due to their several unfavorable properties. Nanotechnology is expanding as a promising tool for the efficient delivery of proteins and peptides. Among numerous nano-based carriers, ceramic nanoparticles have proven themselves as a unique carrier for protein and peptide delivery as they provide a more stable, bioavailable, readily manufacturable, and acceptable proteins and polypeptide formulation. This article provides an overview of the various aspects of ceramic nanoparticles including their classification, methods of preparation, latest advances, and applications as protein and peptide delivery carriers. Ceramic nanocarriers seem to have potential for preserving structural integrity of proteins and peptides, thereby promoting a better therapeutic effect. This approach thus provides pharmaceutical scientists with a new hope for the delivery of proteins and peptides. Still, considerable study on ceramic nanocarrier is necessary with respect to pharmacokinetics, toxicology, and animal studies to confirm their efficiency as well as safety and to establish their clinical usefulness and scale-up to industrial level.
Application of Wavelet Transform for PDZ Domain Classification
Daqrouq, Khaled; Alhmouz, Rami; Balamesh, Ahmed; Memic, Adnan
2015-01-01
PDZ domains have been identified as part of an array of signaling proteins that are often unrelated, except for the well-conserved structural PDZ domain they contain. These domains have been linked to many disease processes including common Avian influenza, as well as very rare conditions such as Fraser and Usher syndromes. Historically, based on the interactions and the nature of bonds they form, PDZ domains have most often been classified into one of three classes (class I, class II and others - class III), that is directly dependent on their binding partner. In this study, we report on three unique feature extraction approaches based on the bigram and trigram occurrence and existence rearrangements within the domain's primary amino acid sequences in assisting PDZ domain classification. Wavelet packet transform (WPT) and Shannon entropy denoted by wavelet entropy (WE) feature extraction methods were proposed. Using 115 unique human and mouse PDZ domains, the existence rearrangement approach yielded a high recognition rate (78.34%), which outperformed our occurrence rearrangements based method. The recognition rate was (81.41%) with validation technique. The method reported for PDZ domain classification from primary sequences proved to be an encouraging approach for obtaining consistent classification results. We anticipate that by increasing the database size, we can further improve feature extraction and correct classification. PMID:25860375
Caprari, Silvia; Metzler, Saskia; Lengauer, Thomas; Kalinina, Olga V.
2015-01-01
The origin and evolution of viruses is a subject of ongoing debate. In this study, we provide a full account of the evolutionary relationships between proteins of significant sequence and structural similarity found in viruses that belong to different classes according to the Baltimore classification. We show that such proteins can be found in viruses from all Baltimore classes. For protein families that include these proteins, we observe two patterns of the taxonomic spread. In the first pattern, they can be found in a large number of viruses from all implicated Baltimore classes. In the other pattern, the instances of the corresponding protein in species from each Baltimore class are restricted to a few compact clades. Proteins with the first pattern of distribution are products of so-called viral hallmark genes reported previously. Additionally, this pattern is displayed by the envelope glycoproteins from Flaviviridae and Bunyaviridae and helicases of superfamilies 1 and 2 that have homologs in cellular organisms. The second pattern can often be explained by horizontal gene transfer from the host or between viruses, an example being Orthomyxoviridae and Coronaviridae hemagglutinin esterases. Another facet of horizontal gene transfer comprises multiple independent introduction events of genes from cellular organisms into otherwise unrelated viruses. PMID:26492264
The SUPERFAMILY database in 2004: additions and improvements.
Madera, Martin; Vogel, Christine; Kummerfeld, Sarah K; Chothia, Cyrus; Gough, Julian
2004-01-01
The SUPERFAMILY database provides structural assignments to protein sequences and a framework for analysis of the results. At the core of the database is a library of profile Hidden Markov Models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent an entire superfamily. We have applied the library to predicted proteins from all completely sequenced genomes (currently 154), the Swiss-Prot and TrEMBL databases and other sequence collections. Close to 60% of all proteins have at least one match, and one half of all residues are covered by assignments. All models and full results are available for download and online browsing at http://supfam.org. Users can study the distribution of their superfamily of interest across all completely sequenced genomes, investigate with which other superfamilies it combines and retrieve proteins in which it occurs. Alternatively, concentrating on a particular genome as a whole, it is possible first, to find out its superfamily composition, and secondly, to compare it with that of other genomes to detect superfamilies that are over- or under-represented. In addition, the webserver provides the following standard services: sequence search; keyword search for genomes, superfamilies and sequence identifiers; and multiple alignment of genomic, PDB and custom sequences.
Zhao, Nan; Han, Jing Ginger; Shyu, Chi-Ren; Korkin, Dmitry
2014-01-01
Single nucleotide polymorphisms (SNPs) are among the most common types of genetic variation in complex genetic disorders. A growing number of studies link the functional role of SNPs with the networks and pathways mediated by the disease-associated genes. For example, many non-synonymous missense SNPs (nsSNPs) have been found near or inside the protein-protein interaction (PPI) interfaces. Determining whether such an nsSNP will disrupt or preserve a PPI is a challenging task to address, both experimentally and computationally. Here, we present this task as three related classification problems, and develop a new computational method, called the SNP-IN tool (non-synonymous SNP INteraction effect predictor). Our method predicts the effects of nsSNPs on PPIs, given the interaction's structure. It leverages supervised and semi-supervised feature-based classifiers, including our new Random Forest self-learning protocol. The classifiers are trained based on a dataset of comprehensive mutagenesis studies for 151 PPI complexes, with experimentally determined binding affinities of the mutant and wild-type interactions. Three classification problems were considered: (1) a 2-class problem (strengthening/weakening PPI mutations), (2) another 2-class problem (mutations that disrupt/preserve a PPI), and (3) a 3-class classification (detrimental/neutral/beneficial mutation effects). In total, 11 different supervised and semi-supervised classifiers were trained and assessed, resulting in promising performance, with the weighted f-measure ranging from 0.87 for Problem 1 to 0.70 for the most challenging Problem 3. By integrating prediction results of the 2-class classifiers into the 3-class classifier, we further improved its performance for Problem 3. To demonstrate the utility of the SNP-IN tool, it was applied to study the nsSNP-induced rewiring of two disease-centered networks.
The accurate and balanced performance of the SNP-IN tool makes it readily available to study the rewiring of large-scale protein-protein interaction networks, and can be useful for functional annotation of disease-associated SNPs. The SNP-IN tool is freely accessible as a web-server at http://korkinlab.org/snpintool/. PMID:24784581
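The abstract above reports a weighted f-measure for its 2-class and 3-class problems. As a minimal sketch of how such a score is commonly computed (a support-weighted average of per-class F1 scores; the function name and the toy labels are illustrative assumptions, not taken from the SNP-IN tool):

```python
from collections import Counter


def weighted_f_measure(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n / total) * f1      # weight each class by its support
    return score


# Hypothetical 3-class labels in the spirit of Problem 3.
truth = ["detrimental", "detrimental", "neutral", "beneficial"]
guess = ["detrimental", "neutral", "neutral", "beneficial"]
print(round(weighted_f_measure(truth, guess), 3))
```

Weighting by class support keeps a rare class from dominating the average, which matters when the three mutation-effect classes are unevenly populated.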
Effective Moment Feature Vectors for Protein Domain Structures
Shi, Jian-Yu; Yiu, Siu-Ming; Zhang, Yan-Ning; Chin, Francis Yuk-Lun
2013-01-01
Image processing techniques have been shown to be useful in studying protein domain structures. The idea is to represent the pairwise distances of any two residues of the structure in a 2D distance matrix (DM). Features and/or submatrices are extracted from this DM to represent a domain. Existing approaches, however, may involve a large number of features (100–400) or complicated mathematical operations. Finding fewer but more effective features is always desirable. In this paper, based on some key observations on DMs, we are able to decompose a DM image into four basic binary images, each representing the structural characteristics of a fundamental secondary structure element (SSE) or a motif in the domain. Using the concept of moments in image processing, we further derive 45 structural features based on the four binary images. Together with 4 features extracted from the basic images, we represent the structure of a domain using 49 features. We show that our feature vectors can represent domain structures effectively in terms of the following. (1) We show a higher accuracy for domain classification. (2) We show a clear and consistent distribution of domains using our proposed structural vector space. (3) We are able to cluster the domains according to our moment features and demonstrate a relationship between structural variation and functional diversity. PMID:24391828
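The general idea of the abstract above (a residue distance matrix treated as an image, then summarized by image moments) can be sketched as follows. This is only an illustration under stated assumptions: a single distance cutoff stands in for the paper's decomposition into four binary images, the raw moment M_pq is a standard image-processing quantity rather than the authors' 49 features, and the coordinates are made up.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def binarize(dm, cutoff):
    """Binary image: 1 where two residues are within `cutoff`, else 0."""
    return [[1 if d <= cutoff else 0 for d in row] for row in dm]


def raw_moment(image, p, q):
    """Raw image moment M_pq = sum over pixels of x**p * y**q * I(x, y)."""
    return sum(
        (x ** p) * (y ** q) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )


# Toy "structure": three residues on a line, 3 units apart (hypothetical).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
image = binarize(distance_matrix(coords), cutoff=4.0)
area = raw_moment(image, 0, 0)                 # number of "on" pixels
centroid_x = raw_moment(image, 1, 0) / area    # first-order moment / area
print(area, centroid_x)
```

Low-order moments such as the area and centroid are cheap to compute and give a fixed-length description of a matrix whose size varies with domain length.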
33 CFR 67.01-15 - Classification of structures.
Code of Federal Regulations, 2010 CFR
2010-07-01
... 33 Navigation and Navigable Waters 1 2010-07-01 2010-07-01 false Classification of structures. 67... AIDS TO NAVIGATION AIDS TO NAVIGATION ON ARTIFICIAL ISLANDS AND FIXED STRUCTURES General Requirements § 67.01-15 Classification of structures. (a) When will structures be assigned to a Class? The District...
Cellular and molecular biology of orphan G protein-coupled receptors.
Oh, Da Young; Kim, Kyungjin; Kwon, Hyuk Bang; Seong, Jae Young
2006-01-01
The superfamily of G protein-coupled receptors (GPCRs) is the largest and most diverse group of membrane-spanning proteins. It plays a variety of roles in pathophysiological processes by transmitting extracellular signals to cells via heterotrimeric G proteins. Completion of the human genome project revealed the presence of approximately 168 genes encoding established nonsensory GPCRs, as well as 207 genes predicted to encode novel GPCRs for which the natural ligands remained to be identified, the so-called orphan GPCRs. Eighty-six of these orphans have now been paired to novel or previously known molecules, and 121 remain to be deorphaned. A better understanding of the GPCR structures and classification; knowledge of the receptor activation mechanism, either dependent on or independent of an agonist; increased understanding of the control of GPCR-mediated signal transduction; and development of appropriate ligand screening systems may improve the probability of discovering novel ligands for the remaining orphan GPCRs.
Unification of [FeFe]-hydrogenases into three structural and functional groups.
Poudel, Saroj; Tokmina-Lukaszewska, Monika; Colman, Daniel R; Refai, Mohammed; Schut, Gerrit J; King, Paul W; Maness, Pin-Ching; Adams, Michael W W; Peters, John W; Bothner, Brian; Boyd, Eric S
2016-09-01
[FeFe]-hydrogenases (Hyd) are structurally diverse enzymes that catalyze the reversible oxidation of hydrogen (H2). Recent biochemical data demonstrate new functional roles for these enzymes, including those that function in electron bifurcation where an exergonic reaction is coupled with an endergonic reaction to drive the reversible oxidation/production of H2. To identify the structural determinants that underpin differences in enzyme functionality, a total of 714 homologous sequences of the catalytic subunit, HydA, were compiled. Bioinformatics approaches informed by biochemical data were then used to characterize differences in inferred quaternary structure, HydA active site protein environment, accessory iron-sulfur clusters in HydA, and regulatory proteins encoded in HydA gene neighborhoods. HydA homologs were clustered into one of three classification groups, Group 1 (G1), Group 2 (G2), and Group 3 (G3). G1 enzymes were predicted to be monomeric while those in G2 and G3 were predicted to be multimeric and include HydB, HydC (G2/G3) and HydD (G3) subunits. The HydA active site and accessory iron-sulfur clusters did not vary by group type. Group-specific regulatory genes were identified in the gene neighborhoods of both G2 and G3 Hyd. Analyses of purified G2 and G3 enzymes by mass spectrometry strongly suggest that they are post-translationally modified by phosphorylation. These results suggest that bifurcation capability is dictated primarily by the presence of both HydB and HydC in Hyd complexes, rather than by variation in HydA. This classification scheme provides a framework for future biochemical and mutagenesis studies to elucidate the functional role of Hyd enzymes. Copyright © 2016 Elsevier B.V. All rights reserved.
A 3D sequence-independent representation of the protein data bank.
Fischer, D; Tsai, C J; Nussinov, R; Wolfson, H
1995-10-01
Here we address the following questions. How many structurally different entries are there in the Protein Data Bank (PDB)? How do the proteins populate the structural universe? To investigate these questions a structurally non-redundant set of representative entries was selected from the PDB. Construction of such a dataset is not trivial: (i) the considerable size of the PDB requires a large number of comparisons (there were more than 3250 structures of protein chains available in May 1994); (ii) the PDB is highly redundant, containing many structurally similar entries, not necessarily with significant sequence homology, and (iii) there is no clear-cut definition of structural similarity. The latter depends on the criteria and methods used. Here, we analyze structural similarity ignoring protein topology. To date, representative sets have been selected either by hand, by sequence comparison techniques which ignore the three-dimensional (3D) structures of the proteins or by using sequence comparisons followed by linear structural comparison (i.e. the topology, or the sequential order of the chains, is enforced in the structural comparison). Here we describe a 3D sequence-independent automated and efficient method to obtain a representative set of protein molecules from the PDB which contains all unique structures and which is structurally non-redundant. The method has two novel features. The first is the use of strictly structural criteria in the selection process without taking into account the sequence information. To this end we employ a fast structural comparison algorithm which requires on average approximately 2 s per pairwise comparison on a workstation. The second novel feature is the iterative application of a heuristic clustering algorithm that greatly reduces the number of comparisons required. We obtain a representative set of 220 chains with resolution better than 3.0 Å, or 268 chains including lower resolution entries, NMR entries and models.
The resulting set can serve as a basis for extensive structural classification and studies of 3D recurring motifs and of sequence-structure relationships. The clustering algorithm succeeds in classifying into the same structural family chains with no significant sequence homology, e.g. all the globins in one single group, all the trypsin-like serine proteases in another or all the immunoglobulin-like folds into a third. In addition, unexpected structural similarities of interest have been automatically detected between pairs of chains. A cluster analysis of the representative structures demonstrates the way the "structural universe" is populated.
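The abstract above does not spell out its heuristic clustering, so the following is only a generic sketch of how a non-redundant representative set can be built while avoiding an all-against-all comparison: each new item is compared against the current representatives only. The function name and the toy similarity test are assumptions for illustration; a real pipeline would plug in a structural comparison instead.

```python
def select_representatives(items, is_similar):
    """Greedy "leader" clustering.

    Each item joins the first representative it is similar to; otherwise it
    becomes a new representative. Only item-vs-representative comparisons
    are made, never item-vs-item across the whole set.
    """
    representatives = []
    clusters = {}
    for item in items:
        for rep in representatives:
            if is_similar(rep, item):
                clusters[rep].append(item)
                break
        else:
            # No existing representative matched: start a new cluster.
            representatives.append(item)
            clusters[item] = [item]
    return representatives, clusters


# Toy stand-in: integers are "similar" when they differ by at most 1.
reps, groups = select_representatives(
    [1, 2, 5, 6, 9], lambda a, b: abs(a - b) <= 1
)
print(reps, groups)
```

The result of a greedy pass depends on input order, which is one reason an iterative scheme such as the one the abstract mentions can be preferable to a single pass.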
Kanagawa, Motoi; Toda, Tatsushi
2017-01-01
Muscular dystrophy is a group of genetic disorders characterized by progressive muscle weakness. In the early 2000s, a new classification of muscular dystrophy, dystroglycanopathy, was established. Dystroglycanopathy often associates with abnormalities in the central nervous system. Currently, at least eighteen genes have been identified that are responsible for dystroglycanopathy, and despite its genetic heterogeneity, its common biochemical feature is abnormal glycosylation of alpha-dystroglycan. Abnormal glycosylation of alpha-dystroglycan reduces its binding activities to ligand proteins, including laminins. In just the last few years, remarkable progress has been made in determining the sugar chain structures and gene functions associated with dystroglycanopathy. The normal sugar chain contains tandem structures of ribitol-phosphate, a pentose alcohol that was previously unknown in humans. The dystroglycanopathy genes fukutin, fukutin-related protein (FKRP), and isoprenoid synthase domain-containing protein (ISPD) encode essential enzymes for the synthesis of this structure: fukutin and FKRP transfer ribitol-phosphate onto sugar chains of alpha-dystroglycan, and ISPD synthesizes CDP-ribitol, a donor substrate for fukutin and FKRP. These findings resolved long-standing questions and established a disease subgroup that is ribitol-phosphate deficient, which describes a large population of dystroglycanopathy patients. Here, we review the history of dystroglycanopathy, the properties of the sugar chain structure of alpha-dystroglycan, dystroglycanopathy gene functions, and therapeutic strategies. PMID:29081423
Efficacy of function specific 3D-motifs in enzyme classification according to their EC-numbers.
Rahimi, Amir; Madadkar-Sobhani, Armin; Touserkani, Rouzbeh; Goliaei, Bahram
2013-11-07
Due to the increasing number of protein structures with unknown function originating from structural genomics projects, protein function prediction has become an important subject in bioinformatics. Among diverse function prediction methods, exploring known 3D-motifs, which are associated with functional elements, in unknown protein structures is one of the most biologically meaningful methods. Homologous enzymes inherit such motifs in their active sites from common ancestors. However, slight differences in the properties of these motifs result in variation in the reactions and substrates of the enzymes. In this study, we examined the possibility of discriminating highly related active site patterns according to their EC-numbers by 3D-motifs. For each EC-number, the spatial arrangement of an active site, which has minimum average distance to other active sites with the same function, was selected as a representative 3D-motif. In order to characterize the motifs, various points in active site elements were tested. The results demonstrated the possibility of predicting the full EC-number of enzymes by 3D-motifs. However, the discriminating power of 3D-motifs varies among different enzyme families and depends on selecting the appropriate points and features. © 2013 Elsevier Ltd. All rights reserved.
Improving protein complex classification accuracy using amino acid composition profile.
Huang, Chien-Hung; Chou, Szu-Yu; Ng, Ka-Lok
2013-09-01
Protein complex prediction approaches are based on the assumptions that complexes have dense protein-protein interactions and high functional similarity between their subunits. We investigated those assumptions by studying the subunits' interaction topology, sequence similarity and molecular function for human and yeast protein complexes. Inclusion of amino acids' physicochemical properties can provide better understanding of protein complex properties. Principal component analysis is carried out to determine the major features. Adopting amino acid composition profile information with the SVM classifier serves as an effective post-processing step for complexes classification. Improvement is based on primary sequence information only, which is easy to obtain. Copyright © 2013 Elsevier Ltd. All rights reserved.
Orientation selectivity based structure for texture classification
NASA Astrophysics Data System (ADS)
Wu, Jinjian; Lin, Weisi; Shi, Guangming; Zhang, Yazhong; Lu, Liu
2014-10-01
Local structure, e.g., local binary pattern (LBP), is widely used in texture classification. However, LBP is too sensitive to disturbance. In this paper, we introduce a novel structure for texture classification. Research in cognitive neuroscience indicates that the primary visual cortex presents remarkable orientation selectivity for visual information extraction. Inspired by this, we investigate the orientation similarities among neighboring pixels, and propose an orientation selectivity based pattern for local structure description. Experimental results on texture classification demonstrate that the proposed structure descriptor is quite robust to disturbance.
Idkaidek, Nasir M.
2013-01-01
The aim of this commentary is to investigate the interplay of Biopharmaceutics Classification System (BCS), Biopharmaceutics Drug Disposition Classification System (BDDCS) and Salivary Excretion Classification System (SECS). BCS first classified drugs based on permeability and solubility for the purpose of predicting oral drug absorption. Then BDDCS linked permeability with hepatic metabolism and classified drugs based on metabolism and solubility for the purpose of predicting oral drug disposition. On the other hand, SECS classified drugs based on permeability and protein binding for the purpose of predicting the salivary excretion of drugs. The role of metabolism, rather than permeability, on salivary excretion is investigated and the results are not in agreement with BDDCS. Conclusion: The proposed Salivary Excretion Classification System (SECS) can be used as a guide for drug salivary excretion based on permeability (not metabolism) and protein binding. PMID:24493977
Predicting residue-wise contact orders in proteins by support vector regression.
Song, Jiangning; Burrage, Kevin
2006-10-03
The residue-wise contact order (RWCO) describes the sequence separations between the residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered as a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences.
Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.
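To make the quantity being predicted concrete, here is a minimal sketch of computing RWCO from coordinates. It assumes one common form of the definition: for residue i, sum the sequence separations |i - j| over residues j in spatial contact with i, then divide by the chain length. The contact radius, the minimum sequence separation and the normalization are assumptions here and may differ from the settings used in the study.

```python
import math


def residue_wise_contact_order(coords, radius=10.0, min_separation=3):
    """RWCO per residue under the assumed definition described above.

    coords: one (x, y, z) point per residue, in sequence order.
    radius: distance below which two residues count as in contact (assumed).
    min_separation: ignore near neighbours along the chain (assumed).
    """
    n = len(coords)
    rwco = []
    for i in range(n):
        total = 0
        for j in range(n):
            separation = abs(i - j)
            in_contact = math.dist(coords[i], coords[j]) < radius
            if separation >= min_separation and in_contact:
                total += separation
        rwco.append(total / n)
    return rwco


# Toy chain: four residues on a line, 1 unit apart (hypothetical data).
chain = [(float(k), 0.0, 0.0) for k in range(4)]
print(residue_wise_contact_order(chain))
```

Residues whose contacts are mostly with sequence-distant partners get high values, which is why the abstract describes RWCO as capturing the extent of long-range contacts.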
Evolving pharmacology of orphan GPCRs: IUPHAR Commentary.
Davenport, Anthony P; Harmar, Anthony J
2013-10-01
The award of the 2012 Nobel Prize in Chemistry to Robert Lefkowitz and Brian Kobilka for their work on the structure and function of GPCRs, spanning a period of more than 20 years from the cloning of the human β2-adrenoceptor to determining the crystal structure of the same protein, has earned both researchers a much deserved place in the pantheon of major scientific discoveries. GPCRs comprise one of the largest families of proteins, controlling many major physiological processes and have been a major focus of the International Union of Basic and Clinical Pharmacology Committee on Receptor Nomenclature and Drug Classification (NC-IUPHAR) since its inception in 1987. We report here recent efforts by the British Pharmacological Society and NC-IUPHAR to define the endogenous ligands of 'orphan' GPCRs and to place authoritative and accessible information about these crucial therapeutic targets online. © 2013 The British Pharmacological Society.
Lévesque, Céline; Duplessis, Martin; Labonté, Jessica; Labrie, Steve; Fremaux, Christophe; Tremblay, Denise; Moineau, Sylvain
2005-01-01
The Streptococcus thermophilus virulent pac-type phage 2972 was isolated from a yogurt made in France in 1999. It is a representative of several phages that have emerged with the industrial use of the exopolysaccharide-producing S. thermophilus strain RD534. The genome of phage 2972 has 34,704 bp with an overall G+C content of 40.15%, making it the shortest S. thermophilus phage genome analyzed so far. Forty-four open reading frames (ORFs) encoding putative proteins of 40 or more amino acids were identified, and bioinformatic analyses led to the assignment of putative functions to 23 ORFs. Comparative genomic analysis of phage 2972 with the six other sequenced S. thermophilus phage genomes confirmed that the replication module is conserved and that cos- and pac-type phages have distinct structural and packaging genes. Two group I introns were identified in the genome of 2972. They interrupted the genes coding for the putative endolysin and the terminase large subunit. Phage mRNA splicing was demonstrated for both introns, and the secondary structures were predicted. Eight structural proteins were also identified by N-terminal sequencing and/or matrix-assisted laser desorption ionization—time-of-flight mass spectrometry. Detailed analysis of the putative minor tail proteins ORF19 and ORF21 as well as the putative receptor-binding protein ORF20 showed the following interesting features: (i) ORF19 is a hybrid protein, because it displays significant identity with both pac- and cos-type phages; (ii) ORF20 is unique; and (iii) a protein similar to ORF21 of 2972 was also found in the structure of the cos-type phage DT1, indicating that this structural protein is present in both S. thermophilus phage groups. The implications of these findings for phage classification are discussed. PMID:16000821
Systematic detection of internal symmetry in proteins using CE-Symm.
Myers-Turnbull, Douglas; Bliven, Spencer E; Rose, Peter W; Aziz, Zaid K; Youkharibache, Philippe; Bourne, Philip E; Prlić, Andreas
2014-05-29
Symmetry is an important feature of protein tertiary and quaternary structures that has been associated with protein folding, function, evolution, and stability. Its emergence and ensuing prevalence has been attributed to gene duplications, fusion events, and subsequent evolutionary drift in sequence. This process maintains structural similarity and is further supported by this study. To further investigate the question of how internal symmetry evolved, how symmetry and function are related, and the overall frequency of internal symmetry, we developed an algorithm, CE-Symm, to detect pseudo-symmetry within the tertiary structure of protein chains. Using a large manually curated benchmark of 1007 protein domains, we show that CE-Symm performs significantly better than previous approaches. We use CE-Symm to build a census of symmetry among domain superfamilies in SCOP and note that 18% of all superfamilies are pseudo-symmetric. Our results indicate that more domains are pseudo-symmetric than previously estimated. We establish a number of recurring types of symmetry-function relationships and describe several characteristic cases in detail. With the use of the Enzyme Commission classification, symmetry was found to be enriched in some enzyme classes but depleted in others. CE-Symm thus provides a methodology for a more complete and detailed study of the role of symmetry in tertiary protein structure [availability: CE-Symm can be run from the Web at http://source.rcsb.org/jfatcatserver/symmetry.jsp. Source code and software binaries are also available under the GNU Lesser General Public License (version 2.1) at https://github.com/rcsb/symmetry. An interactive census of domains identified as symmetric by CE-Symm is available from http://source.rcsb.org/jfatcatserver/scopResults.jsp]. Copyright © 2014. Published by Elsevier Ltd.
Khoury, George A; Smadbeck, James; Kieslich, Chris A; Koskosidis, Alexandra J; Guzman, Yannis A; Tamamis, Phanourios; Floudas, Christodoulos A
2017-06-01
Protein structure refinement is the challenging problem of operating on any protein structure prediction to improve its accuracy with respect to the native structure in a blind fashion. Although many approaches have been developed and tested during the last four CASP experiments, a majority of the methods continue to degrade models rather than improve them. Princeton_TIGRESS (Khoury et al., Proteins 2014;82:794-814) was developed previously and utilizes separate sampling and selection stages involving Monte Carlo and molecular dynamics simulations and classification using an SVM predictor. The initial implementation was shown to consistently refine protein structures 76% of the time in our own internal benchmarking on CASP 7-10 targets. In this work, we improved the sampling and selection stages and tested the method in blind predictions during CASP11. We added a decomposition of physics-based and hybrid energy functions, as well as a coordinate-free representation of the protein structure through distance-binning Cα-Cα distances to capture fine-grained movements. We performed parameter estimation to optimize the adjustable SVM parameters to maximize precision while balancing sensitivity and specificity across all cross-validated data sets, finding enrichment in our ability to select models from the populations of similar decoys generated for targets in CASPs 7-10. The MD stage was enhanced such that larger structures could be further refined. Among refinement methods that are currently implemented as web-servers, Princeton_TIGRESS 2.0 demonstrated the most consistent and most substantial net refinement in blind predictions during CASP11. The enhanced refinement protocol Princeton_TIGRESS 2.0 is freely available as a web server at http://atlas.engr.tamu.edu/refinement/. Proteins 2017; 85:1078-1098. © 2017 Wiley Periodicals, Inc.
Alvarez-Cabrera, Ana L.; Delgado, Sandra; Gil-Carton, David; Mortuza, Gulnahar B.; Montoya, Guillermo; Sorzano, Carlos O. S.; Tang, Tang K.; Carazo, Jose M.
2017-01-01
Centrosomal P4.1-associated protein (CPAP) is a cell cycle regulated protein fundamental for centrosome assembly and centriole elongation. In humans, the region between residues 897–1338 of CPAP mediates interactions with other proteins and includes a homodimerization domain. CPAP mutations cause primary autosomal recessive microcephaly and Seckel syndrome. Despite the biological/clinical relevance of CPAP, its mechanistic behavior remains unclear and its C-terminus (the G-box/TCP domain) is the only part whose structure has been solved. This situation is perhaps due in part to the challenge of obtaining the protein in a soluble, homogeneous state for structural studies. Our work constitutes a systematic structural analysis of multiple oligomers of HsCPAP897−1338, using single-particle electron microscopy (EM) of negatively stained (NS) samples. Based on image classification into clearly different regular 3D maps (putatively corresponding to dimers and tetramers) and direct observation of individual images representing other complexes of HsCPAP897−1338 (i.e., putative flexible monomers and higher-order multimers), we report a dynamic oligomeric behavior of this protein, where different homo-oligomers coexist in variable proportions. We propose that dimerization of the putative homodimer forms a putative tetramer which could be the structural unit for the scaffold that either tethers the pericentriolar material to centrioles or promotes procentriole elongation. A coarse fitting of atomic models into the NS 3D maps at resolutions around 20 Å is performed only to complement our experimental data, allowing us to hypothesize on the oligomeric composition of the different complexes. In this way, the current EM work represents an initial step toward the structural characterization of different oligomers of CPAP, suggesting further insights to understand how this protein works, contributing to the elucidation of control mechanisms for centriole biogenesis. PMID:28396859
Tian, Sheng; Sun, Huiyong; Pan, Peichen; Li, Dan; Zhen, Xuechu; Li, Youyong; Hou, Tingjun
2014-10-27
In this study, to accommodate receptor flexibility, based on multiple receptor conformations, a novel ensemble docking protocol was developed by using the naïve Bayesian classification technique, and it was evaluated in terms of the prediction accuracy of docking-based virtual screening (VS) of three important targets in the kinase family: ALK, CDK2, and VEGFR2. First, for each target, the representative crystal structures were selected by structural clustering, and the capability of molecular docking based on each representative structure to discriminate inhibitors from non-inhibitors was examined. Then, for each target, 50 ns molecular dynamics (MD) simulations were carried out to generate an ensemble of the conformations, and multiple representative structures/snapshots were extracted from each MD trajectory by structural clustering. On average, the representative crystal structures outperform the representative structures extracted from MD simulations in terms of the capabilities to separate inhibitors from non-inhibitors. Finally, by using the naïve Bayesian classification technique, an integrated VS strategy was developed to combine the prediction results of molecular docking based on different representative conformations chosen from crystal structures and MD trajectories. It was encouraging to observe that the integrated VS strategy yields better performance than the docking-based VS based on any single rigid conformation. This novel protocol may provide an improvement over existing strategies to search for more diverse and promising active compounds for a target of interest.
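The integration step described above (combining docking results from several receptor conformations with a naïve Bayesian classifier) can be illustrated with a small Bernoulli naive Bayes. This is a sketch under assumptions, not the authors' protocol: each feature is a hypothetical 0/1 flag saying whether a compound scored as a hit against one conformation, Laplace smoothing is used, and the training data are invented.

```python
import math


def train_naive_bayes(features, labels, alpha=1.0):
    """Fit a two-class Bernoulli naive Bayes with Laplace smoothing.

    features: rows of 0/1 flags, one column per receptor conformation.
    labels: 1 for known inhibitors, 0 for non-inhibitors.
    """
    n_features = len(features[0])
    model = {}
    for cls in (0, 1):
        rows = [f for f, y in zip(features, labels) if y == cls]
        prior = (len(rows) + alpha) / (len(labels) + 2 * alpha)
        likelihood = [
            (sum(r[k] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for k in range(n_features)
        ]
        model[cls] = (prior, likelihood)
    return model


def log_odds_active(model, x):
    """Log posterior odds that compound `x` is an inhibitor."""
    def log_joint(cls):
        prior, likelihood = model[cls]
        return math.log(prior) + sum(
            math.log(p if v else 1.0 - p) for p, v in zip(likelihood, x)
        )
    return log_joint(1) - log_joint(0)


# Hypothetical training set: two conformations, four compounds.
X = [[1, 1], [1, 0], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
model = train_naive_bayes(X, y)
print(log_odds_active(model, [1, 1]) > 0)
```

In this toy data only the first conformation separates the classes, so the second column contributes nothing to the log odds; this is the sense in which the classifier weights informative conformations more heavily.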
Marrero-Ponce, Yovani; Contreras-Torres, Ernesto; García-Jacas, César R; Barigye, Stephen J; Cubillán, Néstor; Alvarado, Ysaías J
2015-06-07
In the present study, we introduce novel 3D protein descriptors based on the bilinear algebraic form in the ℝ(n) space on the coulombic matrix. For the calculation of these descriptors, macromolecular vectors belonging to ℝ(n) space, whose components represent certain amino acid side-chain properties, were used as weighting schemes. Generalization approaches for the calculation of inter-amino acidic residue spatial distances based on Minkowski metrics are proposed. The simple- and double-stochastic schemes were defined as approaches to normalize the coulombic matrix. The local-fragment indices for both amino acid-types and amino acid-groups are presented in order to permit characterizing fragments of interest in proteins. On the other hand, with the objective of taking into account specific interactions among amino acids in global or local indices, geometric and topological cut-offs are defined. To assess the utility of global and local indices, a classification model for the prediction of the four major protein structural classes was built with the Linear Discriminant Analysis (LDA) technique. The developed LDA-model correctly classifies 92.6% and 92.7% of the proteins in the training and test sets, respectively. The obtained model showed high values of the generalized square correlation coefficient (GC(2)) on both the training and test series. The statistical parameters derived from the internal and external validation procedures demonstrate the robustness, stability and the high predictive power of the proposed model. The performance of the LDA-model demonstrates the capability of the proposed indices not only to codify relevant biochemical information related to the structural classes of proteins, but also to yield suitable interpretability. It is anticipated that the current method will benefit the prediction of other protein attributes or functions. Copyright © 2015 Elsevier Ltd. All rights reserved.
Andreev, Victor P; Gillespie, Brenda W; Helfand, Brian T; Merion, Robert M
2016-01-01
Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to reflect molecular mechanisms of the subtypes of the disease and to lead to more targeted and successful interventions for the identified subtypes. Multiple classification algorithms exist, but none is ideal for all types of data. Importantly, there are no established methods to estimate sample size in unsupervised classification (unlike power analysis in hypothesis testing). Therefore, we developed a simulation approach allowing comparison of misclassification errors and estimation of the required sample size for a given effect size, number, and correlation matrix of the differentially abundant proteins in targeted proteomics studies. All the experiments were performed in silico. The simulated data imitated the data expected from a study of the plasma of patients with lower urinary tract dysfunction with the aptamer proteomics assay Somascan (SomaLogic Inc, Boulder, CO), which targeted 1129 proteins, including 330 involved in inflammation, 180 in stress response, 80 in aging, etc. Three popular clustering methods (hierarchical, k-means, and k-medoids) were compared. K-means clustering performed much better for the simulated data than the other two methods and enabled classification with misclassification error below 5% in the simulated cohort of 100 patients based on the molecular signatures of 40 differentially abundant proteins (effect size 1.5) from among the 1129-protein panel. PMID:27524871
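The simulation idea described in this abstract can be sketched in a few lines. The code below is a minimal illustration, not the authors' procedure: it assumes two equally sized subtypes, independent (uncorrelated) standard-normal protein levels, and an effect size expressed in standard deviations, and all function names are hypothetical. The study itself also varied the correlation matrix and compared hierarchical and k-medoids clustering, which this sketch omits.

```python
import random


def simulate(n_per_group=50, n_proteins=1129, n_diff=40, effect=1.5, seed=0):
    """Two subtypes; the first n_diff proteins are shifted by `effect` SDs in one."""
    rng = random.Random(seed)
    data, labels = [], []
    for group in (0, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_proteins)]
            if group == 1:
                for j in range(n_diff):
                    row[j] += effect
            data.append(row)
            labels.append(group)
    return data, labels


def kmeans2(data, seed=0, max_iter=100):
    """Plain two-cluster k-means (Lloyd's algorithm) with random initial centers."""
    rng = random.Random(seed)
    centers = [row[:] for row in rng.sample(data, 2)]
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min((0, 1), key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centers[c])))
            for row in data
        ]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for c in (0, 1):
            members = [row for row, a in zip(data, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def misclassification_error(truth, predicted):
    """Error rate after choosing the better of the two cluster-to-label mappings."""
    wrong = sum(t != p for t, p in zip(truth, predicted))
    return min(wrong, len(truth) - wrong) / len(truth)
```

Calling `simulate()` with its defaults mirrors the cohort of 100 patients and the 1129-protein panel; repeating the run over several seeds and cohort sizes and averaging `misclassification_error` gives a rough sample-size curve under these simplifying assumptions.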
Characterization of the interactions between protein and carbon black.
Chen, Tzu-Tao; Chuang, Kai-Jen; Chiang, Ling-Ling; Chen, Chun-Chao; Yeh, Chi-Tai; Wang, Liang-Shun; Gregory, Clive; Jones, Tim; BéruBé, Kelly; Lee, Chun-Nin; Chuang, Hsiao-Chi; Cheng, Tsun-Jen
2014-01-15
A considerable number of studies have been conducted to investigate the interactions of biological fluids with nanoparticle surfaces, which exhibit a high affinity for proteins and particles. However, the mechanisms underlying these interactions have not been elucidated, particularly as they relate to human health. Using bovine serum albumin (BSA) and mouse bronchoalveolar lavage fluid (BALF) as models for protein-particle conjugates, we characterized the physicochemical modifications of carbon blacks (CB) of 23 nm or 65 nm in diameter after protein treatment. Adsorbed BALF-containing proteins were quantified and identified by pathways, biological analyses and protein classification. Significant modifications of the physicochemistry of CB were induced by the addition of BSA. Enzyme modulators and hydrolases predominantly interacted with CB, with protein-to-CB interactions that were associated with the coagulation pathways. Additionally, our results revealed that an acute-phase response could be activated by these proteins. With regard to human health, the present study revealed that CB can react with proteins (∼55 kDa and 70 kDa) after inhalation and may modify the functional structures of lung proteins, leading to the activation of acute inflammatory responses in the lungs. Copyright © 2013 Elsevier B.V. All rights reserved.
Cancer classification in the genomic era: five contemporary problems.
Song, Qingxuan; Merajver, Sofia D; Li, Jun Z
2015-10-19
Classification is an everyday instinct as well as a full-fledged scientific discipline. Throughout the history of medicine, disease classification is central to how we develop knowledge, make diagnosis, and assign treatment. Here, we discuss the classification of cancer and the process of categorizing cancer subtypes based on their observed clinical and biological features. Traditionally, cancer nomenclature is primarily based on organ location, e.g., "lung cancer" designates a tumor originating in lung structures. Within each organ-specific major type, finer subgroups can be defined based on patient age, cell type, histological grades, and sometimes molecular markers, e.g., hormonal receptor status in breast cancer or microsatellite instability in colorectal cancer. In the past 15+ years, high-throughput technologies have generated rich new data regarding somatic variations in DNA, RNA, protein, or epigenomic features for many cancers. These data, collected for increasingly large tumor cohorts, have provided not only new insights into the biological diversity of human cancers but also exciting opportunities to discover previously unrecognized cancer subtypes. Meanwhile, the unprecedented volume and complexity of these data pose significant challenges for biostatisticians, cancer biologists, and clinicians alike. Here, we review five related issues that represent contemporary problems in cancer taxonomy and interpretation. (1) How many cancer subtypes are there? (2) How can we evaluate the robustness of a new classification system? (3) How are classification systems affected by intratumor heterogeneity and tumor evolution? (4) How should we interpret cancer subtypes? (5) Can multiple classification systems co-exist? While related issues have existed for a long time, we will focus on those aspects that have been magnified by the recent influx of complex multi-omics data. 
Exploration of these problems is essential for data-driven refinement of cancer classification and the successful application of these concepts in precision medicine.
Jaiswal, Mamta; Dvorsky, Radovan; Ahmadian, Mohammad Reza
2013-02-08
The diffuse B-cell lymphoma (Dbl) family of the guanine nucleotide exchange factors is a direct activator of the Rho family proteins. The Rho family proteins are involved in almost every cellular process that ranges from fundamental (e.g. the establishment of cell polarity) to highly specialized processes (e.g. the contraction of vascular smooth muscle cells). Abnormal activation of the Rho proteins is known to play a crucial role in cancer, infectious and cognitive disorders, and cardiovascular diseases. However, the existence of 74 Dbl proteins and 25 Rho-related proteins in humans, which are largely uncharacterized, has led to increasing complexity in identifying specific upstream pathways. Thus, we comprehensively investigated sequence-structure-function-property relationships of 21 representatives of the Dbl protein family regarding their specificities and activities toward 12 Rho family proteins. The meta-analysis approach provides an unprecedented opportunity to broadly profile functional properties of Dbl family proteins, including catalytic efficiency, substrate selectivity, and signaling specificity. Our analysis has provided novel insights into the following: (i) understanding of the relative differences of various Rho protein members in nucleotide exchange; (ii) comparing and defining individual and overall guanine nucleotide exchange factor activities of a large representative set of the Dbl proteins toward 12 Rho proteins; (iii) grouping the Dbl family into functionally distinct categories based on both their catalytic efficiencies and their sequence-structural relationships; (iv) identifying conserved amino acids as fingerprints of the Dbl and Rho protein interaction; and (v) defining amino acid sequences conserved within, but not between, Dbl subfamilies. Therefore, the characteristics of such specificity-determining residues identified the regions or clusters conserved within the Dbl subfamilies.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yuan, Ping; Swanson, Kurt A.; Leser, George P.
2014-10-02
The paramyxovirus hemagglutinin-neuraminidase (HN) protein plays multiple roles in viral entry and egress, including binding to sialic acid receptors, activating the fusion (F) protein to activate membrane fusion and viral entry, and cleaving sialic acid from carbohydrate chains. HN is an oligomeric integral membrane protein consisting of an N-terminal transmembrane domain, a stalk region, and an enzymatically active neuraminidase (NA) domain. Structures of the HN NA domains have been solved previously; however, the structure of the stalk region has remained elusive. The stalk region contains specificity determinants for F interactions and activation, underlying the requirement for homotypic F and HN interactions in viral entry. Mutations of the Newcastle disease virus HN stalk region have been shown to affect both F activation and NA activities, but a structural basis for understanding these dual effects on HN functions has been lacking. Here, we report the structure of the Newcastle disease virus HN ectodomain, revealing dimers of NA domain dimers flanking the N-terminal stalk domain. The stalk forms a parallel tetrameric coiled-coil bundle (4HB) that allows classification of extensive mutational data, providing insight into the functional roles of the stalk region. Mutations that affect both F activation and NA activities map predominantly to the 4HB hydrophobic core, whereas mutations that affect only F-protein activation map primarily to the 4HB surface. Two of four NA domains interact with the 4HB stalk, and residues at this interface in both the stalk and NA domain have been implicated in HN function.
The Classification and Evolution of Enzyme Function
Martínez Cuesta, Sergio; Rahman, Syed Asad; Furnham, Nicholas; Thornton, Janet M.
2015-01-01
Enzymes are the proteins responsible for the catalysis of life. Enzymes sharing a common ancestor as defined by sequence and structure similarity are grouped into families and superfamilies. The molecular function of enzymes is defined as their ability to catalyze biochemical reactions; it is manually classified by the Enzyme Commission and robust approaches to quantitatively compare catalytic reactions are just beginning to appear. Here, we present an overview of studies at the interface of the evolution and function of enzymes. PMID:25986631
2011-01-01
... sampling is based on ... atom distance-scaled ideal-gas reference state (DFIRE-AA) statistical potential function.[18] The third approach is the Rosetta all-atom energy func...
Challenging Residual Contamination of Instruments for Robotic Surgery in Japan.
Saito, Yuhei; Yasuhara, Hiroshi; Murakoshi, Satoshi; Komatsu, Takami; Fukatsu, Kazuhiko; Uetera, Yushi
2017-02-01
BACKGROUND Recently, robotic surgery has been introduced in many hospitals. The structure of robotic instruments is so complex that updating their cleaning methods is a challenge for healthcare professionals. However, there is limited information on the effectiveness of cleaning for instruments for robotic surgery. OBJECTIVE To determine the level of residual contamination of instruments for robotic surgery and to develop a method to evaluate the cleaning efficacy for complex surgical devices. METHODS Surgical instruments were collected immediately after operations and/or after in-house cleaning, and the level of residual protein was measured. Three serial measurements were performed on instruments after cleaning to determine the changes in the level of contamination and the total amount of residual protein. The study took place from September 1, 2013, through June 30, 2015, in Japan. RESULTS The amount of protein released from robotic instruments declined exponentially. The amount after in-house cleaning was 650, 550, and 530 µg/instrument in the 3 serial measurements. The overall level of residual protein in each measurement was much higher for robotic instruments than for ordinary instruments (P<.0001). CONCLUSIONS Our data demonstrated that complete removal of residual protein from surgical instruments is virtually impossible. The pattern of decline differed depending on the instrument type, which reflected the complex structure of the instruments. It might be necessary to establish a new standard for cleaning using a novel classification according to the structural complexity of instruments, especially for those for robotic surgery. Infect Control Hosp Epidemiol 2017;38:143-146.
Hayat, Maqsood; Khan, Asifullah
2013-05-01
Membrane proteins are prime constituents of a cell and mediate between intracellular and extracellular processes. The prediction of transmembrane (TM) helices and their topology provides essential information regarding the function and structure of membrane proteins. However, prediction of TM helices and their topology is a challenging issue in bioinformatics and computational biology due to experimental complexities and the lack of established structures. Therefore, the location and orientation of TM helix segments are predicted from topogenic sequences. In this regard, we propose the WRF-TMH model for effectively predicting TM helix segments. In this model, information is extracted from membrane protein sequences using compositional index and physicochemical properties. Redundant and irrelevant features are eliminated through singular value decomposition. The selected features provided by these feature extraction strategies are then fused to develop a hybrid model. Weighted random forest is adopted as the classification approach. We used two benchmark datasets, a low-resolution and a high-resolution dataset. Tenfold cross-validation is employed to assess the performance of the WRF-TMH model at different levels, including per protein, per segment, and per residue. The success rates of the WRF-TMH model are quite promising and are the best reported so far on the same datasets. The WRF-TMH model might play a substantial role in, and provide essential information for, further structural and functional studies on membrane proteins. The accompanying web predictor is accessible at http://111.68.99.218/WRF-TMH/ .
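Evaluation "per segment" and "per residue" can be made concrete with a small sketch. The code below assumes a topology is written as a string with `M` for membrane residues, and it counts an observed segment as found when a predicted segment overlaps it by at least three residues. The encoding, the overlap threshold, and the function names are illustrative assumptions; the abstract does not state the criteria actually used for WRF-TMH.

```python
from itertools import groupby


def segments(topology, label="M"):
    """Return (start, end) index pairs for each run of `label` in a topology string."""
    out, pos = [], 0
    for char, run in groupby(topology):
        length = len(list(run))
        if char == label:
            out.append((pos, pos + length))
        pos += length
    return out


def per_residue_accuracy(observed, predicted):
    """Fraction of positions where the predicted state matches the observed one."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)


def per_segment_recall(observed, predicted, min_overlap=3):
    """Fraction of observed TM segments overlapped by some predicted segment."""
    obs, pred = segments(observed), segments(predicted)
    hits = sum(
        any(min(e1, e2) - max(s1, s2) >= min_overlap for s2, e2 in pred)
        for s1, e1 in obs
    )
    return hits / len(obs) if obs else 0.0


# Hypothetical 18-residue example: i = inside, M = membrane, o = outside.
observed = "iiMMMMMooooMMMMMii"
predicted = "iiiMMMMMoooooooooi"
```

A per-protein score could then be defined, for example, as whether every observed segment is recovered, which is again only one of several possible conventions.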
Schwartz, N B; Pirok, E W; Mensch, J R; Domowicz, M S
1999-01-01
Proteoglycans are complex macromolecules, consisting of a polypeptide backbone to which are covalently attached one or more glycosaminoglycan chains. Molecular cloning has allowed identification of the genes encoding the core proteins of various proteoglycans, leading to a better understanding of the diversity of proteoglycan structure and function, as well as to the evolution of a classification of proteoglycans on the basis of emerging gene families that encode the different core proteins. One such family includes several proteoglycans that have been grouped with aggrecan, the large aggregating chondroitin sulfate proteoglycan of cartilage, based on a high number of sequence similarities within the N- and C-terminal domains. Thus far these proteoglycans include versican, neurocan, and brevican. It is now apparent that these proteins, as a group, are truly a gene family with shared structural motifs on the protein and nucleotide (mRNA) levels, and with nearly identical genomic organizations. Clearly a common ancestral origin is indicated for the members of the aggrecan family of proteoglycans. However, differing patterns of amplification and divergence have also occurred within certain exons across species and family members, leading to the class-characteristic protein motifs in the central carbohydrate-rich region exclusively. Thus the overall domain organization strongly suggests that sequence conservation in the terminal globular domains underlies common functions, whereas differences in the central portions of the genes account for functional specialization among the members of this gene family.
Strength Analysis on Ship Ladder Using Finite Element Method
NASA Astrophysics Data System (ADS)
Budianto; Wahyudi, M. T.; Dinata, U.; Ruddianto; Eko P., M. M.
2018-01-01
The design of a ship's structure should follow the rules of the applicable classification standards. The design of a ladder (staircase) on a ferry must therefore be reviewed against the loads that occur during ship operations, both while sailing and during port operations. The classification rules for ship design describe how the structural components are to be calculated, and the structure can be analysed using the Finite Element Method. The classification regulations used in the design of the ferry are those of BKI (Bureau of Classification Indonesia), so the material composition and mechanical properties of the material should comply with the rules of the vessel's classification society. The analysis of this structure used structural analysis software packages based on the Finite Element Method. The structural analysis of the ladder yielded its strength and a simulation showing that the structure can withstand a load of 140 kg under static, dynamic, and impact conditions. The resulting safety factors indicate that the structure is safe without being excessively strong.
Freiburg RNA tools: a central online resource for RNA-focused research and teaching.
Raden, Martin; Ali, Syed M; Alkhnbashi, Omer S; Busch, Anke; Costa, Fabrizio; Davis, Jason A; Eggenhofer, Florian; Gelhausen, Rick; Georg, Jens; Heyne, Steffen; Hiller, Michael; Kundu, Kousik; Kleinkauf, Robert; Lott, Steffen C; Mohamed, Mostafa M; Mattheis, Alexander; Miladi, Milad; Richter, Andreas S; Will, Sebastian; Wolff, Joachim; Wright, Patrick R; Backofen, Rolf
2018-05-21
The Freiburg RNA tools webserver is a well-established online resource for RNA-focused research. It provides a unified user interface and comprehensive result visualization for efficient command line tools. The webserver includes RNA-RNA interaction prediction (IntaRNA, CopraRNA, metaMIR), sRNA homology search (GLASSgo), sequence-structure alignments (LocARNA, MARNA, CARNA, ExpaRNA), CRISPR repeat classification (CRISPRmap), sequence design (antaRNA, INFO-RNA, SECISDesign), structure aberration evaluation of point mutations (RaSE), and RNA/protein-family model visualization (CMV), among other methods. Open education resources offer interactive visualizations of RNA structure and RNA-RNA interaction prediction as well as basic and advanced sequence alignment algorithms. The services are freely available at http://rna.informatik.uni-freiburg.de.
Choudhary, Kumari S.; Mih, Nathan; Monk, Jonathan; Kavvas, Erol; Yurkovich, James T.; Sakoulas, George; Palsson, Bernhard O.
2018-01-01
Two-component systems (TCSs) consist of a histidine kinase and a response regulator. Here, we evaluated the conservation of the AgrAC TCS among 149 completely sequenced Staphylococcus aureus strains. It is composed of four genes: agrBDCA. We found that: (i) AgrAC system (agr) was found in all but one of the 149 strains, (ii) the agr positive strains were further classified into four agr types based on AgrD protein sequences, (iii) the four agr types not only specified the chromosomal arrangement of the agr genes but also the sequence divergence of AgrC histidine kinase protein, which confers signal specificity, (iv) the sequence divergence was reflected in distinct structural properties especially in the transmembrane region and second extracellular binding domain, and (v) there was a strong correlation between the agr type and the virulence genomic profile of the organism. Taken together, these results demonstrate that bioinformatic analysis of the agr locus leads to a classification system that correlates with the presence of virulence factors and protein structural properties. PMID:29887846
A three-way approach for protein function classification
Ur Rehman, Hafeez; Azam, Nouman; Yao, JingTao; Benso, Alfredo
2017-01-01
The knowledge of protein functions plays an essential role in understanding biological cells and has a significant impact on human life in areas such as personalized medicine, better crops and improved therapeutic interventions. Due to expense and inherent difficulty of biological experiments, intelligent methods are generally relied upon for automatic assignment of functions to proteins. The technological advancements in the field of biology are improving our understanding of biological processes and are regularly resulting in new features and characteristics that better describe the role of proteins. It is inevitable to neglect and overlook these anticipated features in designing more effective classification techniques. A key issue in this context, that is not being sufficiently addressed, is how to build effective classification models and approaches for protein function prediction by incorporating and taking advantage from the ever evolving biological information. In this article, we propose a three-way decision making approach which provides provisions for seeking and incorporating future information. We considered probabilistic rough sets based models such as Game-Theoretic Rough Sets (GTRS) and Information-Theoretic Rough Sets (ITRS) for inducing three-way decisions. An architecture of protein functions classification with probabilistic rough sets based three-way decisions is proposed and explained. Experiments are carried out on Saccharomyces cerevisiae species dataset obtained from Uniprot database with the corresponding functional classes extracted from the Gene Ontology (GO) database. The results indicate that as the level of biological information increases, the number of deferred cases are reduced while maintaining similar level of accuracy. PMID:28234929
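The core of a three-way decision is a pair of thresholds on an estimated probability: accept above the upper one, reject below the lower one, and defer in between until more information arrives. The sketch below shows only that rule. The threshold values 0.7 and 0.3 are placeholders chosen for illustration; in the probabilistic rough set setting, models such as GTRS and ITRS are what determine the thresholds, and that step is not shown here.

```python
def three_way_decision(probability, alpha=0.7, beta=0.3):
    """Accept, reject, or defer based on an estimated membership probability.

    alpha and beta are illustrative thresholds, not values from the paper.
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    if probability >= alpha:
        return "accept"   # assign the protein to the functional class
    if probability <= beta:
        return "reject"   # exclude the protein from the functional class
    return "defer"        # wait for further biological information


decisions = [three_way_decision(p) for p in (0.9, 0.5, 0.1)]
```

Deferred cases are the ones expected to shrink as more biological information sharpens the probability estimates, which is the trend the abstract reports.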
Wang, Jun; Chen, Wen Feng; Li, Qing X
2012-02-24
The need for rapid diagnostics and the increasing number of bacterial species isolated necessitate the development of a rapid and effective phenotypic identification method. Mass spectrometry (MS) profiling of whole-cell proteins has the potential to satisfy these requirements. The genus Mycobacterium contains more than 154 species that are taxonomically very close and require the use of multiple genes, including 16S rDNA, for phylogenetic identification and classification. Six strains of five Mycobacterium species were selected as model bacteria in the present study because of their 16S rDNA similarity (98.4-99.8%) and the high similarity of the concatenated 16S rDNA, rpoB and hsp65 gene sequences (95.9-99.9%), which require high identification resolution. The classification of the six strains by MALDI TOF MS protein barcodes was consistent with, but at much higher resolution than, that of the multi-locus sequence analysis using 16S rDNA, rpoB and hsp65. The species were well differentiated using MALDI TOF MS and MALDI BioTyper™ software after quick preparation of whole-cell proteins. Several proteins were selected as diagnostic markers for species confirmation. An integration of MALDI TOF MS, MALDI BioTyper™ software and diagnostic protein fragments provides a robust phenotypic approach for bacterial identification and classification. Copyright © 2011 Elsevier B.V. All rights reserved.
JNK Signaling: Regulation and Functions Based on Complex Protein-Protein Partnerships
Zeke, András; Misheva, Mariya
2016-01-01
SUMMARY The c-Jun N-terminal kinases (JNKs), as members of the mitogen-activated protein kinase (MAPK) family, mediate eukaryotic cell responses to a wide range of abiotic and biotic stress insults. JNKs also regulate important physiological processes, including neuronal functions, immunological actions, and embryonic development, via their impact on gene expression, cytoskeletal protein dynamics, and cell death/survival pathways. Although the JNK pathway has been under study for >20 years, its complexity is still perplexing, with multiple protein partners of JNKs underlying the diversity of actions. Here we review the current knowledge of JNK structure and isoforms as well as the partnerships of JNKs with a range of intracellular proteins. Many of these proteins are direct substrates of the JNKs. We analyzed almost 100 of these target proteins in detail within a framework of their classification based on their regulation by JNKs. Examples of these JNK substrates include a diverse assortment of nuclear transcription factors (Jun, ATF2, Myc, Elk1), cytoplasmic proteins involved in cytoskeleton regulation (DCX, Tau, WDR62) or vesicular transport (JIP1, JIP3), cell membrane receptors (BMPR2), and mitochondrial proteins (Mcl1, Bim). In addition, because upstream signaling components impact JNK activity, we critically assessed the involvement of signaling scaffolds and the roles of feedback mechanisms in the JNK pathway. Despite a clarification of many regulatory events in JNK-dependent signaling during the past decade, many other structural and mechanistic insights are just beginning to be revealed. These advances open new opportunities to understand the role of JNK signaling in diverse physiological and pathophysiological states. PMID:27466283
Pan, Yuliang; Wang, Zixiang; Zhan, Weihua; Deng, Lei
2018-05-01
Identifying RNA-binding residues, especially energetically favored hot spots, can provide valuable clues for understanding the mechanisms and functional importance of protein-RNA interactions. Yet the limited availability of experimentally determined energy hot spots in protein-RNA crystal structures makes it difficult to develop empirical identification approaches. Computational prediction of RNA-binding hot spot residues is still in its infancy. Here, we describe a computational method, PrabHot (Prediction of protein-RNA binding hot spots), that can effectively detect hot spot residues on protein-RNA binding interfaces using an ensemble of conceptually different machine learning classifiers. Residue interaction network features and new solvent exposure characteristics are combined and selected for classification with the Boruta algorithm. In particular, two new reference datasets (benchmark and independent) have been generated containing 107 hot spots from 47 known protein-RNA complex structures. In 10-fold cross-validation on the training dataset, PrabHot achieves promising performance with an AUC score of 0.86 and a sensitivity of 0.78, which are significantly better than those of the pioneering RNA-binding hot spot prediction method HotSPRing. We also demonstrate the capability of the proposed method on the independent test dataset, where it shows a competitive advantage. The PrabHot webserver is freely available at http://denglab.org/PrabHot/. Contact: leideng@csu.edu.cn. Supplementary data are available at Bioinformatics online.
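The two metrics quoted above, AUC and sensitivity, have simple definitions that a short sketch can make explicit. The code assumes binary labels (1 = hot spot, 0 = non-hot spot) and uses the pairwise-ranking form of AUC; the toy labels and scores are invented for illustration and are unrelated to the PrabHot datasets.

```python
def sensitivity(labels, predictions):
    """True positives divided by all actual positives (labels and predictions are 0/1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return tp / (tp + fn)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: six interface residues with classifier scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.65]
toy_calls = [1 if s >= 0.65 else 0 for s in toy_scores]
```

Sensitivity depends on the score cut-off used to call a residue a hot spot (0.65 in the toy data), whereas AUC summarizes ranking quality across all cut-offs.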
Johanson, Urban; Karlsson, Maria; Johansson, Ingela; Gustavsson, Sofia; Sjövall, Sara; Fraysse, Laure; Weig, Alfons R.; Kjellbom, Per
2001-01-01
Major intrinsic proteins (MIPs) facilitate the passive transport of small polar molecules across membranes. MIPs constitute a very old family of proteins and different forms have been found in all kinds of living organisms, including bacteria, fungi, animals, and plants. In the genomic sequence of Arabidopsis, we have identified 35 different MIP-encoding genes. Based on sequence similarity, these 35 proteins are divided into four different subfamilies: plasma membrane intrinsic proteins, tonoplast intrinsic proteins, NOD26-like intrinsic proteins also called NOD26-like MIPs, and the recently discovered small basic intrinsic proteins. In Arabidopsis, there are 13 plasma membrane intrinsic proteins, 10 tonoplast intrinsic proteins, nine NOD26-like intrinsic proteins, and three small basic intrinsic proteins. The gene structure in general is conserved within each subfamily, although there is a tendency to lose introns. Based on phylogenetic comparisons of maize (Zea mays) and Arabidopsis MIPs (AtMIPs), it is argued that the general intron patterns in the subfamilies were formed before the split of monocotyledons and dicotyledons. Although the gene structure is unique for each subfamily, there is a common pattern in how transmembrane helices are encoded on the exons in three of the subfamilies. The nomenclature for plant MIPs varies widely between different species but also between subfamilies in the same species. Based on the phylogeny of all AtMIPs, a new and more consistent nomenclature is proposed. The complete set of AtMIPs, together with the new nomenclature, will facilitate the isolation, classification, and labeling of plant MIPs from other species. PMID:11500536
Integration of QUARK and I-TASSER for ab initio protein structure prediction in CASP11
Zhang, Wenxuan; Yang, Jianyi; He, Baoji; Walker, Sara Elizabeth; Zhang, Hongjiu; Govindarajoo, Brandon; Virtanen, Jouko; Xue, Zhidong; Shen, Hong-Bin; Zhang, Yang
2015-01-01
We tested two pipelines developed for template-free protein structure prediction in the CASP11 experiment. First, the QUARK pipeline constructs structure models by reassembling fragments of continuously distributed lengths excised from unrelated proteins. Five free-modeling (FM) targets have the model successfully constructed by QUARK with a TM-score above 0.4, including the first model of T0837-D1, which has a TM-score=0.736 and RMSD=2.9 Å to the native. Detailed analysis showed that the success is partly attributed to the high-resolution contact map prediction derived from fragment-based distance-profiles, which are mainly located between regular secondary structure elements and loops/turns and help guide the orientation of secondary structure assembly. In the Zhang-Server pipeline, weakly scoring threading templates are re-ordered by the structural similarity to the ab initio folding models, which are then reassembled by I-TASSER based structure assembly simulations; 60% more domains with length up to 204 residues, compared to the QUARK pipeline, were successfully modeled by the I-TASSER pipeline with a TM-score above 0.4. The robustness of the I-TASSER pipeline can stem from the composite fragment-assembly simulations that combine structures from both ab initio folding and threading template refinements. Despite the promising cases, challenges still exist in long-range beta-strand folding, domain parsing, and the uncertainty of secondary structure prediction; the latter of which was found to affect nearly all aspects of FM structure predictions, from fragment identification, target classification, structure assembly, to final model selection. Significant efforts are needed to solve these problems before real progress on FM could be made. PMID:26370505
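For readers unfamiliar with the two quality measures quoted above, the sketch below evaluates the standard TM-score formula, the average of 1 / (1 + (d_i / d0)^2) over the target length L with d0 = 1.24 (L - 15)^(1/3) - 1.8, together with RMSD, for a single given superposition. It is an illustration only: the actual TM-score is the maximum over superpositions, which this code does not search for, and the 0.5 Å floor on d0 for short chains and the function names are assumptions.

```python
import math


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distances in angstroms between aligned residue pairs.
    target_length: number of residues in the native (target) structure.
    """
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.0
    d0 = max(d0, 0.5)  # assumed floor so that d0 stays positive for short chains
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def rmsd(distances):
    """Root-mean-square deviation over the aligned residue pairs."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

Because the sum is divided by the target length rather than by the number of aligned pairs, unaligned residues lower the score, which is one reason TM-score and RMSD can rank the same models differently.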
Data Mining Algorithms for Classification of Complex Biomedical Data
ERIC Educational Resources Information Center
Lan, Liang
2012-01-01
In my dissertation, I will present my research which contributes to solve the following three open problems from biomedical informatics: (1) Multi-task approaches for microarray classification; (2) Multi-label classification of gene and protein prediction from multi-source biological data; (3) Spatial scan for movement data. In microarray…
Wan, Shixiang; Duan, Yucong; Zou, Quan
2017-09-01
Predicting the subcellular localization of proteins is an important and challenging problem. Traditional experimental approaches are often expensive and time-consuming. Consequently, a growing number of research efforts employ a series of machine learning approaches to predict the subcellular location of proteins. There are two main challenges among the state-of-the-art prediction methods. First, most of the existing techniques are designed to deal with multi-class rather than multi-label classification, which ignores connections between multiple labels. In reality, multiple locations of particular proteins imply that there are vital and unique biological significances that deserve special focus and cannot be ignored. Second, techniques for handling imbalanced data in multi-label classification problems are necessary, but never employed. For solving these two issues, we have developed an ensemble multi-label classifier called HPSLPred, which can be applied for multi-label classification with an imbalanced protein source. For convenience, a user-friendly webserver has been established at http://server.malab.cn/HPSLPred. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Mining for class-specific motifs in protein sequence classification
2013-01-01
Background In protein sequence classification, identification of the sequence motifs or n-grams that can precisely discriminate between classes is a more interesting scientific question than the classification itself. A number of classification methods aim at accurate classification but fail to explain which sequence features indeed contribute to the accuracy. We hypothesize that sequences in lower denominations (n-grams) can be used to explore the sequence landscape and to identify class-specific motifs that discriminate between classes during classification. Discriminative n-grams are short peptide sequences that are highly frequent in one class but are either minimally present or absent in other classes. In this study, we present a new substitution-based scoring function for identifying discriminative n-grams that are highly specific to a class. Results We present a scoring function based on discriminative n-grams that can effectively discriminate between classes. The scoring function, initially, harvests the entire set of 4- to 8-grams from the protein sequences of different classes in the dataset. Similar n-grams of the same size are combined to form new n-grams, where the similarity is defined by positive amino acid substitution scores in the BLOSUM62 matrix. Substitution has resulted in a large increase in the number of discriminatory n-grams harvested. Due to the unbalanced nature of the dataset, the frequencies of the n-grams are normalized using a dampening factor, which gives more weightage to the n-grams that appear in fewer classes and vice-versa. After the n-grams are normalized, the scoring function identifies discriminative 4- to 8-grams for each class that are frequent enough to be above a selection threshold. By mapping these discriminative n-grams back to the protein sequences, we obtained contiguous n-grams that represent short class-specific motifs in protein sequences. 
Our method fared well compared to an existing motif finding method known as Wordspy. We have validated our enriched set of class-specific motifs against the functionally important motifs obtained from the NLSdb, Prosite and ELM databases. We demonstrate that this method is very generic; thus can be widely applied to detect class-specific motifs in many protein sequence classification tasks. Conclusion The proposed scoring function and methodology is able to identify class-specific motifs using discriminative n-grams derived from the protein sequences. The implementation of amino acid substitution scores for similarity detection, and the dampening factor to normalize the unbalanced datasets have significant effect on the performance of the scoring function. Our multipronged validation tests demonstrate that this method can detect class-specific motifs from a wide variety of protein sequence classes with a potential application to detecting proteome-specific motifs of different organisms. PMID:23496846
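A minimal sketch of the n-gram harvesting and dampened scoring described in this abstract follows. It omits the BLOSUM62-based merging of similar n-grams, and the dampening term log(1 + C/c) (C classes in total, c classes containing the n-gram) is an assumed stand-in, since the abstract does not give the authors' formula. All function names are hypothetical.

```python
from collections import Counter
import math

def harvest_ngrams(sequence, n_min=4, n_max=8):
    """Count every contiguous n-gram of length n_min..n_max in one sequence."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(sequence) - n + 1):
            counts[sequence[i:i + n]] += 1
    return counts

def discriminative_scores(sequences_by_class, n_min=4, n_max=8):
    """Score n-grams per class: relative frequency times an assumed dampening factor."""
    per_class = {}
    for label, seqs in sequences_by_class.items():
        total = Counter()
        for seq in seqs:
            total.update(harvest_ngrams(seq, n_min, n_max))
        per_class[label] = total
    n_classes = len(per_class)
    # Number of classes in which each n-gram occurs at least once.
    class_count = Counter()
    for total in per_class.values():
        class_count.update(total.keys())
    scores = {}
    for label, total in per_class.items():
        size = sum(total.values()) or 1
        # Assumed dampening: n-grams seen in fewer classes get more weight.
        scores[label] = {
            gram: (count / size) * math.log(1 + n_classes / class_count[gram])
            for gram, count in total.items()
        }
    return scores
```

N-grams whose score exceeds a chosen selection threshold would then be mapped back onto the sequences to assemble contiguous motifs.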
ZifBASE: a database of zinc finger proteins and associated resources.
Jayakanthan, Mannu; Muthukumaran, Jayaraman; Chandrasekar, Sanniyasi; Chawla, Konika; Punetha, Ankita; Sundar, Durai
2009-09-09
Information on the occurrence of zinc finger protein motifs in genomes is crucial to the developing field of molecular genome engineering. The knowledge of their target DNA-binding sequences is vital to develop chimeric proteins for targeted genome engineering and site-specific gene correction. There is a need to develop a computational resource of zinc finger proteins (ZFP) to identify the potential binding sites and their locations, which reduces the time of in vivo tasks and overcomes the difficulties in selecting the specific type of zinc finger protein and the target site in the DNA sequence. ZifBASE provides an extensive collection of various natural and engineered ZFP. It uses standard names and a genetic and structural classification scheme to present data retrieved from UniProtKB, GenBank, Protein Data Bank, ModBase, Protein Model Portal and the literature. It also incorporates specialized features of ZFP including finger sequences and positions, number of fingers, physicochemical properties, classes, framework, PubMed citations with links to experimental structures (PDB, if available) and modeled structures of natural zinc finger proteins. ZifBASE provides information on zinc finger proteins (both natural and engineered ones), the number of finger units in each of the zinc finger proteins (with multiple fingers), the synergy between the adjacent fingers and their positions. Additionally, it gives the individual finger sequences and the target DNA sites to which they bind, for a better and clearer understanding of the interactions of adjacent fingers. The current version of ZifBASE contains 139 entries of which 89 are engineered ZFPs, containing 3-7F totaling to 296 fingers. There are 50 natural zinc finger protein entries ranging from 2-13F, totaling to 307 fingers. It has sequences and structures from literature, Protein Data Bank, ModBase and Protein Model Portal.
The interface is cross-linked to other public databases such as UniProtKB, PDB, ModBase, Protein Model Portal and PubMed to make it more informative. A database is established to maintain the information of the sequence features, including the class, framework, number of fingers, residues, position, recognition site and physicochemical properties (molecular weight, isoelectric point) of both natural and engineered zinc finger proteins, and the dissociation constants of a few. ZifBASE can provide a more effective and efficient way of accessing the zinc finger protein sequences and their target binding sites with the links to their three-dimensional structures. All the data and functions are available at the advanced web-based search interface http://web.iitd.ac.in/~sundar/zifbase.
Prediction of change in protein unfolding rates upon point mutations in two state proteins.
Chaudhary, Priyashree; Naganathan, Athi N; Gromiha, M Michael
2016-09-01
Studies on protein unfolding rates are limited and challenging due to the complexity of unfolding mechanism and the larger dynamic range of the experimental data. Though attempts have been made to predict unfolding rates using protein sequence-structure information there is no available method for predicting the unfolding rates of proteins upon specific point mutations. In this work, we have systematically analyzed a set of 790 single mutants and developed a robust method for predicting protein unfolding rates upon mutations (Δlnku) in two-state proteins by combining amino acid properties and knowledge-based classification of mutants with multiple linear regression technique. We obtain a mean absolute error (MAE) of 0.79/s and a Pearson correlation coefficient (PCC) of 0.71 between predicted unfolding rates and experimental observations using jack-knife test. We have developed a web server for predicting protein unfolding rates upon mutation and it is freely available at https://www.iitm.ac.in/bioinfo/proteinunfolding/unfoldingrace.html. Prominent features that determine unfolding kinetics as well as plausible reasons for the observed outliers are also discussed. Copyright © 2016 Elsevier B.V. All rights reserved.
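The two evaluation measures quoted in this abstract (MAE and PCC) can be computed as in the sketch below. The function names and the toy numbers in the usage note are illustrative assumptions; the regression model and the jack-knife procedure themselves are not reproduced here.

```python
import math

def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Undefined (division by zero) if either input has zero variance.
    return covariance / (spread_x * spread_y)
```

In a jack-knife (leave-one-out) test, each mutant's predicted value would come from a model fitted without that mutant, and the two functions would then be applied to the pooled predictions.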
Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier
2003-01-01
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.
ERIC Educational Resources Information Center
Ohio State Univ., Columbus. National Center for Research in Vocational Education.
"Classification Structures for Career Information" was created to provide Career Information Delivery Systems (CIDS) staff with pertinent and useful occupational information arranged according to the Standard Occupational Classification (SOC) structure. Through this publication, the National Occupational Information Coordinating…
[Identification of the Rubella virus genotype 1H in Western Siberia].
Seregin, S V; Babkin, I V; Petrova, I D; Iashina, L N; Malkova, E M; Petrov, V S
2011-01-01
Molecular epidemiological study of novel strain of Rubella virus isolated during the outbreak in Western Siberia in 2004 was described. Detailed phylogenetic analysis performed based upon entire SP-region, which encodes all three Rubella structural proteins (C, E2, and E1), was implemented. This analysis provides characterization of this strain and classifies it as 1H genotype, thereby correcting previous classification of this strain based upon shorter nucleotide sequence, only encoding E1 protein. Therefore, this study identified the genotype of the Rubella virus not previously detected in Western Siberia (and even entire Russian Federation), which highlights the importance of more extensive characterization of genetic variability of the Rubella virus, especially with regard to potential influence of vaccination on the Rubella virus mutagenesis.
PDB-Explorer: a web-based interactive map of the protein data bank in shape space.
Jin, Xian; Awale, Mahendra; Zasso, Michaël; Kostro, Daniel; Patiny, Luc; Reymond, Jean-Louis
2015-10-23
The RCSB Protein Data Bank (PDB) provides public access to experimentally determined 3D-structures of biological macromolecules (proteins, peptides and nucleic acids). While various tools are available to explore the PDB, options to access the global structural diversity of the entire PDB and to perceive relationships between PDB structures remain very limited. A 136-dimensional atom pair 3D-fingerprint for proteins (3DP) counting categorized atom pairs at increasing through-space distances was designed to represent the molecular shape of PDB-entries. Nearest neighbor search examples were reported, exemplifying the ability of 3DP-similarity to identify closely related biomolecules from small peptides to enzymes and large multiprotein complexes such as virus particles. Principal component analysis was used to obtain the visualization of PDB in 3DP-space. The 3DP property space groups proteins and protein assemblies according to their 3D-shape similarity, yet shows exquisite ability to distinguish between closely related structures. An interactive website called PDB-Explorer is presented featuring a color-coded interactive map of PDB in 3DP-space. Each pixel of the map contains one or more PDB-entries which are directly visualized as ribbon diagrams when the pixel is selected. The PDB-Explorer website allows performing 3DP-nearest neighbor searches of any PDB-entry or of any structure uploaded as protein-type PDB file. All functionalities on the website are implemented in JavaScript in a platform-independent manner and draw data from a server that is updated daily with the latest PDB additions, ensuring complete and up-to-date coverage. The essentially instantaneous 3DP-similarity search with the PDB-Explorer provides results comparable to those of much slower 3D-alignment algorithms, and automatically clusters proteins from the same superfamilies in tight groups.
A chemical space classification of PDB based on molecular shape was obtained using a new atom-pair 3D-fingerprint for proteins and implemented in a web-based database exploration tool comprising an interactive color-coded map of the PDB chemical space and a nearest neighbor search tool. The PDB-Explorer website is freely available at www.cheminfo.org/pdbexplorer and represents an unprecedented opportunity to interactively visualize and explore the structural diversity of the PDB.
HMMBinder: DNA-Binding Protein Prediction Using HMM Profile Based Features.
Zaman, Rianon; Chowdhury, Shahana Yasmin; Rashid, Mahmood A; Sharma, Alok; Dehzangi, Abdollah; Shatabda, Swakkhar
2017-01-01
DNA-binding proteins often play an important role in various processes within the cell. Over the last decade, a wide range of classification algorithms and feature extraction techniques have been used to identify them. In this paper, we propose a novel DNA-binding protein prediction method called HMMBinder. HMMBinder uses monogram and bigram features extracted from the HMM profiles of the protein sequences. To the best of our knowledge, this is the first application of HMM profile based features for the DNA-binding protein prediction problem. We applied Support Vector Machines (SVM) as a classification technique in HMMBinder. Our method was tested on standard benchmark datasets. We experimentally show that our method outperforms the state-of-the-art methods found in the literature.
Verkhivker, Gennady M
2016-01-01
The human protein kinome presents one of the largest protein families that orchestrate functional processes in complex cellular networks, and when perturbed, can cause various cancers. The abundance and diversity of genetic, structural, and biochemical data underlies the complexity of mechanisms by which targeted and personalized drugs can combat mutational profiles in protein kinases. Coupled with the evolution of system biology approaches, genomic and proteomic technologies are rapidly identifying and characterizing novel resistance mechanisms with the goal to inform rational design of personalized kinase drugs. Integration of experimental and computational approaches can help to bring these data into a unified conceptual framework and develop robust models for predicting the clinical drug resistance. In the current study, we employ a battery of synergistic computational approaches that integrate genetic, evolutionary, biochemical, and structural data to characterize the effect of cancer mutations in protein kinases. We provide a detailed structural classification and analysis of genetic signatures associated with oncogenic mutations. By integrating genetic and structural data, we employ network modeling to dissect mechanisms of kinase drug sensitivities to oncogenic EGFR mutations. Using biophysical simulations and analysis of protein structure networks, we show that conformational-specific drug binding of Lapatinib may elicit resistant mutations in the EGFR kinase that are linked with the ligand-mediated changes in the residue interaction networks and global network properties of key residues that are responsible for structural stability of specific functional states. A strong network dependency on high centrality residues in the conformation-specific Lapatinib-EGFR complex may explain vulnerability of drug binding to a broad spectrum of mutations and the emergence of drug resistance.
Our study offers a systems-based perspective on drug design by unravelling complex relationships between robustness of targeted kinase genes and binding specificity of targeted kinase drugs. We discuss how these approaches can exploit advances in chemical biology and network science to develop novel strategies for rationally tailored and robust personalized drug therapies.
Lauber, Chris; Gorbalenya, Alexander E
2012-04-01
Virus taxonomy has received little attention from the research community despite its broad relevance. In an accompanying paper (C. Lauber and A. E. Gorbalenya, J. Virol. 86:3890-3904, 2012), we have introduced a quantitative approach to hierarchically classify viruses of a family using pairwise evolutionary distances (PEDs) as a measure of genetic divergence. When applied to the six most conserved proteins of the Picornaviridae, it clustered 1,234 genome sequences in groups at three hierarchical levels (to which we refer as the "GENETIC classification"). In this study, we compare the GENETIC classification with the expert-based picornavirus taxonomy and outline differences in the underlying frameworks regarding the relation of virus groups and genetic diversity that represent, respectively, the structure and content of a classification. To facilitate the analysis, we introduce two novel diagrams. The first connects the genetic diversity of taxa to both the PED distribution and the phylogeny of picornaviruses. The second depicts a classification and the accommodated genetic diversity in a standardized manner. Generally, we found striking agreement between the two classifications on species and genus taxa. A few disagreements concern the species Human rhinovirus A and Human rhinovirus C and the genus Aphthovirus, which were split in the GENETIC classification. Furthermore, we propose a new supergenus level and universal, level-specific PED thresholds, not reached yet by many taxa. Since the species threshold is approached mostly by taxa with large sampling sizes and those infecting multiple hosts, it may represent an upper limit on divergence, beyond which homologous recombination in the six most conserved genes between two picornaviruses might not give viable progeny.
Deiana, Antonio; Giansanti, Andrea
2010-04-21
Natively unfolded proteins lack a well defined three dimensional structure but have important biological functions, suggesting a re-assignment of the structure-function paradigm. To assess that a given protein is natively unfolded requires laborious experimental investigations, then reliable sequence-only methods for predicting whether a sequence corresponds to a folded or to an unfolded protein are of interest in fundamental and applicative studies. Many proteins have amino acidic compositions compatible both with the folded and unfolded status, and belong to a twilight zone between order and disorder. This makes difficult a dichotomic classification of protein sequences into folded and natively unfolded ones. In this work we propose an operational method to identify proteins belonging to the twilight zone by combining into a consensus score good performing single predictors of folding. In this methodological paper dichotomic folding indexes are considered: hydrophobicity-charge, mean packing, mean pairwise energy, Poodle-W and a new global index, that is called here gVSL2, based on the local disorder predictor VSL2. The performance of these indexes is evaluated on different datasets, in particular on a new dataset composed by 2369 folded and 81 natively unfolded proteins. Poodle-W, gVSL2 and mean pairwise energy have good performance and stability in all the datasets considered and are combined into a strictly unanimous combination score SSU, that leaves proteins unclassified when the consensus of all combined indexes is not reached. The unclassified proteins: i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of folded and that of unfolded proteins. Our results show that proteins unclassified by SSU belong to a twilight zone. 
Proteins left unclassified by the consensus score SSU have physical properties intermediate between those of folded and those of natively unfolded proteins and their structural properties and evolutionary history are worth to be investigated.
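The strictly unanimous combination described above can be sketched in a few lines. This is only an illustration of the voting rule: the label strings and the function name are assumptions, and the individual predictors (Poodle-W, gVSL2, mean pairwise energy) are represented simply by their binary calls.

```python
def strictly_unanimous_score(calls):
    """Combine binary predictor calls; abstain unless every predictor agrees.

    calls: list of labels such as "folded" or "unfolded", one per predictor.
    Returns the shared label, or "unclassified" when the predictors disagree.
    """
    if not calls:
        raise ValueError("at least one predictor call is required")
    distinct = set(calls)
    return distinct.pop() if len(distinct) == 1 else "unclassified"
```

For example, three calls of "folded" give "folded", whereas two "folded" and one "unfolded" leave the protein unclassified, which is how candidates for the twilight zone are flagged.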
2010-01-01
Background Natively unfolded proteins lack a well defined three dimensional structure but have important biological functions, suggesting a re-assignment of the structure-function paradigm. To assess that a given protein is natively unfolded requires laborious experimental investigations, then reliable sequence-only methods for predicting whether a sequence corresponds to a folded or to an unfolded protein are of interest in fundamental and applicative studies. Many proteins have amino acidic compositions compatible both with the folded and unfolded status, and belong to a twilight zone between order and disorder. This makes difficult a dichotomic classification of protein sequences into folded and natively unfolded ones. In this work we propose an operational method to identify proteins belonging to the twilight zone by combining into a consensus score good performing single predictors of folding. Results In this methodological paper dichotomic folding indexes are considered: hydrophobicity-charge, mean packing, mean pairwise energy, Poodle-W and a new global index, that is called here gVSL2, based on the local disorder predictor VSL2. The performance of these indexes is evaluated on different datasets, in particular on a new dataset composed by 2369 folded and 81 natively unfolded proteins. Poodle-W, gVSL2 and mean pairwise energy have good performance and stability in all the datasets considered and are combined into a strictly unanimous combination score SSU, that leaves proteins unclassified when the consensus of all combined indexes is not reached. The unclassified proteins: i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of folded and that of unfolded proteins. 
Conclusions Our results show that proteins unclassified by SSU belong to a twilight zone. Proteins left unclassified by the consensus score SSU have physical properties intermediate between those of folded and those of natively unfolded proteins and their structural properties and evolutionary history are worth to be investigated. PMID:20409339
A PDB-wide, evolution-based assessment of protein-protein interfaces.
Baskaran, Kumaran; Duarte, Jose M; Biyani, Nikhil; Bliven, Spencer; Capitani, Guido
2014-10-18
Thanks to the growth in sequence and structure databases, more than 50 million sequences are now available in UniProt and 100,000 structures in the PDB. Rich information about protein-protein interfaces can be obtained by a comprehensive study of protein contacts in the PDB, their sequence conservation and geometric features. An automated computational pipeline was developed to run our Evolutionary Protein-Protein Interface Classifier (EPPIC) software on the entire PDB and store the results in a relational database, currently containing > 800,000 interfaces. This allows the analysis of interface data on a PDB-wide scale. Two large benchmark datasets of biological interfaces and crystal contacts, each containing about 3000 entries, were automatically generated based on criteria thought to be strong indicators of interface type. The BioMany set of biological interfaces includes NMR dimers solved as crystal structures and interfaces that are preserved across diverse crystal forms, as catalogued by the Protein Common Interface Database (ProtCID) from Xu and Dunbrack. The second dataset, XtalMany, is derived from interfaces that would lead to infinite assemblies and are therefore crystal contacts. BioMany and XtalMany were used to benchmark the EPPIC approach. The performance of EPPIC was also compared to classifications from the Protein Interfaces, Surfaces, and Assemblies (PISA) program on a PDB-wide scale, finding that the two approaches give the same call in about 88% of PDB interfaces. By comparing our safest predictions to the PDB author annotations, we provide a lower-bound estimate of the error rate of biological unit annotations in the PDB. Additionally, we developed a PyMOL plugin for direct download and easy visualization of EPPIC interfaces for any PDB entry. Both the datasets and the PyMOL plugin are available at http://www.eppic-web.org/ewui/#downloads.
Our computational pipeline allows us to analyze protein-protein contacts and their sequence conservation across the entire PDB. Two new benchmark datasets are provided, which are over an order of magnitude larger than existing manually curated ones. These tools enable the comprehensive study of several aspects of protein-protein contacts in the PDB and represent a basis for future, even larger scale studies of protein-protein interactions.
[Landscape classification: research progress and development trend].
Liang, Fa-Chao; Liu, Li-Ming
2011-06-01
Landscape classification is the basis of the researches on landscape structure, process, and function, and also, the prerequisite for landscape evaluation, planning, protection, and management, directly affecting the precision and practicability of landscape research. This paper reviewed the research progress on the landscape classification system, theory, and methodology, and summarized the key problems and deficiencies of current researches. Some major landscape classification systems, e.g., LANMAP and MUFIC, were introduced and discussed. It was suggested that a qualitative and quantitative comprehensive classification based on the ideology of functional structure shape and on the integral consideration of landscape classification utility, landscape function, landscape structure, physiogeographical factors, and human disturbance intensity should be the major research directions in the future. The integration of mapping, 3S technology, quantitative mathematics modeling, computer artificial intelligence, and professional knowledge to enhance the precision of landscape classification would be the key issues and the development trend in the researches of landscape classification.
Cysteine-rich domains related to Frizzled receptors and Hedgehog-interacting proteins
Pei, Jimin; Grishin, Nick V
2012-01-01
Frizzled and Smoothened are homologous seven-transmembrane proteins functioning in the Wnt and Hedgehog signaling pathways, respectively. They harbor an extracellular cysteine-rich domain (FZ-CRD), a mobile evolutionary unit that has been found in a number of other metazoan proteins and Frizzled-like proteins in Dictyostelium. Domains distantly related to FZ-CRDs, in Hedgehog-interacting proteins (HHIPs), folate receptors and riboflavin-binding proteins (FRBPs), and Niemann-Pick Type C1 proteins (NPC1s), referred to as HFN-CRDs, exhibit similar structures and disulfide connectivity patterns compared with FZ-CRDs. We used computational analyses to expand the homologous set of FZ-CRDs and HFN-CRDs, providing a better understanding of their evolution and classification. First, FZ-CRD-containing proteins with various domain compositions were identified in several major eukaryotic lineages including plants and Chromalveolata, revealing a wider phylogenetic distribution of FZ-CRDs than previously recognized. Second, two new and distinct groups of highly divergent FZ-CRDs were found by sensitive similarity searches. One of them is present in the calcium channel component Mid1 in fungi and the uncharacterized FAM155 proteins in metazoans. Members of the other new FZ-CRD group occur in the metazoan-specific RECK (reversion-inducing-cysteine-rich protein with Kazal motifs) proteins that are putative tumor suppressors acting as inhibitors of matrix metalloproteases. Finally, sequence and three-dimensional structural comparisons helped us uncover a divergent HFN-CRD in glypicans, which are important morphogen-binding heparan sulfate proteoglycans. Such a finding reinforces the evolutionary ties between the Wnt and Hedgehog signaling pathways and underscores the importance of gene duplications in creating essential signaling components in metazoan evolution. PMID:22693159
Signal peptide discrimination and cleavage site identification using SVM and NN.
Kazemian, H B; Yusuf, S A; White, K
2014-02-01
About 15% of all proteins in a genome contain a signal peptide (SP) sequence, at the N-terminus, that targets the protein to intracellular secretory pathways. Once the protein is targeted correctly in the cell, the SP is cleaved, releasing the mature protein. Accurate prediction of the presence of these short amino-acid SP chains is crucial for modelling the topology of membrane proteins, since SP sequences can be confused with transmembrane domains due to similar composition of hydrophobic amino acids. This paper presents a cascaded Support Vector Machine (SVM)-Neural Network (NN) classification methodology for SP discrimination and cleavage site identification. The proposed method utilises a dual phase classification approach using SVM as a primary classifier to discriminate SP sequences from Non-SP. The methodology further employs NNs to predict the most suitable cleavage site candidates. In phase one, a SVM classification utilises hydrophobic propensities as a primary feature vector extraction using symmetric sliding window amino-acid sequence analysis for discrimination of SP and Non-SP. In phase two, a NN classification uses asymmetric sliding window sequence analysis for prediction of cleavage site identification. The proposed SVM-NN method was tested using Uni-Prot non-redundant datasets of eukaryotic and prokaryotic proteins with SP and Non-SP N-termini. Computer simulation results demonstrate an overall accuracy of 0.90 for SP and Non-SP discrimination based on Matthews Correlation Coefficient (MCC) tests using SVM. For SP cleavage site prediction, the overall accuracy is 91.5% based on cross-validation tests using the novel SVM-NN model. © 2013 Published by Elsevier Ltd.
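Since the SP versus Non-SP result above is reported as a Matthews Correlation Coefficient, a short sketch of that measure may help. It takes confusion-matrix counts; the function name is hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here, not something stated in the abstract.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined cases are reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), so it is a correlation and not a percentage accuracy.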
Mansoor, S E; McHaourab, H S; Farrens, D L
1999-12-07
We report an investigation of how much protein structural information could be obtained using a site-directed fluorescence labeling (SDFL) strategy. In our experiments, we used 21 consecutive single-cysteine substitution mutants in T4 lysozyme (residues T115-K135), located in a helix-turn-helix motif. The mutants were labeled with the fluorescent probe monobromobimane and subjected to an array of fluorescence measurements. Thermal stability measurements show that introduction of the label is substantially perturbing only when it is located at buried residue sites. At buried sites (solvent surface accessibility of <40 Å²), the destabilizations are between 3 and 5.5 kcal/mol, whereas at more exposed sites, ΔΔG values of ≤1.5 kcal/mol are obtained. Of all the fluorescence parameters that were explored (excitation λmax, emission λmax, fluorescence lifetime, quantum yield, and steady-state anisotropy), the emission λmax and the steady-state anisotropy values most accurately reflect the solvent surface accessibility at each site as calculated from the crystal structure of cysteine-less T4 lysozyme. The parameters we identify allow the classification of each site as buried, partially buried, or exposed. We find that the variations in these parameters as a function of residue number reflect the sequence-specific secondary structure, the determination of which is a key step for modeling a protein of unknown structure.
Learning about the internal structure of categories through classification and feature inference.
Jee, Benjamin D; Wiley, Jennifer
2014-01-01
Previous research on category learning has found that classification tasks produce representations that are skewed toward diagnostic feature dimensions, whereas feature inference tasks lead to richer representations of within-category structure. Yet, prior studies often measure category knowledge through tasks that involve identifying only the typical features of a category. This neglects an important aspect of a category's internal structure: how typical and atypical features are distributed within a category. The present experiments tested the hypothesis that inference learning results in richer knowledge of internal category structure than classification learning. We introduced several new measures to probe learners' representations of within-category structure. Experiment 1 found that participants in the inference condition learned and used a wider range of feature dimensions than classification learners. Classification learners, however, were more sensitive to the presence of atypical features within categories. Experiment 2 provided converging evidence that classification learners were more likely to incorporate atypical features into their representations. Inference learners were less likely to encode atypical category features, even in a "partial inference" condition that focused learners' attention on the feature dimensions relevant to classification. Overall, these results are contrary to the hypothesis that inference learning produces superior knowledge of within-category structure. Although inference learning promoted representations that included a broad range of category-typical features, classification learning promoted greater sensitivity to the distribution of typical and atypical features within categories.
Genic insights from integrated human proteomics in GeneCards.
Fishilevich, Simon; Zimmerman, Shahar; Kohn, Asher; Iny Stein, Tsippi; Olender, Tsviya; Kolker, Eugene; Safran, Marilyn; Lancet, Doron
2016-01-01
GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publicly available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite's next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein-RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information. The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome. Database URL: http://www.genecards.org/. © The Author(s) 2016. Published by Oxford University Press.
[NiFe] hydrogenases: a common active site for hydrogen metabolism under diverse conditions.
Shafaat, Hannah S; Rüdiger, Olaf; Ogata, Hideaki; Lubitz, Wolfgang
2013-01-01
Hydrogenase proteins catalyze the reversible conversion of molecular hydrogen to protons and electrons. The most abundant hydrogenases contain a [NiFe] active site; these proteins are generally biased towards hydrogen oxidation activity and are reversibly inhibited by oxygen. However, there are [NiFe] hydrogenases that exhibit unique properties, including aerobic hydrogen oxidation and preferential hydrogen production activity; these proteins are highly relevant in the context of biotechnological devices. This review describes four classes of these "nonstandard" [NiFe] hydrogenases and discusses the electrochemical, spectroscopic, and structural studies that have been used to understand the mechanisms behind this exceptional behavior. A revised classification protocol is suggested in the conclusions, particularly with respect to the term "oxygen-tolerance". This article is part of a special issue entitled: metals in bioenergetics and biomimetics systems. Copyright © 2013 Elsevier B.V. All rights reserved.
Evaluating Functional Annotations of Enzymes Using the Gene Ontology.
Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C
2017-01-01
The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. Tasks ranging from identifying the nearest well-annotated homologue of a protein of interest, to predicting where misannotation has occurred, to knowing how confident you can be in the annotations assigned to those proteins are all critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene product, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.
A New Secondary Structure Assignment Algorithm Using Cα Backbone Fragments
Cao, Chen; Wang, Guishen; Liu, An; Xu, Shutan; Wang, Lincong; Zou, Shuxue
2016-01-01
The assignment of secondary structure elements in proteins is a key step in the analysis of their structures and functions. We have developed an algorithm, SACF (secondary structure assignment based on Cα fragments), for secondary structure element (SSE) assignment based on the alignment of Cα backbone fragments with central poses derived by clustering known SSE fragments. The assignment algorithm consists of three steps: First, the outlier fragments on known SSEs are detected. Next, the remaining fragments are clustered to obtain the central fragments for each cluster. Finally, the central fragments are used as a template to make assignments. Following a large-scale comparison of 11 secondary structure assignment methods, SACF, KAKSI and PROSS are found to have similar agreement with DSSP, while PCASSO agrees with DSSP best. SACF and PCASSO show a preference for reducing residues in N and C cap regions, whereas KAKSI, P-SEA and SEGNO tend to add residues to the termini when the DSSP assignment is taken as the standard. Moreover, our algorithm is able to assign subtle helices (3₁₀-helix, π-helix and left-handed helix) and make uniform assignments, as well as to detect rare SSEs in β-sheets or long helices as outlier fragments from other programs. The structural uniformity should be useful for protein structure classification and prediction, while outlier fragments underlie the structure–function relationship. PMID:26978354
AllergenFP: allergenicity prediction by descriptor fingerprints.
Dimitrov, Ivan; Naneva, Lyudmila; Doytchinova, Irini; Bangov, Ivan
2014-03-15
Allergenicity, like antigenicity and immunogenicity, is a property encoded linearly and non-linearly, and therefore the alignment-based approaches are not able to identify this property unambiguously. A novel alignment-free descriptor-based fingerprint approach is presented here and applied to identify allergens and non-allergens. The approach was implemented as a four-step algorithm. Initially, the protein sequences are described by amino acid principal properties such as hydrophobicity, size, relative abundance, helix and β-strand forming propensities. Then, the generated strings of different length are converted into vectors with equal length by auto- and cross-covariance (ACC). The vectors were transformed into binary fingerprints and compared in terms of Tanimoto coefficient. The approach was applied to a set of 2427 known allergens and 2427 non-allergens and identified correctly 88% of them with Matthews correlation coefficient of 0.759. The descriptor fingerprint approach presented here is universal. It could be applied for any classification problem in computational biology. The set of E-descriptors is able to capture the main structural and physicochemical properties of amino acids building the proteins. The ACC transformation overcomes the main problem in the alignment-based comparative studies arising from the different length of the aligned protein sequences. The conversion of protein ACC values into binary descriptor fingerprints allows similarity search and classification. The algorithm described in the present study was implemented in a specially designed Web site, named AllergenFP (FP stands for FingerPrint). AllergenFP is written in Python, with GUI in HTML. It is freely accessible at http://ddg-pharmfac.net/AllergenFP. Contact: idoytchinova@pharmfac.net or ivanbangov@shu-bg.net.
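The ACC, fingerprint, and Tanimoto steps named in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the descriptor values, the maximum lag, or the rule for binarising ACC values, so the function names, the un-centred covariance form, and the `threshold=0.0` binarisation below are hypothetical choices rather than the AllergenFP implementation.

```python
def acc(descriptors, j, k, lag):
    """Auto- (j == k) or cross- (j != k) covariance at a given lag.

    descriptors: one tuple of descriptor values per residue.
    Assumed form: sum of E[j][i] * E[k][i + lag], divided by (n - lag).
    """
    n = len(descriptors)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < sequence length")
    total = sum(descriptors[i][j] * descriptors[i + lag][k] for i in range(n - lag))
    return total / (n - lag)


def acc_vector(descriptors, max_lag):
    """Fixed-length vector (max_lag * d * d values) for any sequence length."""
    d = len(descriptors[0])
    return [acc(descriptors, j, k, lag)
            for lag in range(1, max_lag + 1)
            for j in range(d)
            for k in range(d)]


def to_fingerprint(vector, threshold=0.0):
    """Binarise ACC values; the threshold rule is an assumption."""
    return [1 if v > threshold else 0 for v in vector]


def tanimoto(fp_a, fp_b):
    """Shared set bits divided by bits set in either fingerprint."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 1.0
```

Because the vector length depends only on the number of descriptors and the maximum lag, two proteins of different length yield fingerprints that can be compared directly, which is the point the abstract makes about avoiding alignment.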
Hierarchical structure for audio-video based semantic classification of sports video sequences
NASA Astrophysics Data System (ADS)
Kolekar, M. H.; Sengupta, S.
2005-07-01
A hierarchical structure for sports event classification based on audio and video content analysis is proposed in this paper. Compared to the event classifications in other games, those of cricket are very challenging and yet unexplored. We have successfully solved the cricket video classification problem using a six-level hierarchical structure. The first level performs event detection based on audio energy and Zero Crossing Rate (ZCR) of the short-time audio signal. In the subsequent levels, we classify the events based on video features using a Hidden Markov Model implemented through Dynamic Programming (HMM-DP) using color or motion as a likelihood function. For some of the game-specific decisions, a rule-based classification is also performed. Our proposed hierarchical structure can easily be applied to other sports. Our results are very promising and we have moved a step forward towards addressing semantic classification problems in general.
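The two first-level audio features mentioned here, short-time energy and zero crossing rate, have standard definitions that a short Python sketch can make concrete. The frame size, hop length, and function names are assumptions for illustration; the abstract does not state the parameters or thresholds the authors used.

```python
def frames(signal, size, hop):
    """Split a sample list into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)
```

An event detector along these lines might flag frames whose energy exceeds some threshold (crowd noise, for example), with ZCR helping to separate noisy from tonal segments; any such thresholds would have to be tuned on real data.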
Morgan, Alexander A.; Rubenstein, Edward
2013-01-01
Proline is an anomalous amino acid. Its nitrogen atom is covalently locked within a ring, thus it is the only proteinogenic amino acid with a constrained phi angle. Sequences of three consecutive prolines can fold into polyproline helices, structures that join alpha helices and beta pleats as architectural motifs in protein configuration. Triproline helices are participants in protein-protein signaling interactions. Longer spans of repeat prolines also occur, containing as many as 27 consecutive proline residues. Little is known about the frequency, positioning, and functional significance of these proline sequences. Therefore we have undertaken a systematic bioinformatics study of proline residues in proteins. We analyzed the distribution and frequency of 687,434 proline residues among 18,666 human proteins, identifying single residues, dimers, trimers, and longer repeats. Proline accounts for 6.3% of the 10,882,808 protein amino acids. Of all proline residues, 4.4% are in trimers or longer spans. We detected patterns that influence function based on proline location, spacing, and concentration. We propose a classification based on proline-rich, polyproline-rich, and proline-poor status. Whereas singlet proline residues are often found in proteins that display recurring architectural patterns, trimers or longer proline sequences tend to be associated with the absence of repetitive structural motifs. Spans of 6 or more are associated with DNA/RNA processing, actin, and developmental processes. We also suggest a role for proline in Kruppel-type zinc finger protein control of DNA expression, and in the nucleation and translocation of actin by the formin complex. PMID:23372670
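The counting of single prolines, dimers, trimers, and longer repeats described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline; the function names and the one-letter sequence input are assumptions.

```python
import re
from collections import Counter


def proline_runs(seq):
    """Map run length -> number of runs of consecutive prolines (P)."""
    return Counter(len(m.group()) for m in re.finditer(r"P+", seq))


def fraction_in_long_runs(seq, min_len=3):
    """Fraction of all proline residues that sit in runs of min_len or more."""
    runs = proline_runs(seq)
    total = sum(length * count for length, count in runs.items())
    in_long = sum(length * count for length, count in runs.items() if length >= min_len)
    return in_long / total if total else 0.0
```

Applied across a whole proteome, the second function would give the kind of figure the abstract reports (4.4% of prolines in trimers or longer spans). The reported overall share is also easy to check by hand: 687,434 / 10,882,808 ≈ 0.063, consistent with the stated 6.3%.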
MAMPs and MIMPs: proposed classifications for inducers of innate immunity.
Mackey, David; McFall, Aidan J
2006-09-01
Plants encode a sophisticated innate immune system. Resistance against potential pathogens often relies on active responses. Prerequisite to the induction of defences is recognition of the pathogenic threat. Significant advances have been made in our understanding of the non-self molecules that are recognized by plants and the means by which plants perceive them. Established terms describing these recognition events, including microbe-associated molecular pattern (MAMP), MAMP-receptor, effector, and resistance (R) protein, need clarification to represent our current knowledge adequately. In this review we propose criteria to classify inducers of plant defence as either MAMPs or microbe-induced molecular patterns (MIMPs). We refine the definition of MAMP to mean a molecular sequence or structure in ANY pathogen-derived molecule that is perceived via direct interaction with a host defence receptor. MIMPs are modifications of host-derived molecules that are induced by an intrinsic activity of a pathogen-derived effector and are perceived by a host defence receptor. MAMP-receptors have previously been classified separately from R-proteins as a discrete class of surveillance molecules. However, MAMP-receptors and R-proteins cannot be distinguished on the basis of their protein structures or their induced responses. We propose that MAMP-receptors and MIMP-receptors are each a subset of R-proteins. Although our review is based on examples from plant pathogens and plants, the principles discussed might prove applicable to other organisms.
Lin, Yi; Cai, Fu-Ying; Zhang, Guang-Ya
2007-01-01
A quantitative structure-property relationship (QSPR) model in terms of amino acid composition and the activity of Bacillus thuringiensis insecticidal crystal proteins was established. Support vector machine (SVM) is a novel general machine-learning tool based on the structural risk minimization principle that exhibits good generalization when fault samples are few; it is especially suitable for classification, forecasting, and estimation in cases where small amounts of samples are involved such as fault diagnosis; however, some parameters of SVM are selected based on the experience of the operator, which has led to decreased efficiency of SVM in practical application. The uniform design (UD) method was applied to optimize the running parameters of SVM. It was found that the average accuracy rate approached 73% when the penalty factor was 0.01, the epsilon 0.2, the gamma 0.05, and the range 0.5. The results indicated that UD might be used as an effective method to optimize the parameters of SVM, and that SVM could be used as an alternative powerful modeling tool for QSPR studies of the activity of Bacillus thuringiensis (Bt) insecticidal crystal proteins. Therefore, a novel method for predicting the insecticidal activity of Bt insecticidal crystal proteins was proposed by the authors of this study.
Mocz, G.
1995-01-01
Fuzzy cluster analysis has been applied to the 20 amino acids by using 65 physicochemical properties as a basis for classification. The clustering products, the fuzzy sets (i.e., classical sets with associated membership functions), have provided a new measure of amino acid similarities for use in protein folding studies. This work demonstrates that fuzzy sets of simple molecular attributes, when assigned to amino acid residues in a protein's sequence, can predict the secondary structure of the sequence with reasonable accuracy. An approach is presented for discriminating standard folding states, using near-optimum information splitting in half-overlapping segments of the sequence of assigned membership functions. The method is applied to a nonredundant set of 252 proteins and yields approximately 73% matching for correctly predicted and correctly rejected residues with approximately 60% overall success rate for the correctly recognized ones in three folding states: alpha-helix, beta-strand, and coil. The most useful attributes for discriminating these states appear to be related to size, polarity, and thermodynamic factors. Van der Waals volume, apparent average thickness of surrounding molecular free volume, and a measure of dimensionless surface electron density can explain approximately 95% of prediction results. Hydrogen bonding and hydrophobicity indices do not yet enable clear clustering and prediction. PMID:7549882
NASA Astrophysics Data System (ADS)
Eid, Sameh; Saleh, Noureldin; Zalewski, Adam; Vedani, Angelo
2014-12-01
Carbohydrates play a key role in a variety of physiological and pathological processes and, hence, represent a rich source for the development of novel therapeutic agents. Being able to predict binding mode and binding affinity is an essential, yet lacking, aspect of the structure-based design of carbohydrate-based ligands. We assembled a diverse data set comprising 273 carbohydrate-protein crystal structures with known binding affinity and evaluated the prediction accuracy of a large collection of well-established scoring and free-energy functions, as well as combinations thereof. Unfortunately, the tested functions were not capable of reproducing binding affinities in the studied complexes. To simplify the complex free-energy surface of carbohydrate-protein systems, we classified the studied proteins according to the topology and solvent exposure of the carbohydrate-binding site into five distinct categories. A free-energy model based on the proposed classification scheme reproduced binding affinities in the carbohydrate data set with an r² of 0.71 and root-mean-squared-error of 1.25 kcal/mol (N = 236). The improvement in model performance underlines the significance of the differences in the local micro-environments of carbohydrate-binding sites and demonstrates the usefulness of calibrating free-energy functions individually according to binding-site topology and solvent exposure.
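The two performance figures quoted in this abstract, r² and root-mean-squared error, can be computed from paired predicted and measured affinities as in the Python sketch below. It assumes that r² denotes the squared Pearson correlation, which the abstract does not state explicitly (a coefficient of determination is another common reading); the function names are illustrative.

```python
import math


def rmse(predicted, observed):
    """Root-mean-squared error, in the units of the inputs (e.g. kcal/mol)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def pearson_r2(predicted, observed):
    """Squared Pearson correlation between predictions and observations."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov * cov / (var_p * var_o)
```

The two measures answer different questions: a model can correlate well with experiment (high r²) while still being systematically offset (high RMSE), which is why both are usually reported together.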
JNK Signaling: Regulation and Functions Based on Complex Protein-Protein Partnerships.
Zeke, András; Misheva, Mariya; Reményi, Attila; Bogoyevitch, Marie A
2016-09-01
The c-Jun N-terminal kinases (JNKs), as members of the mitogen-activated protein kinase (MAPK) family, mediate eukaryotic cell responses to a wide range of abiotic and biotic stress insults. JNKs also regulate important physiological processes, including neuronal functions, immunological actions, and embryonic development, via their impact on gene expression, cytoskeletal protein dynamics, and cell death/survival pathways. Although the JNK pathway has been under study for >20 years, its complexity is still perplexing, with multiple protein partners of JNKs underlying the diversity of actions. Here we review the current knowledge of JNK structure and isoforms as well as the partnerships of JNKs with a range of intracellular proteins. Many of these proteins are direct substrates of the JNKs. We analyzed almost 100 of these target proteins in detail within a framework of their classification based on their regulation by JNKs. Examples of these JNK substrates include a diverse assortment of nuclear transcription factors (Jun, ATF2, Myc, Elk1), cytoplasmic proteins involved in cytoskeleton regulation (DCX, Tau, WDR62) or vesicular transport (JIP1, JIP3), cell membrane receptors (BMPR2), and mitochondrial proteins (Mcl1, Bim). In addition, because upstream signaling components impact JNK activity, we critically assessed the involvement of signaling scaffolds and the roles of feedback mechanisms in the JNK pathway. Despite a clarification of many regulatory events in JNK-dependent signaling during the past decade, many other structural and mechanistic insights are just beginning to be revealed. These advances open new opportunities to understand the role of JNK signaling in diverse physiological and pathophysiological states. Copyright © 2016, American Society for Microbiology. All Rights Reserved.
Ravagnani, Adriana; Finan, Christopher L; Young, Michael
2005-03-17
In Micrococcus luteus growth and resuscitation from starvation-induced dormancy is controlled by the production of a secreted growth factor. This autocrine resuscitation-promoting factor (Rpf) is the founder member of a family of proteins found throughout and confined to the actinobacteria (high G + C Gram-positive bacteria). The aim of this work was to search for and characterise a cognate gene family in the firmicutes (low G + C Gram-positive bacteria) and obtain information about how they may control bacterial growth and resuscitation. In silico analysis of the accessory domains of the Rpf proteins permitted their classification into several subfamilies. The RpfB subfamily is related to a group of firmicute proteins of unknown function, represented by YabE of Bacillus subtilis. The actinobacterial RpfB and firmicute YabE proteins have very similar domain structures and genomic contexts, except that in YabE, the actinobacterial Rpf domain is replaced by another domain, which we have called Sps. Although totally unrelated in both sequence and secondary structure, the Rpf and Sps domains fulfil the same function. We propose that these proteins have undergone "non-orthologous domain displacement", a phenomenon akin to "non-orthologous gene displacement" that has been described previously. Proteins containing the Sps domain are widely distributed throughout the firmicutes and they too fall into a number of distinct subfamilies. Comparative analysis of the accessory domains in the Rpf and Sps proteins, together with their weak similarity to lytic transglycosylases, provide clear evidence that they are muralytic enzymes. The results indicate that the firmicute Sps proteins and the actinobacterial Rpf proteins are cognate and that they control bacterial culturability via enzymatic modification of the bacterial cell envelope.
Inherited Congenital Cataract: A Guide to Suspect the Genetic Etiology in the Cataract Genesis
Messina-Baas, Olga; Cuevas-Covarrubias, Sergio A.
2017-01-01
Cataracts are the principal cause of treatable blindness worldwide. Inherited congenital cataract (CC) shows all types of inheritance patterns in a syndromic and nonsyndromic form. There are more than 100 genes associated with cataract with a predominance of autosomal dominant inheritance. A cataract is defined as an opacity of the lens producing a variation of the refractive index of the lens. This variation derives from modifications in the lens structure resulting in light scattering, frequently a consequence of a significant concentration of high-molecular-weight protein aggregates. The aim of this review is to introduce a guide to identify the gene involved in inherited CC. Due to the manifold clinical and genetic heterogeneity, we discarded the cataract phenotype as a cardinal sign; a 4-group classification with the genes implicated in inherited CC is proposed. We consider that this classification will assist in identifying the probable gene involved in inherited CC. PMID:28611546
[Biologics - nomenclature and classification].
Eichbaum, Christine; Haefeli, Walter E
2011-11-01
Biological medicines are a heterogeneous group of drugs that are produced by living organisms using genetic or biological technology. Unlike chemically derived small molecules, biologics are structurally complex, making characterization and manufacturing difficult. Moreover, biological medicines show a great variety concerning their clinical use. To appropriately consider these particularities, there are other standards and guidelines for approval of similar derivatives of biologics, the so-called biosimilars or follow-on biologics. In contrast to a generic medicinal product containing a chemically identical active ingredient, a biosimilar is only expected to be similar to the innovator drug. Nowadays, monoclonal antibodies, fragments of antibodies, and fusion proteins manufactured by recombinant procedures play an important role. They have been used in many specialties for diagnostic and therapeutic purposes and are subject to continuous further development and improvement. Their nomenclature is based on a classification by the WHO which allows drawing conclusions for class of substance, origin, and pharmacological target.
Predictive Structure-Based Toxicology Approaches To Assess the Androgenic Potential of Chemicals.
Trisciuzzi, Daniela; Alberga, Domenico; Mansouri, Kamel; Judson, Richard; Novellino, Ettore; Mangiatordi, Giuseppe Felice; Nicolotti, Orazio
2017-11-27
We present a practical and easy-to-run in silico workflow exploiting a structure-based strategy making use of docking simulations to derive highly predictive classification models of the androgenic potential of chemicals. Models were trained on a high-quality chemical collection comprising 1689 curated compounds made available within the CoMPARA consortium from the US Environmental Protection Agency and were integrated with a two-step applicability domain whose implementation had the effect of improving both the confidence in prediction and statistics by reducing the number of false negatives. Among the nine androgen receptor X-ray solved structures, the crystal 2PNU (entry code from the Protein Data Bank) was associated with the best performing structure-based classification model. Three validation sets, each comprising 2590 compounds extracted from the DUD-E collection, were used to challenge model performance and the effectiveness of Applicability Domain implementation. Next, the 2PNU model was applied to screen and prioritize two collections of chemicals. The first is a small pool of 12 representative androgenic compounds that were accurately classified based on outstanding rationale at the molecular level. The second is a large external blind set of 55450 chemicals with potential for human exposure. We show how the use of molecular docking provides highly interpretable models and can represent a real-life option as an alternative nontesting method for predictive toxicology.
Kumar, Keshav; Espaillat, Akbar; Cava, Felipe
2017-01-01
Bacterial cells are protected from osmotic and environmental stresses by an exoskeleton-like polymeric structure called peptidoglycan (PG) or murein sacculus. This structure is fundamental for bacterial viability and thus, the mechanisms underlying cell wall assembly and how it is modulated serve as targets for many of our most successful antibiotics. Therefore, it is now more important than ever to understand the genetics and structural chemistry of the bacterial cell walls in order to find new and effective methods of blocking it for the treatment of disease. In the last decades, liquid chromatography and mass spectrometry have been demonstrated to provide the required resolution and sensitivity to characterize the fine chemical structure of PG. However, the large volumes of data that can be produced by these instruments today are difficult to handle without a proper data analysis workflow. Here, we present PG-metrics, a chemometric based pipeline that allows fast and easy classification of bacteria according to their muropeptide chromatographic profiles and identification of the underlying PG chemical variability between, e.g., bacterial species, growth conditions, and mutant libraries. The pipeline is successfully validated here using PG samples from different bacterial species and mutants in cell wall proteins. The obtained results clearly demonstrated that the PG-metrics pipeline is a valuable bioanalytical tool that can lead us to cell wall classification and biomarker discovery. PMID:29040278
Ali, Safdar; Majid, Abdul; Javed, Syed Gibran; Sattar, Mohsin
2016-06-01
Early prediction of breast cancer is important for effective treatment and survival. We developed an effective Cost-Sensitive Classifier with GentleBoost Ensemble (Can-CSC-GBE) for the classification of breast cancer using protein amino acid features. In this work, first, discriminant information of the protein sequences related to breast tissue is extracted. Then, the physicochemical properties hydrophobicity and hydrophilicity of amino acids are employed to generate molecule descriptors in different feature spaces. For comparison, we obtained results by combining Cost-Sensitive learning with conventional ensemble of AdaBoostM1 and Bagging. The proposed Can-CSC-GBE system has effectively reduced the misclassification costs and thereby improved the overall classification performance. Our novel approach has highlighted promising results as compared to the state-of-the-art ensemble approaches. Copyright © 2016 Elsevier Ltd. All rights reserved.
Wang, Shunfang; Liu, Shuhui
2015-12-19
An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position specific scoring matrix (PSSM), are insufficient to represent protein sequence due to their single perspectives. Thus, this paper proposes two fusion feature representations of DipPSSM and PseAAPSSM to integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce the balance factors to value the importance of its components. The optimal values of the balance factors are sought by genetic algorithm. Due to the high dimensionality of the proposed representations, linear discriminant analysis (LDA) is used to find its important low dimensional structure, which is essential for classification and location prediction. The numerical experiments on two public datasets with KNN classifier and cross-validation tests showed that in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusing representations outperform the traditional representations in protein sub-nuclear localization, and the representation treated by LDA outperforms the untreated one. PMID:26703574
Reconstruction of the experimentally supported human protein interactome: what can we learn?
Klapa, Maria I; Tsafou, Kalliopi; Theodoridis, Evangelos; Tsakalidis, Athanasios; Moschonas, Nicholas K
2013-10-02
Understanding the topology and dynamics of the human protein-protein interaction (PPI) network will significantly contribute to biomedical research, therefore its systematic reconstruction is required. Several meta-databases integrate source PPI datasets, but the protein node sets of their networks vary depending on the PPI data combined. Due to this inherent heterogeneity, the way in which the human PPI network expands via multiple dataset integration has not been comprehensively analyzed. We aim at assembling the human interactome in a global structured way and exploring it to gain insights of biological relevance. First, we defined the UniProtKB manually reviewed human "complete" proteome as the reference protein-node set and then we mined five major source PPI datasets for direct PPIs exclusively between the reference proteins. We updated the protein and publication identifiers and normalized all PPIs to the UniProt identifier level. The reconstructed interactome covers approximately 60% of the human proteome and has a scale-free structure. No apparent differentiating gene functional classification characteristics were identified for the unrepresented proteins. The source dataset integration augments the network mainly in PPIs. Polyubiquitin emerged as the highest-degree node, but the inclusion of most of its identified PPIs may be reconsidered. The high number (>300) of connections of the subsequent fifteen proteins correlates well with their essential biological role. According to the power-law network structure, the unrepresented proteins should mainly have up to four connections with equally poorly-connected interactors. Reconstructing the human interactome based on the a priori definition of the protein nodes enabled us to identify the currently included part of the human "complete" proteome, and discuss the role of the proteins within the network topology with respect to their function. 
As the network expansion has to comply with the scale-free theory, we suggest that the core of the human interactome has essentially emerged. Thus, it could be employed in systems biology and biomedical research, despite the considerable number of currently unrepresented proteins. The latter are probably involved in specialized physiological conditions, justifying the scarcity of related PPI information, and their identification can assist in designing relevant functional experiments and targeted text mining algorithms.
The Classification and Evolution of Enzyme Function.
Martínez Cuesta, Sergio; Rahman, Syed Asad; Furnham, Nicholas; Thornton, Janet M
2015-09-15
Enzymes are the proteins responsible for the catalysis of life. Enzymes sharing a common ancestor as defined by sequence and structure similarity are grouped into families and superfamilies. The molecular function of enzymes is defined as their ability to catalyze biochemical reactions; it is manually classified by the Enzyme Commission and robust approaches to quantitatively compare catalytic reactions are just beginning to appear. Here, we present an overview of studies at the interface of the evolution and function of enzymes. Copyright © 2015 Biophysical Society. Published by Elsevier Inc. All rights reserved.
YTPdb: a wiki database of yeast membrane transporters.
Brohée, Sylvain; Barriot, Roland; Moreau, Yves; André, Bruno
2010-10-01
Membrane transporters constitute one of the largest functional categories of proteins in all organisms. In the yeast Saccharomyces cerevisiae, this represents about 300 proteins (approximately 5% of the proteome). We here present the Yeast Transport Protein database (YTPdb), a user-friendly collaborative resource dedicated to the precise classification and annotation of yeast transporters. YTPdb exploits an evolution of the MediaWiki web engine used for popular collaborative databases like Wikipedia, allowing every registered user to edit the data in a user-friendly manner. Proteins in YTPdb are classified on the basis of functional criteria such as subcellular location or their substrate compounds. These classifications are hierarchical, allowing queries to be performed at various levels, from highly specific (e.g. ammonium as a substrate or the vacuole as a location) to broader (e.g. cation as a substrate or inner membranes as location). Other resources accessible for each transporter via YTPdb include post-translational modifications, K(m) values, a permanently updated bibliography, and a hierarchical classification into families. The YTPdb concept can be extrapolated to other organisms and could even be applied for other functional categories of proteins. YTPdb is accessible at http://homes.esat.kuleuven.be/ytpdb/. Copyright © 2010 Elsevier B.V. All rights reserved.
MFIB: a repository of protein complexes with mutual folding induced by binding.
Fichó, Erzsébet; Reményi, István; Simon, István; Mészáros, Bálint
2017-11-15
It is commonplace that intrinsically disordered proteins (IDPs) are involved in crucial interactions in the living cell. However, the study of protein complexes formed exclusively by IDPs is hindered by the lack of data and such analyses remain sporadic. Systematic studies have benefited other types of protein-protein interactions, paving a way from basic science to therapeutics; yet these efforts require reliable datasets that are currently lacking for synergistically folding complexes of IDPs. Here we present the Mutual Folding Induced by Binding (MFIB) database, the first systematic collection of complexes formed exclusively by IDPs. MFIB contains an order of magnitude more data than any dataset used in corresponding studies and offers a wide coverage of known IDP complexes in terms of flexibility, oligomeric composition and protein function from all domains of life. The included complexes are grouped using a hierarchical classification and are complemented with structural and functional annotations. MFIB is backed by a firm development team and infrastructure, and together with possible future community collaboration it will provide the cornerstone for structural and functional studies of IDP complexes. MFIB is freely accessible at http://mfib.enzim.ttk.mta.hu/. The MFIB application is hosted on an Apache web server and was implemented in PHP. To enrich querying features and to enhance backend performance, a MySQL database was also created. Contact: simon.istvan@ttk.mta.hu, meszaros.balint@ttk.mta.hu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
Partial cooperative unfolding in proteins as observed by hydrogen exchange mass spectrometry
Engen, John R.; Wales, Thomas E.; Chen, Shugui; Marzluff, Elaine M.; Hassell, Kerry M.; Weis, David D.; Smithgall, Thomas E.
2013-01-01
Many proteins do not exist in a single rigid conformation. Protein motions, or dynamics, exist and in many cases are important for protein function. The analysis of protein dynamics relies on biophysical techniques that can distinguish simultaneously existing populations of molecules and their rates of interconversion. Hydrogen exchange (HX) detected by mass spectrometry (MS) is contributing to our understanding of protein motions by revealing unfolding and dynamics on a wide timescale, ranging from seconds to hours to days. In this review we discuss HX MS-based analyses of protein dynamics, using our studies of multi-domain kinases as examples. Using HX MS, we have successfully probed protein dynamics and unfolding in the isolated SH3, SH2 and kinase domains of the c-Src and Abl kinase families, as well as the role of inter- and intra-molecular interactions in the global control of kinase function. Coupled with high-resolution structural information, HX MS has proved to be a powerful and versatile tool for the analysis of the conformational dynamics in these kinase systems, and has provided fresh insight regarding the regulatory control of these important signaling proteins. HX MS studies of dynamics are applicable not only to the proteins we illustrate here, but to a very wide range of proteins and protein systems, and should play a role in both classification of and greater understanding of the prevalence of protein motion. PMID:23682200
Kawabata, Takeshi; Nakamura, Haruki
2014-07-28
A protein-bound conformation of a target molecule can be predicted by aligning the target molecule on the reference molecule obtained from the 3D structure of the compound-protein complex. This strategy is called "similarity-based docking". For this purpose, we develop the flexible alignment program fkcombu, which aligns the target molecule based on atomic correspondences with the reference molecule. The correspondences are obtained by the maximum common substructure (MCS) of 2D chemical structures, using our program kcombu. The prediction performance was evaluated using many target-reference pairs of superimposed ligand 3D structures on the same protein in the PDB, with different ranges of chemical similarity. The details of atomic correspondence largely affected the prediction success. We found that topologically constrained disconnected MCS (TD-MCS) with the simple element-based atomic classification provides the best prediction. The crashing potential energy with the receptor protein improved the performance. We also found that the RMSD between the predicted and correct target conformations significantly correlates with the chemical similarities between target-reference molecules. Generally speaking, if the reference and target compounds have more than 70% chemical similarity, then the average RMSD of 3D conformations is <2.0 Å. We compared the performance with a rigid-body molecular alignment program based on volume-overlap scores (ShaEP). Our MCS-based flexible alignment program performed better than the rigid-body alignment program, especially when the target and reference molecules were sufficiently similar.
SELDI-TOF-MS proteomic profiling of serum, urine, and amniotic fluid in neural tube defects.
Liu, Zhenjiang; Yuan, Zhengwei; Zhao, Qun
2014-01-01
Neural tube defects (NTDs) are common birth defects for which specific biomarkers are needed. The purpose of this pilot study was to determine whether protein profiles in NTD mothers differ from those of normal controls using SELDI-TOF-MS. The ProteinChip Biomarker System was used to evaluate 82 maternal serum samples, 78 urine samples and 76 amniotic fluid samples. The validity of the classification tree was then challenged with a blind test set comprising another 20 NTD mothers and 18 controls for serum samples, another 19 NTD mothers and 17 controls for urine samples, and another 20 NTD mothers and 17 controls for amniotic fluid samples. Eight proteins detected in serum samples were up-regulated and four proteins were down-regulated in the NTD group. Four proteins detected in urine samples were up-regulated and one protein was down-regulated in the NTD group. Six proteins detected in amniotic fluid samples were up-regulated and one protein was down-regulated in the NTD group. The classification tree for serum samples separated NTDs from healthy individuals with a sensitivity of 91% and a specificity of 97% in the training set, and with a sensitivity of 90%, a specificity of 97% and a positive predictive value of 95% in the test set. The classification tree for urine samples separated NTDs from controls with a sensitivity of 95% and a specificity of 94% in the training set, and with a sensitivity of 89%, a specificity of 82% and a positive predictive value of 85% in the test set. The classification tree for amniotic fluid samples separated NTDs from controls with a sensitivity of 93% and a specificity of 89% in the training set, and with a sensitivity of 90%, a specificity of 88% and a positive predictive value of 90% in the test set. These results suggest that SELDI-TOF-MS is an additional method for detecting NTD pregnancies.
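The sensitivity, specificity and positive predictive value quoted in this abstract all follow from the four cells of a confusion matrix. The sketch below shows the standard definitions only; the counts in the example are hypothetical and are not taken from the study, whose per-cell counts are not given here.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, positive predictive value).

    tp/fn: cases correctly / incorrectly classified
    tn/fp: controls correctly / incorrectly classified
    """
    sensitivity = tp / (tp + fn)  # fraction of true cases detected
    specificity = tn / (tn + fp)  # fraction of controls correctly rejected
    ppv = tp / (tp + fp)          # fraction of positive calls that are true cases
    return sensitivity, specificity, ppv


# Hypothetical counts for a test set of 20 cases and 18 controls
# (illustrative only, not the study's data).
sens, spec, ppv = diagnostic_metrics(tp=18, fn=2, tn=17, fp=1)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f}")
```

Because the positive predictive value depends on the ratio of cases to controls in the evaluated set, it cannot be carried over to a population with a different prevalence without recomputation.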
How Does Your Protein Fold? Elucidating the Apomyoglobin Folding Pathway
Dyson, H. Jane; Wright, Peter E.
2017-01-01
Although each type of protein fold and in some cases individual proteins within a fold classification can have very different mechanisms of folding, the underlying biophysical and biochemical principles that operate to cause a linear polypeptide chain to fold into a globular structure must be the same. In an aqueous solution, the protein takes up the thermodynamically most stable structure, but the pathway along which the polypeptide proceeds in order to reach that structure is a function of the amino acid sequence, which must be the final determining factor, not only in shaping the final folded structure, but in dictating the folding pathway. A number of groups have focused on a single protein or group of proteins, to determine in detail the factors that influence the rate and mechanism of folding in a defined system, with the hope that hypothesis-driven experiments can elucidate the underlying principles governing the folding process. Our research group has focused on the folding of the globin family of proteins, and in particular on the monomeric protein apomyoglobin. Apomyoglobin (apoMb) folds relatively slowly (~2 seconds) via an ensemble of obligatory intermediates that form rapidly after the initiation of folding. The folding pathway can be dissected using rapid-mixing techniques, which can probe processes in the millisecond time range. Stopped-flow measurements detected by circular dichroism (CD) or fluorescence spectroscopy give information on the rates of folding events. Quench-flow experiments utilize the differential rates of hydrogen-deuterium exchange of amide protons protected in parts of the structure that are folded early; protection of amides can be detected by mass spectrometry or proton nuclear magnetic resonance spectroscopy (NMR).
In addition, apoMb forms an intermediate at equilibrium at pH ~ 4, which is sufficiently stable for it to be structurally characterized by solution methods such as CD, fluorescence and NMR spectroscopies, and the conformational ensembles formed in the presence of denaturing agents and low pH can be characterized as models for the unfolded states of the protein. Newer NMR techniques such as measurement of residual dipolar couplings in the various partly folded states, and relaxation dispersion measurements to probe invisible states present at low concentrations, have contributed to providing a detailed picture of the apomyoglobin folding pathway. The research summarized in this review was aimed at characterizing and comparing the equilibrium and kinetic intermediates both structurally and dynamically, as well as delineating the complete folding pathway at a residue-specific level, in order to answer the question “What is it about the amino acid sequence that causes each molecule in the unfolded protein ensemble to start folding, and, once started, to proceed towards the formation of the correctly folded three-dimensional structure?” PMID:28032989
Tran, Tran T; Kulis, Christina; Long, Steven M; Bryant, Darryn; Adams, Peter; Smythe, Mark L
2010-11-01
Medicinal chemists synthesize arrays of molecules by attaching functional groups to scaffolds. There is evidence suggesting that some scaffolds yield biologically active molecules more often than others; these are termed privileged substructures. One role of the scaffold is to present its side-chains for molecular recognition, and biologically relevant scaffolds may present side-chains in biologically relevant geometries or shapes. Since drug discovery is primarily focused on the discovery of compounds that bind to proteinaceous targets, we have been deciphering the scaffold shapes that are used for binding proteins, as they reflect biologically relevant shapes. To decipher the scaffold architecture that is important for binding protein surfaces, we have analyzed the scaffold architecture of protein loops, which are defined in this context as continuous four-residue segments of a protein chain that are not part of an α-helix or β-strand secondary structure. Loops are an important molecular recognition motif of proteins. We have found that 39 clusters reflect the scaffold architecture of 89% of the 23,331 loops in the dataset, with average intra-cluster and inter-cluster RMSD of 0.47 and 1.91, respectively. These protein loop scaffolds all have distinct shapes. We have used these 39 clusters that reflect the scaffold architecture of protein loops as biological descriptors. This involved generation of a small dataset of scaffold-based peptidomimetics. We found that peptidomimetic scaffolds with reported biological activities matched loop scaffold geometries, whereas those with no reported biological activities did not. This preliminary evidence suggests that organic scaffolds that closely match the preferred loop scaffolds of proteins are more likely to be biologically relevant.
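The intra-cluster and inter-cluster figures above are root-mean-square deviations between matched sets of atomic coordinates. As a minimal sketch, assuming the two coordinate sets are already superimposed and listed in the same atom order (the abstract does not say how the loops were aligned or which atoms were compared), RMSD can be computed as follows; the function name and example coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z).

    Assumes the structures are already superimposed and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Illustrative four-point "loop" and a copy shifted by 1 unit along x.
loop_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0), (4.5, 0.5, 1.0)]
loop_b = [(x + 1.0, y, z) for x, y, z in loop_a]
print(rmsd(loop_a, loop_b))
```

A low average RMSD within a cluster together with a higher RMSD between clusters is what indicates that the clusters capture distinct shapes.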
Lisowska-Myjak, B; Skarżyńska, E; Bakun, M
2018-06-01
Intrauterine environmental factors can be associated with perinatal complications and long-term health outcomes although the underlying mechanisms remain poorly defined. Meconium formed exclusively in utero and passed naturally by a neonate may contain proteins which characterise the intrauterine environment. The aim of the study was proteomic analysis of the composition of meconium proteins and their classification by biological function. Proteomic techniques combining isoelectrofocussing fractionation and LC-MS/MS analysis were used to study the protein composition of a meconium sample obtained by pooling 50 serial meconium portions from 10 healthy full-term neonates. The proteins were classified by function based on the literature search for each protein in the PubMed database. A total of 946 proteins were identified in the meconium, including 430 proteins represented by two or more peptides. When the proteins were classified by their biological function the following were identified: immunoglobulin fragments and enzymatic, neutrophil-derived, structural and fetal intestine-specific proteins. Meconium is a rich source of proteins deposited in the fetal intestine during its development in utero. A better understanding of their specific biological functions in the intrauterine environment may help to identify these proteins which may serve as biomarkers associated with specific clinical conditions/diseases with the possible impact on the fetal development and further health consequences in infants, older children and adults.
Predicting Flavonoid UGT Regioselectivity
Jackson, Rhydon; Knisley, Debra; McIntosh, Cecilia; Pfeiffer, Phillip
2011-01-01
Machine learning was applied to a challenging and biologically significant protein classification problem: the prediction of flavonoid UGT acceptor regioselectivity from primary sequence. Novel indices characterizing graphical models of residues were proposed and found to be widely distributed among existing amino acid indices and to cluster residues appropriately. UGT subsequences biochemically linked to regioselectivity were modeled as sets of index sequences. Several learning techniques incorporating these UGT models were compared with classifications based on standard sequence alignment scores. These techniques included an application of time series distance functions to protein classification. Time series distances defined on the index sequences were used in nearest neighbor and support vector machine classifiers. Additionally, Bayesian neural network classifiers were applied to the index sequences. The experiments identified improvements over the nearest neighbor and support vector machine classifications relying on standard alignment similarity scores, as well as strong correlations between specific subsequences and regioselectivities. PMID:21747849
Saetchnikov, Vladimir A.; Tcherniavskaia, Elina A.; Schweiger, Gustav; Ostendorf, Andreas
2011-07-01
A novel technique for the label-free analysis of micro- and nanoparticles, including biomolecules, using optical microcavity resonance of whispering-gallery-type modes is being developed. Various schemes of the method using both standard and specially produced microspheres have been investigated with a view to further development for microbial applications. It was demonstrated that optical resonance under optimal geometry could be detected at a laser power of less than 1 microwatt. The sensitivity of the developed schemes has been tested by monitoring the spectral shift of the whispering gallery modes. Water solutions of ethanol, ascorbic acid, blood phantoms including albumin and HCl, glucose, biotin, and biomarkers such as C-reactive protein, as well as bacteria and virus phantoms (gels of silica micro- and nanoparticles), have been used. The structure of the resonance spectra of the solutions was a specific subject of investigation. A probabilistic neural network classifier for biological agents and micro/nanoparticles has been developed. Several parameters of the resonance spectra, such as spectral shift, broadening, diffuseness and others, have been used as input parameters to develop a network classifier for micro- and nanoparticles and biological agents in solution. A classification probability of approximately 98% for the probes under investigation has been achieved. The developed approach has been demonstrated to be a promising technology platform for a sensitive, lab-on-chip type sensor which can be used for the development of diagnostic tools for different biological molecules, e.g. proteins, oligonucleotides, oligosaccharides, lipids, small molecules, viral particles and cells, as well as in different experimental contexts, e.g. proteomics, genomics, drug discovery, and membrane studies.
Li, ZhiLiang; Wu, ShiRong; Chen, ZeCong; Ye, Nancy; Yang, ShengXi; Liao, ChunYang; Zhang, MengJun; Yang, Li; Mei, Hu; Yang, Yan; Zhao, Na; Zhou, Yuan; Zhou, Ping; Xiong, Qing; Xu, Hong; Liu, ShuShen; Ling, ZiHua; Chen, Gang; Li, GenRong
2007-10-01
A new set of descriptors, the molecular electronegativity edge-distance vector (VMED), derived solely from the primary structures of peptides, was proposed and applied to describing and characterizing the molecular structures of oligopeptides and polypeptides. It is based on the electronegativity of each atom or the electronic charge index (ECI) of atomic clusters and the bonding distance between atom pairs. Here, the molecular structures of antigenic polypeptides were expressed with these descriptors in order to propose an automated technique for the computerized identification of helper T lymphocyte (Th) epitopes. Furthermore, a modified MED vector was proposed from the primary structures of polypeptides, based on the ECI and the relative bonding distance of the fundamental skeleton groups. The side-chain of each amino acid was treated here as a pseudo-atom. The developed VMED was easy to calculate and practical to apply. A quantitative model was established for 28 immunogenic or antigenic polypeptides (AGPP) with 14 (1-14) A(d) and 14 other restricted activities assigned as "1"(+) and "0"(-), respectively. The latter comprised 6 A(b)(15-20), 3 A(k)(21-23), 2 E(k)(24-26), 2 H-2(k)(27 and 28) restricted sequences. Good results were obtained with 90% correct classification (only 2 wrong for 20 training samples) and 100% correct prediction (none wrong for 8 testing samples); for comparison, 100% correct classification (none wrong for 20 training samples) and 88% correct classification (1 wrong for 8 testing samples) were also obtained. Both stochastic sampling and cross-validation were performed to demonstrate good performance. The described method may also be suitable for estimating and predicting class I and II major histocompatibility antigen (MHC) epitopes in humans. It will be useful in immune identification and recognition of proteins and genes and in the design and development of subunit vaccines.
Several quantitative structure activity relationship (QSAR) models were developed for various oligopeptides and polypeptides including 58 dipeptides and 31 pentapeptides with angiotensin converting enzyme (ACE) inhibition by multiple linear regression (MLR) method. In order to explain the ability to characterize molecular structure of polypeptides, a molecular modeling investigation on QSAR was performed for functional prediction of polypeptide sequences with antigenic activity and heptapeptide sequences with tachykinin activity through quantitative sequence-activity models (QSAMs) by the molecular electronegativity edge-distance vector (VMED). The results showed that VMED exhibited both excellent structural selectivity and good activity prediction. Moreover, the results showed that VMED behaved quite well for both QSAR and QSAM of poly-and oligopeptides, which exhibited both good estimation ability and prediction power, equal to or better than those reported in the previous references. Finally, a preliminary conclusion was drawn: both classical and modified MED vectors were very useful structural descriptors. Some suggestions were proposed for further studies on QSAR/QSAM of proteins in various fields.
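The classification percentages reported in this abstract follow directly from the stated counts of misclassified samples. The short check below reproduces that arithmetic; the helper name is illustrative, and only the sample counts come from the text.

```python
def accuracy(n_total, n_wrong):
    """Fraction of samples classified correctly."""
    return (n_total - n_wrong) / n_total


# Counts as stated in the abstract.
train_acc = accuracy(20, 2)  # 2 wrong out of 20 training samples
test_acc = accuracy(8, 1)    # 1 wrong out of 8 testing samples

print(f"training: {train_acc:.1%}")
print(f"testing:  {test_acc:.1%}")
```

With only 8 testing samples, a single misclassification changes the accuracy by 12.5 percentage points, so the reported 88% is 7 of 8 (87.5%) after rounding.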
Shamim, Thorakkal
2013-09-01
Iatrogenic lesions, induced by the dentist's activity, manner or therapy, can affect both hard and soft tissues in the oral cavity. The literature contains no approved simple working classification for iatrogenic lesions of teeth and associated structures in the oral cavity. A simple working classification is proposed here for iatrogenic lesions of teeth and associated structures in the oral cavity, based on their relation to the dental specialities. The dental specialities considered in this classification are conservative dentistry and endodontics, orthodontics, oral and maxillofacial surgery and prosthodontics. This classification will be useful for the dental clinician who deals with diseases of the oral cavity.
Defining functional distance using manifold embeddings of gene ontology annotations
Lerman, Gilad; Shakhnovich, Boris E.
2007-01-01
Although rigorous measures of similarity for sequence and structure are now well established, the problem of defining functional relationships has been particularly daunting. Here, we present several manifold embedding techniques to compute distances between Gene Ontology (GO) functional annotations and consequently estimate functional distances between protein domains. To evaluate accuracy, we correlate the functional distance to the well established measures of sequence, structural, and phylogenetic similarities. Finally, we show that manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space. We show how functional distances place structure–function relationships in biological context resulting in insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized and applied to a wide array of biologically relevant investigations, such as accuracy of annotation transference, the relationship between sequence, structure, and function, or coherence of expression modules. PMID:17595300
The Protein Information Resource: an integrated public resource of functional annotation of proteins
Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.
2002-01-01
The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247
Armutlu, Pelin; Ozdemir, Muhittin E; Uney-Yuksektepe, Fadime; Kavakli, I Halil; Turkay, Metin
2008-10-03
A priori analysis of the activity of drugs on the target protein by computational approaches can be useful in narrowing down drug candidates for further experimental tests. Currently, there are a large number of computational methods that predict the activity of drugs on proteins. In this study, we approach the activity prediction problem as a classification problem, and we aim to improve the classification accuracy by introducing an algorithm that combines partial least squares regression with a mixed-integer programming based hyper-boxes classification method, where drug molecules are classified as low active or high active regarding their binding activity (IC50 values) on target proteins. We also aim to determine the most significant molecular descriptors for the drug molecules. We first apply our approach by analyzing the activities of widely known inhibitor datasets including Acetylcholinesterase (ACHE), Benzodiazepine Receptor (BZR), Dihydrofolate Reductase (DHFR), and Cyclooxygenase-2 (COX-2) with known IC50 values. The results at this stage proved that our approach consistently gives better classification accuracies compared to 63 other reported classification methods such as SVM and Naïve Bayes, where we were able to predict the experimentally determined IC50 values with a worst case accuracy of 96%. To further test applicability of this approach we first created a dataset for Cytochrome P450 C17 inhibitors and then predicted their activities with 100% accuracy. Our results indicate that this approach can be utilized to predict the inhibitory effects of inhibitors based on their molecular descriptors. This approach will not only enhance the drug discovery process, but also save time and resources committed.
A novel and efficient technique for identification and classification of GPCRs.
Gupta, Ravi; Mittal, Ankush; Singh, Kuldip
2008-07-01
G-protein coupled receptors (GPCRs) play a vital role in different biological processes, such as regulation of growth, death, and metabolism of cells. GPCRs are the focus of a significant amount of current pharmaceutical research since they interact with more than 50% of prescription drugs. The dipeptide-based support vector machine (SVM) approach is the most accurate technique to identify and classify the GPCRs. However, this approach has two major disadvantages. First, the dimension of the dipeptide-based feature vector is equal to 400. The large dimension makes the classification task inefficient in both computation and memory. Second, it does not consider the biological properties of the protein sequence for identification and classification of GPCRs. In this paper, we present a novel-feature-based SVM classification technique. The novel features are derived by applying a wavelet-based time series analysis approach to protein sequences. The proposed feature space summarizes the variance information of seven important biological properties of amino acids in a protein sequence. In addition, the dimension of the feature vector for the proposed technique is equal to 35. Experiments were performed on GPCR protein sequences available at the GPCRs Database. Our approach achieves an accuracy of 99.9%, 98.06%, 97.78%, and 94.08% for GPCR superfamily, families, subfamilies, and subsubfamilies (amine group), respectively, when evaluated using fivefold cross-validation. Further, an accuracy of 99.8%, 97.26%, and 97.84% was obtained when evaluated on unseen or recall datasets of GPCR superfamily, families, and subfamilies, respectively. Comparison with the dipeptide-based SVM technique shows the effectiveness of our approach.
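The 400-dimensional dipeptide feature vector mentioned in the abstract above follows from counting every ordered pair of the 20 standard amino acids (20 x 20 = 400). The sketch below is only an illustration of that baseline representation, not the authors' code: the function name `dipeptide_composition`, the normalisation to frequencies, and the choice to skip pairs containing non-standard residues are all assumptions made here.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide frequencies.

    Hypothetical helper: overlapping pairs are counted, and pairs that
    contain a non-standard residue are skipped (an assumption).
    """
    counts = [0] * len(DIPEPTIDES)
    total = 0
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        idx = INDEX.get(seq[i:i + 2])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [c / total for c in counts]
```

A vector of this kind could be passed to any classifier. The abstract's point is that a 35-dimensional feature space can replace it at much lower cost.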
Lauber, Chris
2012-01-01
Virus taxonomy has received little attention from the research community despite its broad relevance. In an accompanying paper (C. Lauber and A. E. Gorbalenya, J. Virol. 86:3890–3904, 2012), we have introduced a quantitative approach to hierarchically classify viruses of a family using pairwise evolutionary distances (PEDs) as a measure of genetic divergence. When applied to the six most conserved proteins of the Picornaviridae, it clustered 1,234 genome sequences in groups at three hierarchical levels (to which we refer as the “GENETIC classification”). In this study, we compare the GENETIC classification with the expert-based picornavirus taxonomy and outline differences in the underlying frameworks regarding the relation of virus groups and genetic diversity that represent, respectively, the structure and content of a classification. To facilitate the analysis, we introduce two novel diagrams. The first connects the genetic diversity of taxa to both the PED distribution and the phylogeny of picornaviruses. The second depicts a classification and the accommodated genetic diversity in a standardized manner. Generally, we found striking agreement between the two classifications on species and genus taxa. A few disagreements concern the species Human rhinovirus A and Human rhinovirus C and the genus Aphthovirus, which were split in the GENETIC classification. Furthermore, we propose a new supergenus level and universal, level-specific PED thresholds, not reached yet by many taxa. Since the species threshold is approached mostly by taxa with large sampling sizes and those infecting multiple hosts, it may represent an upper limit on divergence, beyond which homologous recombination in the six most conserved genes between two picornaviruses might not give viable progeny. PMID:22278238
Topological side-chain classification of beta-turns: ideal motifs for peptidomimetic development.
Tran, Tran Trung; McKie, Jim; Meutermans, Wim D F; Bourne, Gregory T; Andrews, Peter R; Smythe, Mark L
2005-08-01
Beta-turns are important topological motifs for biological recognition of proteins and peptides. Organic molecules that sample the side chain positions of beta-turns have shown broad binding capacity to multiple different receptors, for example benzodiazepines. Beta-turns have traditionally been classified into various types based on the backbone dihedral angles (phi2, psi2, phi3 and psi3). Indeed, 57-68% of beta-turns are currently classified into 8 different backbone families (Type I, Type II, Type I', Type II', Type VIII, Type VIa1, Type VIa2 and Type VIb and Type IV which represents unclassified beta-turns). Although this classification of beta-turns has been useful, the resulting beta-turn types are not ideal for the design of beta-turn mimetics as they do not reflect topological features of the recognition elements, the side chains. To overcome this, we have extracted beta-turns from a data set of non-homologous and high-resolution protein crystal structures. The side chain positions, as defined by C(alpha)-C(beta) vectors, of these turns have been clustered using the kth nearest neighbor clustering and filtered nearest centroid sorting algorithms. Nine clusters were obtained that cluster 90% of the data, and the average intra-cluster RMSD of the four C(alpha)-C(beta) vectors is 0.36. The nine clusters therefore represent the topology of the side chain scaffold architecture of the vast majority of beta-turns. The mean structures of the nine clusters are useful for the development of beta-turn mimetics and as biological descriptors for focusing combinatorial chemistry towards biologically relevant topological space.
Sevel, Landrew S; Boissoneault, Jeff; Letzen, Janelle E; Robinson, Michael E; Staud, Roland
2018-05-30
Chronic fatigue syndrome (CFS) is a disorder associated with fatigue, pain, and structural/functional abnormalities seen during magnetic resonance brain imaging (MRI). Therefore, we evaluated the performance of structural MRI (sMRI) abnormalities in the classification of CFS patients versus healthy controls and compared it to machine learning (ML) classification based upon self-report (SR). Participants included 18 CFS patients and 15 healthy controls (HC). All subjects underwent T1-weighted sMRI and provided visual analogue-scale ratings of fatigue, pain intensity, anxiety, depression, anger, and sleep quality. sMRI data were segmented using FreeSurfer and 61 regions based on functional and structural abnormalities previously reported in patients with CFS. Classification was performed in RapidMiner using a linear support vector machine and bootstrap optimism correction. We compared ML classifiers based on (1) 61 a priori sMRI regional estimates and (2) SR ratings. The sMRI model achieved 79.58% classification accuracy. The SR (accuracy = 95.95%) outperformed both sMRI models. Estimates from multiple brain areas related to cognition, emotion, and memory contributed strongly to group classification. This is the first ML-based group classification of CFS. Our findings suggest that sMRI abnormalities are useful for discriminating CFS patients from HC, but SR ratings remain most effective in classification tasks.
Literature classification for semi-automated updating of biological knowledgebases
2013-01-01
Background: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases. PMID:24564403
NASA Astrophysics Data System (ADS)
Zhao, Bei; Zhong, Yanfei; Zhang, Liangpei
2016-06-01
Land-use classification of very high spatial resolution remote sensing (VHSR) imagery is one of the most challenging tasks in the field of remote sensing image processing. However, land-use classification is hard to address with land-cover classification techniques, due to the complexity of the land-use scenes. Scene classification is considered to be one of the expected ways to address the land-use classification issue. The commonly used scene classification methods for VHSR imagery are all derived from the computer vision community, which mainly deals with terrestrial image recognition. Differing from terrestrial images, VHSR images are taken by looking down with airborne and spaceborne sensors, which leads to distinct lighting conditions and spatial configurations of land cover in VHSR imagery. Considering these distinct characteristics, two questions should be answered: (1) Which type or combination of information is suitable for VHSR imagery scene classification? (2) Which scene classification algorithm is best for VHSR imagery? In this paper, an efficient spectral-structural bag-of-features scene classifier (SSBFC) is proposed to combine the spectral and structural information of VHSR imagery. SSBFC utilizes the first- and second-order statistics (the mean and standard deviation values, MeanStd) as the statistical spectral descriptor for the spectral information of the VHSR imagery, and uses dense scale-invariant feature transform (SIFT) as the structural feature descriptor. From the experimental results, the spectral information works better than the structural information, while the combination of the spectral and structural information is better than any single type of information. Taking the characteristic of the spatial configuration into consideration, SSBFC uses the whole image scene as the scope of the pooling operator, instead of the scope generated by a spatial pyramid (SP) commonly used in terrestrial image classification.
The experimental results show that the whole image as the scope of the pooling operator performs better than the scope generated by SP. In addition, SSBFC codes and pools the spectral and structural features separately to avoid mutual interruption between the spectral and structural features. The coding vectors of spectral and structural features are then concatenated into a final coding vector. Finally, SSBFC classifies the final coding vector by support vector machine (SVM) with a histogram intersection kernel (HIK). Compared with the latest scene classification methods, the experimental results with three VHSR datasets demonstrate that the proposed SSBFC performs better than the other classification methods for VHSR image scenes.
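The last two steps described above (concatenating the separately pooled spectral and structural coding vectors, then comparing scenes with a histogram intersection kernel) can be sketched as follows. This is a minimal illustration and not the SSBFC implementation: the function names and the toy histograms are hypothetical. The kernel itself is the standard definition, K(x, y) = sum over i of min(x_i, y_i).

```python
def concatenate_codes(spectral, structural):
    """Join the two pooled coding vectors into one final coding vector."""
    return list(spectral) + list(structural)


def histogram_intersection(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same length")
    return sum(min(a, b) for a, b in zip(x, y))


def gram_matrix(vectors):
    """Kernel matrix of the kind a precomputed-kernel SVM would consume."""
    return [[histogram_intersection(u, v) for v in vectors] for u in vectors]


# Hypothetical pooled histograms for two scenes (values are made up).
scene_a = concatenate_codes([0.5, 0.5], [0.0, 1.0])
scene_b = concatenate_codes([0.25, 0.75], [0.5, 0.5])
similarity = histogram_intersection(scene_a, scene_b)
```

For non-negative histograms the kernel is symmetric and is largest when a histogram is compared with itself, which is why it is a natural similarity measure for bag-of-features vectors.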
Exarchos, Konstantinos P; Exarchos, Themis P; Rigas, Georgios; Papaloukas, Costas; Fotiadis, Dimitrios I
2011-05-10
In peptides and proteins, only a small percentage of peptide bonds adopt the cis configuration. Especially in the case of amide peptide bonds, the number of cis conformations is quite limited, which until recently hampered systematic studies. However, the growing number of databases with more 3D structures of proteins has lately produced a considerable number of sequences containing non-proline cis formations (cis-nonPro). In our work, we extract regular expression-type patterns that are descriptive of regions surrounding the cis-nonPro formations. For this purpose, three types of pattern discovery are performed: i) exact pattern discovery, ii) pattern discovery using a chemical equivalency set, and iii) pattern discovery using a structural equivalency set. Afterwards, using each pattern as a predicate, we search the Eukaryotic Linear Motif (ELM) resource to identify potential functional implications of regions with cis-nonPro peptide bonds. The patterns extracted from each type of pattern discovery are further employed in order to formulate a pattern-based classifier, which is used to discriminate between cis-nonPro and trans-nonPro formations. In terms of functional implications, we observe a significant association of cis-nonPro peptide bonds with ligand/binding functionalities. As for the pattern-based classification scheme, the best results were obtained using the structural equivalency set, which yielded 70% accuracy, 77% sensitivity and 63% specificity.
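The three figures reported above are linked through the confusion matrix. The sketch below shows the standard definitions. The counts in the example are hypothetical: the abstract does not give class sizes, so a balanced test set of 100 cis and 100 trans cases is assumed purely for illustration. Under that assumption, 77% sensitivity and 63% specificity give (77 + 63) / 200 = 70% accuracy, which matches the reported values.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion counts.

    tp/fn count the positive class (here: cis-nonPro),
    tn/fp count the negative class (here: trans-nonPro).
    """
    sensitivity = tp / (tp + fn)   # fraction of positives recovered
    specificity = tn / (tn + fp)   # fraction of negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical balanced test set: 100 cis and 100 trans cases.
acc, sens, spec = classification_metrics(tp=77, fn=23, tn=63, fp=37)
```

With unbalanced classes the accuracy would instead be a weighted average of sensitivity and specificity, so the three numbers together carry more information than accuracy alone.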
LiCABEDS II. Modeling of ligand selectivity for G-protein-coupled cannabinoid receptors.
Ma, Chao; Wang, Lirong; Yang, Peng; Myint, Kyaw Z; Xie, Xiang-Qun
2013-01-28
The cannabinoid receptor subtype 2 (CB2) is a promising therapeutic target for blood cancer, pain relief, osteoporosis, and immune system disease. The recent withdrawal of Rimonabant, which targets another closely related cannabinoid receptor (CB1), accentuates the importance of selectivity for the development of CB2 ligands in order to minimize their effects on the CB1 receptor. In our previous study, LiCABEDS (Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps) was reported as a generic ligand classification algorithm for the prediction of categorical molecular properties. Here, we report extension of the application of LiCABEDS to the modeling of cannabinoid ligand selectivity with molecular fingerprints as descriptors. The performance of LiCABEDS was systematically compared with another popular classification algorithm, support vector machine (SVM), according to prediction precision and recall rate. In addition, the examination of LiCABEDS models revealed the difference in structure diversity of CB1 and CB2 selective ligands. The structure determination from data mining could be useful for the design of novel cannabinoid lead compounds. More importantly, the potential of LiCABEDS was demonstrated through successful identification of newly synthesized CB2 selective compounds.
Urasaki, Yasuyo; Fiscus, Ronald R; Le, Thuc T
2016-04-01
We describe an alternative approach to classifying fatty liver by profiling protein post-translational modifications (PTMs) with high-throughput capillary isoelectric focusing (cIEF) immunoassays. Four strains of mice were studied, with fatty livers induced by different causes, such as ageing, genetic mutation, acute drug usage, and high-fat diet. Nutrient-sensitive PTMs of a panel of 12 liver metabolic and signalling proteins were simultaneously evaluated with cIEF immunoassays, using nanograms of total cellular protein per assay. Changes to liver protein acetylation, phosphorylation, and O-N-acetylglucosamine glycosylation were quantified and compared between normal and diseased states. Fatty liver tissues could be distinguished from one another by distinctive protein PTM profiles. Fatty liver is currently classified by morphological assessment of lipid droplets, without identifying the underlying molecular causes. In contrast, high-throughput profiling of protein PTMs has the potential to provide molecular classification of fatty liver. Copyright © 2016 Pathological Society of Great Britain and Ireland. Published by John Wiley & Sons, Ltd.
Aamir, Mohd; Singh, Vinay K.; Meena, Mukesh; Upadhyay, Ram S.; Gupta, Vijai K.; Singh, Surendra
2017-01-01
The WRKY transcription factors (TFs) play a crucial role in plant defense response against various abiotic and biotic stresses. The role of WRKY3 and WRKY4 genes in plant defense response against necrotrophic pathogens is well-reported. However, their functional annotation in tomato is largely unknown. In the present work, we have characterized the structural and functional attributes of the two identified tomato WRKY transcription factors, WRKY3 (SlWRKY3), and WRKY4 (SlWRKY4) using computational approaches. Arabidopsis WRKY3 (AtWRKY3: NP_178433) and WRKY4 (AtWRKY4: NP_172849) protein sequences were retrieved from the TAIR database and protein BLAST was done for finding their sequential homologs in tomato. Sequence alignment, phylogenetic classification, and motif composition analysis revealed remarkable sequential variation between these two WRKYs. The tomato WRKY3 and WRKY4 cluster with Solanum pennellii, showing the monophyletic origin and evolution from their wild homolog. The functional domain regions responsible for sequence-specific DNA binding in both proteins were modeled [using AtWRKY4 (PDB ID:1WJ2) and AtWRKY1 (PDBID:2AYD) as template protein structures] through homology modeling using Discovery Studio 3.0. The generated models were further evaluated for their accuracy and reliability based on qualitative and quantitative parameters. The modeled proteins were found to satisfy all the crucial energy parameters and showed acceptable Ramachandran statistics when compared to the experimentally resolved NMR solution structures and/or X-Ray diffracted crystal structures (templates). The superimposition of the functional WRKY domains from SlWRKY3 and SlWRKY4 revealed remarkable structural similarity. The sequence-specific DNA binding for the two WRKYs was explored through DNA-protein interaction using the Hex Docking server.
The interaction studies found that SlWRKY4 binds the W-box DNA through WRKYGQK together with Tyr408, Arg409, and Lys419, with the initial flanking sequences also involved in binding. In contrast, SlWRKY3 interacted through RKYGQK along with residues from the zinc finger motifs. Protein-protein interaction studies were done using STRING version 10.0 to explore all the possible protein partners involved in associative functional interaction networks. The Gene Ontology enrichment analysis revealed the functional dimension and characterized the identified WRKYs based on their functional annotation. PMID:28611792
A structural classification for inland northwest forest vegetation.
Kevin L. O' Hara; Penelope A. Latham; Paul Hessburg; Bradley G. Smith
1996-01-01
Existing approaches to vegetation classification range from those based on potential vegetation to others based on existing vegetation composition, or existing structural or physiognomic characteristics. Examples of these classifications are numerous, and in some cases, date back hundreds of years (Mueller-Dombois and Ellenberg 1974). Small-scale or stand level...
Towards a Collaborative Intelligent Tutoring System Classification Scheme
ERIC Educational Resources Information Center
Harsley, Rachel
2014-01-01
This paper presents a novel classification scheme for Collaborative Intelligent Tutoring Systems (CITS), an emergent research field. The three emergent classifications of CITS are unstructured, semi-structured, and fully structured. While all three types of CITS offer opportunities to improve student learning gains, the full extent to which these…
König, Caroline; Alquézar, René; Vellido, Alfredo; Giraldo, Jesús
2018-03-01
G-protein-coupled receptors (GPCRs) are a large and diverse super-family of eukaryotic cell membrane proteins that play an important physiological role as transmitters of extracellular signal. In this paper, we investigate Class C, a member of this super-family that has attracted much attention in pharmacology. The limited knowledge about the complete 3D crystal structure of Class C receptors makes necessary the use of their primary amino acid sequences for analytical purposes. Here, we provide a systematic analysis of distinct receptor sequence segments with regard to their ability to differentiate between seven class C GPCR subtypes according to their topological location in the extracellular, transmembrane, or intracellular domains. We build on the results from the previous research that provided preliminary evidence of the potential use of separated domains of complete class C GPCR sequences as the basis for subtype classification. The use of the extracellular N-terminus domain alone was shown to result in a minor decrease in subtype discrimination in comparison with the complete sequence, despite discarding much of the sequence information. In this paper, we describe the use of Support Vector Machine-based classification models to evaluate the subtype-discriminating capacity of the specific topological sequence segments.
Marini, Joan C; Forlino, Antonella; Bächinger, Hans Peter; Bishop, Nick J; Byers, Peter H; Paepe, Anne De; Fassier, Francois; Fratzl-Zelman, Nadja; Kozloff, Kenneth M; Krakow, Deborah; Montpetit, Kathleen; Semler, Oliver
2017-08-18
Skeletal deformity and bone fragility are the hallmarks of the brittle bone dysplasia osteogenesis imperfecta. The diagnosis of osteogenesis imperfecta usually depends on family history and clinical presentation characterized by a fracture (or fractures) during the prenatal period, at birth or in early childhood; genetic tests can confirm diagnosis. Osteogenesis imperfecta is caused by dominant autosomal mutations in the type I collagen coding genes (COL1A1 and COL1A2) in about 85% of individuals, affecting collagen quantity or structure. In the past decade, (mostly) recessive, dominant and X-linked defects in a wide variety of genes encoding proteins involved in type I collagen synthesis, processing, secretion and post-translational modification, as well as in proteins that regulate the differentiation and activity of bone-forming cells have been shown to cause osteogenesis imperfecta. The large number of causative genes has complicated the classic classification of the disease, and although a new genetic classification system is widely used, it is still debated. Phenotypic manifestations in many organs, in addition to bone, are reported, such as abnormalities in the cardiovascular and pulmonary systems, skin fragility, muscle weakness, hearing loss and dentinogenesis imperfecta. Management involves surgical and medical treatment of skeletal abnormalities, and treatment of other complications. More innovative approaches based on gene and cell therapy, and signalling pathway alterations, are under investigation.
Xu, Ruirui; Liu, Caiyun; Li, Ning; Zhang, Shizhong
2016-12-01
Argonaute (AGO) proteins, which are found in yeast, animals, and plants, are the core molecules of the RNA-induced silencing complex. These proteins play important roles in plant growth, development, and responses to biotic stresses. The complete analysis and classification of the AGO gene family have been recently reported in different plants. Nevertheless, systematic analysis and expression profiling of these genes have not been performed in apple (Malus domestica). Approximately 15 AGO genes were identified in the apple genome. The phylogenetic tree, chromosome location, conserved protein motifs, gene structure, and expression of the AGO gene family in apple were analyzed for gene prediction. All AGO genes were phylogenetically clustered into four groups (i.e., AGO1, AGO4, MEL1/AGO5, and ZIPPY/AGO7) with the AGO genes of Arabidopsis. These groups of the AGO gene family were statistically analyzed and compared among 31 plant species. The predicted apple AGO genes are distributed across nine chromosomes at different densities and include three segment duplications. Expression studies indicated that 15 AGO genes exhibit different expression patterns in at least one of the tissues tested. Additionally, analysis of gene expression levels indicated that the genes are mostly involved in responses to NaCl, PEG, heat, and low-temperature stresses. Hence, several candidate AGO genes are involved in different aspects of physiological and developmental processes and may play an important role in abiotic stress responses in apple. To the best of our knowledge, this study is the first to report a comprehensive analysis of the apple AGO gene family. Our results provide useful information to understand the classification and putative functions of these proteins, especially for gene members that may play important roles in abiotic stress responses in M. hupehensis.
Predicting permanent and transient protein-protein interfaces.
La, David; Kong, Misun; Hoffman, William; Choi, Youn Im; Kihara, Daisuke
2013-05-01
Protein-protein interactions (PPIs) are involved in diverse functions in a cell. To optimize functional roles of interactions, proteins interact with a spectrum of binding affinities. Interactions are conventionally classified into permanent and transient, where the former denotes tight binding between proteins that results in strong complexes, whereas the latter is composed of relatively weak interactions that can dissociate after binding to regulate functional activity at specific time points. Knowing the type of interaction has significant implications for understanding the nature and function of PPIs. In this study, we constructed amino acid substitution models that capture mutation patterns at permanent and transient types of protein interfaces, which were found to be different with statistical significance. Using the substitution models, we developed a novel computational method that predicts permanent and transient protein binding interfaces (PBIs) in protein surfaces. Without knowledge of the interacting partner, the method uses a single query protein structure and a multiple sequence alignment of the sequence family. Using a large dataset of permanent and transient proteins, we show that our method, BindML+, performs very well in protein interface classification. A very high area under the curve (AUC) value of 0.957 was observed when predicted protein binding sites were classified. Remarkably, near perfect accuracy was achieved with an AUC of 0.991 when actual binding sites were classified. The developed method will also be useful for protein design of permanent and transient PBIs. Copyright © 2013 Wiley Periodicals, Inc.
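The AUC values quoted above summarise how well a score separates two classes. A minimal sketch of the measure, assuming binary labels (1 = permanent, 0 = transient is a labelling chosen here for illustration) and one real-valued score per interface: the AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The function name and the toy data are hypothetical and are not taken from BindML+.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example with made-up scores: one positive is ranked below one negative.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported 0.957 and 0.991. The pairwise loop is quadratic in the number of samples, so a rank-based version would be preferable on large datasets.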
Neuwald, Andrew F
2009-08-01
The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical. This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences. A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu. Supplementary data are available at Bioinformatics online.
Pinaud, Fabien; Michalet, Xavier; Iyer, Gopal; Margeat, Emmanuel; Moore, Hsiao-Ping; Weiss, Shimon
2009-01-01
Recent experimental developments have led to a revision of the classical fluid mosaic model proposed by Singer and Nicolson 35 years ago. In particular, it is now well established that lipids and proteins diffuse heterogeneously in cell plasma membranes. Their complex motion patterns reflect the dynamic structure and composition of the membrane itself, as well as the presence of the underlying cytoskeleton scaffold and that of the extracellular matrix. How the structural organization of plasma membranes influences the diffusion of individual proteins remains a challenging, yet central, question for cell signaling and its regulation. Here we have developed a raft-associated glycosylphosphatidylinositol-anchored avidin test probe (Av-GPI), whose diffusion patterns indirectly report on the structure and dynamics of putative raft microdomains in the membrane of HeLa cells. Labeling with quantum dots (qdots) allowed high-resolution and long-term tracking of individual Av-GPI and the classification of their various diffusive behaviors. Using dual-color total internal reflection fluorescence (TIRF) microscopy, we studied the correlation between the diffusion of individual Av-GPI and the location of glycosphingolipid GM1-rich microdomains and caveolae. We show that Av-GPI exhibit a fast and a slow diffusion regime in different membrane regions, and that slowing down of their diffusion is correlated with entry in GM1-rich microdomains located in close proximity to, but distinct from, caveolae. We further show that Av-GPI dynamically partition in and out of these microdomains in a cholesterol-dependent manner. Our results provide direct evidence that cholesterol/sphingolipid-rich microdomains can compartmentalize the diffusion of GPI-anchored proteins in living cells and that the dynamic partitioning raft model appropriately describes the diffusive behavior of some raft-associated proteins across the plasma membrane. PMID:19416475
Pinaud, Fabien; Michalet, Xavier; Iyer, Gopal; Margeat, Emmanuel; Moore, Hsiao-Ping; Weiss, Shimon
2009-06-01
Recent experimental developments have led to a revision of the classical fluid mosaic model proposed by Singer and Nicholson more than 35 years ago. In particular, it is now well established that lipids and proteins diffuse heterogeneously in cell plasma membranes. Their complex motion patterns reflect the dynamic structure and composition of the membrane itself, as well as the presence of the underlying cytoskeleton scaffold and that of the extracellular matrix. How the structural organization of plasma membranes influences the diffusion of individual proteins remains a challenging, yet central, question for cell signaling and its regulation. Here we have developed a raft-associated glycosyl-phosphatidyl-inositol-anchored avidin test probe (Av-GPI), whose diffusion patterns indirectly report on the structure and dynamics of putative raft microdomains in the membrane of HeLa cells. Labeling with quantum dots (qdots) allowed high-resolution and long-term tracking of individual Av-GPI and the classification of their various diffusive behaviors. Using dual-color total internal reflection fluorescence (TIRF) microscopy, we studied the correlation between the diffusion of individual Av-GPI and the location of glycosphingolipid GM1-rich microdomains and caveolae. We show that Av-GPI exhibit a fast and a slow diffusion regime in different membrane regions, and that slowing down of their diffusion is correlated with entry in GM1-rich microdomains located in close proximity to, but distinct, from caveolae. We further show that Av-GPI dynamically partition in and out of these microdomains in a cholesterol-dependent manner. Our results provide direct evidence that cholesterol-/sphingolipid-rich microdomains can compartmentalize the diffusion of GPI-anchored proteins in living cells and that the dynamic partitioning raft model appropriately describes the diffusive behavior of some raft-associated proteins across the plasma membrane.
Pollock, Samuel B; Hu, Amy; Mou, Yun; Martinko, Alexander J; Julien, Olivier; Hornsby, Michael; Ploder, Lynda; Adams, Jarrett J; Geng, Huimin; Müschen, Markus; Sidhu, Sachdev S; Moffat, Jason; Wells, James A
2018-03-13
Human cells express thousands of different surface proteins that can be used for cell classification, or to distinguish healthy and disease conditions. A method capable of profiling a substantial fraction of the surface proteome simultaneously and inexpensively would enable more accurate and complete classification of cell states. We present a highly multiplexed and quantitative surface proteomic method using genetically barcoded antibodies called phage-antibody next-generation sequencing (PhaNGS). Using 144 preselected antibodies displayed on filamentous phage (Fab-phage) against 44 receptor targets, we assess changes in B cell surface proteins after the development of drug resistance in a patient with acute lymphoblastic leukemia (ALL) and in adaptation to oncogene expression in a Myc-inducible Burkitt lymphoma model. We further show PhaNGS can be applied at the single-cell level. Our results reveal that a common set of proteins including FLT3, NCR3LG1, and ROR1 dominate the response to similar oncogenic perturbations in B cells. Linking high-affinity, selective, genetically encoded binders to NGS enables direct and highly multiplexed protein detection, comparable to RNA-sequencing for mRNA. PhaNGS has the potential to profile a substantial fraction of the surface proteome simultaneously and inexpensively to enable more accurate and complete classification of cell states. Copyright © 2018 the Author(s). Published by PNAS.
Functional annotation from the genome sequence of the giant panda.
Huo, Tong; Zhang, Yinjie; Lin, Jianping
2012-08-01
The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis.
FRAN and RBF-PSO as two components of a hyper framework to recognize protein folds.
Abbasi, Elham; Ghatee, Mehdi; Shiri, M E
2013-09-01
In this paper, an intelligent hyper framework is proposed to recognize protein folds from amino acid sequences, which is a fundamental problem in bioinformatics. This framework includes some statistical and intelligent algorithms for protein classification. The main components of the proposed framework are the Fuzzy Resource-Allocating Network (FRAN) and the Radial Basis Function based on Particle Swarm Optimization (RBF-PSO). FRAN applies a dynamic method to tune the RBF network parameters. Due to the complexity of the patterns captured in protein datasets, FRAN classifies the proteins under fuzzy conditions. Also, RBF-PSO applies PSO to tune the RBF classifier. Experimental results demonstrate that FRAN improves prediction accuracy up to 51% and achieves acceptable multi-class results for protein fold prediction. Although RBF-PSO provides reasonable results for protein fold recognition up to 48%, it is weaker than FRAN in some cases. However, the proposed hyper framework provides an opportunity to use a great range of intelligent methods and can learn from previous experiences. Thus it can avoid the weakness of some intelligent methods in terms of memory, computational time and static structure. Furthermore, the performance of this system can be enhanced throughout the system life-cycle. Copyright © 2013 Elsevier Ltd. All rights reserved.
Analysis of deep learning methods for blind protein contact prediction in CASP12.
Wang, Sheng; Sun, Siqi; Xu, Jinbo
2018-03-01
Here we present the results of protein contact prediction achieved in CASP12 by our RaptorX-Contact server, which is an early implementation of our deep learning method for contact prediction. On a set of 38 free-modeling target domains with a median family size of around 58 effective sequences, our server obtained an average top L/5 long- and medium-range contact accuracy of 47% and 44%, respectively (L = length). A complete implementation has an average accuracy of 59% and 57%, respectively. Our deep learning method formulates contact prediction as a pixel-level image labeling problem and simultaneously predicts all residue pairs of a protein using a combination of two deep residual neural networks, taking as input the residue conservation information, predicted secondary structure and solvent accessibility, contact potential, and coevolution information. Our approach differs from existing methods mainly in (1) formulating contact prediction as a pixel-level image labeling problem instead of an image-level classification problem; (2) simultaneously predicting all contacts of an individual protein to make effective use of contact occurrence patterns; and (3) integrating both one-dimensional and two-dimensional deep convolutional neural networks to effectively learn complex sequence-structure relationship including high-order residue correlation. This paper discusses the RaptorX-Contact pipeline, both contact prediction and contact-based folding results, and finally the strength and weakness of our method. © 2017 Wiley Periodicals, Inc.
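The "top L/5" accuracy quoted above can be illustrated with a small sketch. It assumes the sequence-separation ranges commonly used in CASP evaluations (long-range: at least 24 residues apart; medium-range: 12 to 23), which the abstract itself does not spell out. The function name and input layout are hypothetical and are not taken from the RaptorX-Contact code.

```python
def top_contact_precision(scores, true_contacts, length,
                          min_sep=24, max_sep=None, frac=5):
    """Precision of the top L/frac predicted pairs in a separation range.

    scores: dict mapping (i, j), i < j, to a predicted contact probability.
    true_contacts: set of (i, j) pairs in contact in the native structure.
    length: protein length L.
    min_sep/max_sep: assumed separation bounds (24+ for long-range).
    """
    # Keep only pairs whose sequence separation lies in the requested range.
    candidates = [
        (p, pair) for pair, p in scores.items()
        if pair[1] - pair[0] >= min_sep
        and (max_sep is None or pair[1] - pair[0] <= max_sep)
    ]
    # Rank by predicted probability, highest first, and keep the top L/frac.
    candidates.sort(key=lambda item: item[0], reverse=True)
    top = candidates[:max(1, length // frac)]
    if not top:
        return 0.0
    hits = sum(1 for _, pair in top if pair in true_contacts)
    return hits / len(top)
```

Calling it with `min_sep=12, max_sep=23` would give the medium-range figure under the same assumptions.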
DOE Office of Scientific and Technical Information (OSTI.GOV)
Constantinidis, I.; Satterlee, J.D.; Pandey, R.K.
1988-04-19
This work indicates a high degree of purity for our preparations of all three of the primary Glycera dibranchiata monomer hemoglobins and details assignments of the heme methyl and vinyl protons in the hyperfine shift region of the ferric (aquo.) protein forms. The assignments were carried out by reconstituting the apoproteins of each component with selectively deuteriated hemes. The results indicate that even though the individual component preparations consist of essentially a single protein, the proton NMR spectra indicate spectroscopic heterogeneity. Evidence is presented for identification and classification of major and minor protein forms that are present in solutions of each component. Finally, in contrast to previous results, a detailed analysis of the proton hyperfine shift patterns of the major and minor forms of each component, in comparison to the major and minor forms of metmyoglobin, leads to the conclusions that the corresponding forms of the proteins from each species have strikingly similar heme-globin contacts and display nearly identical heme electronic structures and coordination numbers.
Zhao, Jie
2010-01-01
Arabinogalactan proteins (AGPs) comprise a family of hydroxyproline-rich glycoproteins that are implicated in plant growth and development. In this study, 69 AGPs are identified from the rice genome, including 13 classical AGPs, 15 arabinogalactan (AG) peptides, three non-classical AGPs, three early nodulin-like AGPs (eNod-like AGPs), eight non-specific lipid transfer protein-like AGPs (nsLTP-like AGPs), and 27 fasciclin-like AGPs (FLAs). The results from expressed sequence tags, microarrays, and massively parallel signature sequencing tags are used to analyse the expression of AGP-encoding genes, which is confirmed by real-time PCR. The results reveal that several rice AGP-encoding genes are predominantly expressed in anthers and display differential expression patterns in response to abscisic acid, gibberellic acid, and abiotic stresses. Based on the results obtained from this analysis, an attempt has been made to link the protein structures and expression patterns of rice AGP-encoding genes to their functions. Taken together, the genome-wide identification and expression analysis of the rice AGP gene family might facilitate further functional studies of rice AGPs. PMID:20423940
NASA Astrophysics Data System (ADS)
Wan, Yi
2011-06-01
Chinese wines can be classified or graded by their micrographs. Micrographs of Chinese wines show floccules, sticks and granules of varying shape and size. Because different wines have different microstructures and micrographs, we study the classification of Chinese wines based on their micrographs. The shape and structure of a wine's particles in its microstructure are the most important features for recognizing and classifying wines. We therefore introduce a feature extraction method that efficiently describes the structure and region shape of a micrograph. First, the micrographs are enhanced using total variation denoising and segmented using a modified Otsu's method based on the Rayleigh distribution. Then features are extracted using the method proposed in the paper, based on area, perimeter and traditional shape features. A total of 26 features of eight kinds are selected. Finally, a Chinese wine classification system based on micrographs, combining shape and structure features with a BP neural network, is presented. We compare the recognition results for different choices of features (traditional shape features or the proposed features). The experimental results show that a better classification rate is achieved using the combined features proposed in this paper.
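For orientation, the segmentation step can be sketched with the standard Otsu threshold, which picks the grey level that maximizes the between-class variance of the histogram. This is the textbook version only; the Rayleigh-based modification mentioned in the abstract is not described there and is not reproduced here. The function name and the flat list-of-pixels input are assumptions for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t maximizing between-class variance.

    pixels: iterable of integer grey values in [0, levels).
    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight0, sum0 = 0, 0
    for t in range(levels):
        weight0 += hist[t]            # number of pixels at or below t
        if weight0 == 0:
            continue                  # no pixels in the lower class yet
        weight1 = total - weight0     # number of pixels above t
        if weight1 == 0:
            break                     # upper class is empty from here on
        sum0 += t * hist[t]
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        # Between-class variance, up to a constant factor 1/total**2.
        var_between = weight0 * weight1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

When several thresholds give the same variance, this sketch keeps the lowest one.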
MultiSeq: unifying sequence and structure data for evolutionary analysis
Roberts, Elijah; Eargle, John; Wright, Dan; Luthey-Schulten, Zaida
2006-01-01
Background Since the publication of the first draft of the human genome in 2000, bioinformatic data have been accumulating at an overwhelming pace. Currently, more than 3 million sequences and 35 thousand structures of proteins and nucleic acids are available in public databases. Finding correlations in and between these data to answer critical research questions is extremely challenging. This problem needs to be approached from several directions: information science to organize and search the data; information visualization to assist in recognizing correlations; mathematics to formulate statistical inferences; and biology to analyze chemical and physical properties in terms of sequence and structure changes. Results Here we present MultiSeq, a unified bioinformatics analysis environment that allows one to organize, display, align and analyze both sequence and structure data for proteins and nucleic acids. While special emphasis is placed on analyzing the data within the framework of evolutionary biology, the environment is also flexible enough to accommodate other usage patterns. The evolutionary approach is supported by the use of predefined metadata, adherence to standard ontological mappings, and the ability for the user to adjust these classifications using an electronic notebook. MultiSeq contains a new algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of a homologous group of distantly related proteins. The method, based on the multidimensional QR factorization of multiple sequence and structure alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. 
Conclusion MultiSeq is a major extension of the Multiple Alignment tool that is provided as part of VMD, a structural visualization program for analyzing molecular dynamics simulations. Both are freely distributed by the NIH Resource for Macromolecular Modeling and Bioinformatics and MultiSeq is included with VMD starting with version 1.8.5. The MultiSeq website has details on how to download and use the software: PMID:16914055
Classification of Ancient Mammal Individuals Using Dental Pulp MALDI-TOF MS Peptide Profiling
Tran, Thi-Nguyen-Ny; Aboudharam, Gérard; Gardeisen, Armelle; Davoust, Bernard; Bocquet-Appel, Jean-Pierre; Flaudrops, Christophe; Belghazi, Maya; Raoult, Didier; Drancourt, Michel
2011-01-01
Background The classification of ancient animal corpses at the species level remains a challenging task for forensic scientists and anthropologists. Severe damage and mixed, tiny pieces originating from several skeletons may render morphological classification virtually impossible. Standard approaches are based on sequencing mitochondrial and nuclear targets. Methodology/Principal Findings We present a method that can accurately classify mammalian species using dental pulp and mass spectrometry peptide profiling. Our work was organized into three successive steps. First, after extracting proteins from the dental pulp collected from 37 modern individuals representing 13 mammalian species, trypsin-digested peptides were used for matrix-assisted laser desorption/ionization time-of-flight mass spectrometry analysis. The resulting peptide profiles accurately classified every individual at the species level in agreement with the parallel cytochrome b gene sequencing gold standard. Second, using a 279–modern spectrum database, we blindly classified 33 of 37 teeth collected in 37 modern individuals (89.1%). Third, we classified 10 of 18 teeth (56%) collected in 15 ancient individuals representing five mammal species including human, from five burial sites dating back 8,500 years. Further comparison with an upgraded database comprising ancient specimen profiles yielded 100% classification in ancient teeth. Peptide sequencing yielded 4 and 16 different non-keratin proteins including collagen (alpha-1 type I and alpha-2 type I) in human ancient and modern dental pulp, respectively. Conclusions/Significance Mass spectrometry peptide profiling of the dental pulp is a new approach that can be added to the arsenal of species classification tools for forensics and anthropology as a complementary method to DNA sequencing. The dental pulp is a new source for collagen and other proteins for the species classification of modern and ancient mammal individuals. PMID:21364886
Wang, Shunfang; Nie, Bing; Yue, Kun; Fei, Yu; Li, Wenjia; Xu, Dongshu
2017-12-15
Kernel discriminant analysis (KDA) is a dimension reduction and classification algorithm based on the nonlinear kernel trick, which can be applied in a novel way to high-dimensional and complex biological data before classification tasks such as protein subcellular localization. Kernel parameters have a great impact on the performance of the KDA model. Specifically, for KDA with the popular Gaussian kernel, selecting the scale parameter remains a challenging problem. Thus, this paper introduces the KDA method and proposes a new method for Gaussian kernel parameter selection based on the observation that the differences between reconstruction errors of edge normal samples and those of interior normal samples should be maximized for certain suitable kernel parameters. Experiments with various standard data sets of protein subcellular localization show that the overall accuracy of protein classification prediction with KDA is much higher than that without KDA. Meanwhile, the kernel parameter of KDA has a great impact on the efficiency, and the proposed method can produce an optimum parameter, which makes the new algorithm not only perform as effectively as the traditional ones, but also reduce the computational time and thus improve efficiency.
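To make the role of the scale parameter concrete, here is a minimal sketch of a Gaussian kernel. It assumes the common parameterization exp(-||x - y||^2 / (2 sigma^2)); the abstract does not say which form the authors use, and the function name is hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel between two equal-length feature vectors.

    sigma is the scale parameter: small values make the kernel fall off
    quickly with distance, large values make all points look similar.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

With the same pair of points, increasing `sigma` moves the kernel value toward 1, which is why its choice affects how well classes separate.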
MIPS: a database for protein sequences, homology data and yeast genome information.
Mewes, H W; Albermann, K; Heumann, K; Liebl, S; Pfeiffer, F
1997-01-01
The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database. MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/) MIPS permits internet access to sequence databases, homology data and to yeast genome information. (i) Sequence similarity results from the FASTA program are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome, the functional classification of yeast genes (FunCat) and its graphical display, the 'Genome Browser'. A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request. PMID:9016498
NASA Astrophysics Data System (ADS)
Dondeyne, Stefaan; Juilleret, Jérôme; Vancampenhout, Karen; Deckers, Jozef; Hissler, Christophe
2017-04-01
Classification of soils in both World Reference Base for soil resources (WRB) and Soil Taxonomy hinges on the identification of diagnostic horizons and characteristics. However as these features often occur within the first 100 cm, these classification systems convey little information on subsoil characteristics. An integrated knowledge of the soil, soil-to-substratum and deeper substratum continuum is required when dealing with environmental issues such as vegetation ecology, water quality or the Critical Zone in general. Therefore, we recently proposed a classification system of the subsolum complementing current soil classification systems. By reflecting on the structure of the subsoil classification system which is inspired by WRB, we aim at fostering a discussion on some potential future developments of WRB. For classifying the subsolum we define Regolite, Saprolite, Saprock and Bedrock as four Subsolum Reference Groups each corresponding to different weathering stages of the subsoil. Principal qualifiers can be used to categorize intergrades of these Subsoil Reference Groups while morphologic and lithologic characteristics can be presented with supplementary qualifiers. We argue that adopting a low hierarchical structure - akin to WRB and in contrast to a strong hierarchical structure as in Soil Taxonomy - offers the advantage of having an open classification system avoiding the need for a priori knowledge of all possible combinations which may be encountered in the field. Just as in WRB we also propose to use principal and supplementary qualifiers as a second level of classification. However, in contrast to WRB we propose to reserve the principal qualifiers for intergrades and to regroup the supplementary qualifiers into thematic categories (morphologic or lithologic). 
Structuring the qualifiers in this manner should facilitate the integration and handling of both soil and subsoil classification units into soil information systems and calls for paying attention to these structural issues in future developments of WRB.
77 FR 44456 - Classification of Two Steroids, Prostanozol
Federal Register 2010, 2011, 2012, 2013, 2014
2012-07-30
... by positive nitrogen balance and protein metabolism, resulting in increases in protein synthesis and... activity by means of nitrogen balance and androgenic activity based on weight changes of the ventral...
Classification of polytype structures of zinc sulfide
DOE Office of Scientific and Technical Information (OSTI.GOV)
Laptev, V.I.
1994-12-31
It is suggested that the existing classification of polytype structures of zinc sulfide be supplemented with an additional criterion: the characteristic of regular point systems (Wyckoff positions) including their type, number, and multiplicity. The consideration of the Wyckoff positions allowed the establishment of construction principles of known polytype series of different symmetries and the systematization (for the first time) of the polytypes with the same number of differently packed layers. The classification suggested for polytype structures of zinc sulfide is compact and provides a basis for creating search systems. The classification table obtained can also be used for numerous silicon carbide polytypes. 8 refs., 4 tabs.
Novel functions of CCM1 delimit the relationship of PTB/PH domains.
Zhang, Jun; Dubey, Pallavi; Padarti, Akhil; Zhang, Aileen; Patel, Rinkal; Patel, Vipulkumar; Cistola, David; Badr, Ahmed
2017-10-01
Three NPXY motifs and one FERM domain in CCM1 make it a versatile scaffold protein for tethering the signaling components together within the CCM signaling complex (CSC). The cellular role of CCM1 protein remains inadequately expounded. Both phosphotyrosine binding (PTB) and pleckstrin homology (PH) domains were recognized as structurally related but functionally distinct domains. We used molecular cloning, protein binding assays and RT-qPCR to identify novel cellular partners of CCM1 and its cellular expression patterns. We also screened candidate PTB/PH proteins and then combined structural simulation with current X-ray crystallography and NMR data to define the essential structure of the PTB/PH domain for NPXY binding and the relationship among the PTB, PH and FERM domain(s). We identified a group of 28 novel cellular partners of CCM1, all of which contain either PTB or PH domain(s), and developed a novel classification system for these PTB/PH proteins based on their relationship with different NPXY motifs of CCM1. Our results demonstrated that CCM1 has a wide spectrum of binding to different PTB/PH proteins and perpetuates their specificity to interact with certain PTB/PH domains through selective combination of three NPXY motifs. We also demonstrated that CCM1 can be assembled into oligomers through intermolecular interaction between its F3 lobe in the FERM domain and one of the three NPXY motifs. Despite being embedded in the FERM domain as the F3 lobe, the F3 module acts as a fully functional PH domain to interact with the NPXY motif. The most salient feature of the study was that both PTB and PH domains are structurally and functionally comparable, suggesting that the PTB domain likely evolved from the PH domain with polymorphic structural additions at its N-terminus. A new β1A-strand of the PTB domain was discovered and a new minimum structural requirement of the PTB/PH domain for NPXY motif-binding was determined.
Based on our data, a novel theory of structure, function and relationship of PTB, PH and FERM domains has been proposed, which extends the importance of the NPXY-PTB/PH interaction on the CSC signaling and/or other cell receptors with great potential pointing to new therapeutic strategies. The study provides new insight into the structural characteristics of PTB/PH domains, essential structural elements of PTB/PH domain required for NPXY motif-binding, and function and relationship among PTB, PH and FERM domains. Copyright © 2017 Elsevier B.V. All rights reserved.
Brindley, Amanda A; Raux, Evelyne; Leech, Helen K; Schubert, Heidi L; Warren, Martin J
2003-06-20
The cobaltochelatase required for the synthesis of vitamin B12 (cobalamin) in the archaeal kingdom has been identified as CbiX through similarity searching with the CbiX from Bacillus megaterium. However, the CbiX proteins in the archaea are much shorter than the CbiX proteins found in eubacteria, typically containing less than half the number of amino acids in their primary structure. For this reason the shorter CbiX proteins have been termed CbiXS and the longer versions CbiXL. The CbiXS proteins from Methanosarcina barkeri and Methanobacter thermoautotrophicum were overproduced in Escherichia coli as recombinant proteins and characterized. Through complementation studies of a defined chelatase-deficient strain of E. coli and by direct in vitro assays the function of CbiXS as a sirohydrochlorin cobaltochelatase has been demonstrated. On the basis of sequence alignments and conserved active site residues we suggest that CbiXS may represent a primordial chelatase, giving rise to larger chelatases such as CbiXL, SirB, CbiK, and HemH through gene duplication and subsequent variation and selection. A classification scheme for chelatases is proposed.
Detecting similarities among distant homologous proteins by comparison of domain flexibilities.
Pandini, Alessandro; Mauri, Giancarlo; Bordogna, Annalisa; Bonati, Laura
2007-06-01
The aim of this work is to assess the informativeness of protein dynamics in the detection of similarities among distant homologous proteins. To this end, an approach to perform large-scale comparisons of protein domain flexibilities is proposed. CONCOORD is confirmed as a reliable method for fast conformational sampling. The root mean square fluctuation of alpha carbon positions in the essential dynamics subspace is employed as a measure of local flexibility and a synthetic index of similarity is presented. The dynamics of a large collection of protein domains from ASTRAL/SCOP40 is analyzed and the possibility to identify relationships, at both the family and the superfamily levels, on the basis of the dynamical features is discussed. The obtained picture is in agreement with the SCOP classification, and furthermore suggests the presence of a distinguishable familiar trend in the flexibility profiles. The results support the complementarity of the dynamical and the structural information, suggesting that information from dynamics analysis can arise from functional similarities, often partially hidden by a static comparison. On the basis of this first test, flexibility annotation can be expected to help in automatically detecting functional similarities otherwise unrecoverable.
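The flexibility measure named above, the root mean square fluctuation (RMSF) of alpha carbon positions, can be sketched as follows. This is a simplified illustration that computes plain Cartesian RMSF over frames assumed to be already superimposed; it does not perform the projection onto the essential dynamics subspace described in the abstract. The function name and input layout are hypothetical.

```python
import math

def rmsf(trajectory):
    """Per-atom RMSF over a set of pre-aligned conformations.

    trajectory: list of frames; each frame is a list of (x, y, z)
    coordinates, one tuple per alpha carbon, in the same atom order.
    Returns a list with one fluctuation value per atom.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    fluctuations = []
    for atom in range(n_atoms):
        # Mean position of this atom across all frames.
        mean = [sum(frame[atom][d] for frame in trajectory) / n_frames
                for d in range(3)]
        # Mean squared deviation from that mean position.
        msd = sum(
            sum((frame[atom][d] - mean[d]) ** 2 for d in range(3))
            for frame in trajectory
        ) / n_frames
        fluctuations.append(math.sqrt(msd))
    return fluctuations
```

The resulting per-residue profile is the kind of vector that two domains could then be compared on, for example by correlation over aligned positions.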
EKPD: a hierarchical database of eukaryotic protein kinases and protein phosphatases.
Wang, Yongbo; Liu, Zexian; Cheng, Han; Gao, Tianshun; Pan, Zhicheng; Yang, Qing; Guo, Anyuan; Xue, Yu
2014-01-01
We present here EKPD (http://ekpd.biocuckoo.org), a hierarchical database of eukaryotic protein kinases (PKs) and protein phosphatases (PPs), the key molecules responsible for the reversible phosphorylation of proteins that are involved in almost all aspects of biological processes. As extensive experimental and computational efforts have been carried out to identify PKs and PPs, an integrative resource with detailed classification and annotation information would be of great value for both experimentalists and computational biologists. In this work, we first collected 1855 PKs and 347 PPs from the scientific literature and various public databases. Based on previously established rationales, we classified all of the known PKs and PPs into a hierarchical structure with three levels, i.e. group, family and individual PK/PP. There are 10 groups with 149 families for the PKs and 10 groups with 33 families for the PPs. We constructed 139 and 27 Hidden Markov Model profiles for PK and PP families, respectively. Then we systematically characterized ∼50,000 PKs and >10,000 PPs in eukaryotes. In addition, >500 PKs and >400 PPs were computationally identified by ortholog search. Finally, the online service of the EKPD database was implemented in PHP + MySQL + JavaScript.
The COG database: a tool for genome-scale analysis of protein functions and evolution
Tatusov, Roman L.; Galperin, Michael Y.; Natale, Darren A.; Koonin, Eugene V.
2000-01-01
Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The database of Clusters of Orthologous Groups of proteins (COGs) is an attempt at a phylogenetic classification of the proteins encoded in 21 complete genomes of bacteria, archaea and eukaryotes (http://www.ncbi.nlm.nih.gov/COG). The COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The database comprises 2091 COGs that include 56–83% of the gene products from each of the complete bacterial and archaeal genomes and ~35% of those from the yeast Saccharomyces cerevisiae genome. The COG database is accompanied by the COGNITOR program that is used to fit new proteins into the COGs and can be applied to functional and phylogenetic annotation of newly sequenced genomes. PMID:10592175
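One simplified reading of "consistency of genome-specific best hits" is to look for triples of genes from three different genomes in which every pair is a mutual best hit. The sketch below illustrates only that reading; it is an assumption for illustration, not the published COG procedure, and the function names and input dictionaries are hypothetical.

```python
from itertools import combinations

def is_reciprocal(best_hit, genome_of, a, b):
    """True if a's best hit in b's genome is b, and vice versa."""
    return (best_hit.get((a, genome_of[b])) == b
            and best_hit.get((b, genome_of[a])) == a)

def consistent_triangles(best_hit, genome_of):
    """Find gene triples from three genomes with pairwise mutual best hits.

    best_hit: dict mapping (gene, target_genome) to the best-scoring gene
              in target_genome.
    genome_of: dict mapping each gene to the genome that encodes it.
    """
    triangles = []
    for a, b, c in combinations(sorted(genome_of), 3):
        # Each corner of a triangle must come from a different genome.
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue
        if all(is_reciprocal(best_hit, genome_of, x, y)
               for x, y in ((a, b), (b, c), (a, c))):
            triangles.append((a, b, c))
    return triangles
```

Triangles that share genes could then be merged into larger clusters, which is the step this sketch leaves out.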
Discovering the intelligence in molecular biology.
Uberbacher, E
1995-12-01
The Third International Conference on Intelligent Systems in Molecular Biology was truly an outstanding event. Computational methods in molecular biology have reached a new level of maturity and utility, resulting in many high-impact applications. The success of this meeting bodes well for the rapid and continuing development of computational methods, intelligent systems and information-based approaches for the biosciences. The basic technology, originally most often applied to 'feasibility' problems, is now dealing effectively with the most difficult real-world problems. Significant progress has been made in understanding protein-structure information, structural classification, and how functional information and the relevant features of active-site geometry can be gleaned from structures by automated computational approaches. The value and limits of homology-based methods, and the ability to classify proteins by structure in the absence of homology, have reached a new level of sophistication. New methods for covariation analysis in the folding of large structures such as RNAs have shown remarkably good results, indicating the long-term potential to understand very complicated molecules and multimolecular complexes using computational means. Novel methods, such as HMMs, context-free grammars and the uses of mutual information theory, have taken center stage as highly valuable tools in our quest to represent and characterize biological information. A focus on creative uses of intelligent systems technologies and the trend toward biological application will undoubtedly continue and grow at the 1996 ISMB meeting in St Louis.
Torrens, Francisco; Castellano, Gloria
2014-06-05
Pesticide residues in wine were analyzed by liquid chromatography-tandem mass spectrometry. Retentions are modelled by structure-property relationships. Bioplastic evolution is an evolutionary perspective conjugating effect of acquired characters and evolutionary indeterminacy-morphological determination-natural selection principles; its application to design co-ordination index barely improves correlations. Fractal dimensions and partition coefficient differentiate pesticides. Classification algorithms are based on information entropy and its production. Pesticides allow a structural classification by nonplanarity, and number of O, S, N and Cl atoms and cycles; different behaviours depend on number of cycles. The novelty of the approach is that the structural parameters are related to retentions. Classification algorithms are based on information entropy. When applying procedures to moderate-sized sets, excessive results appear compatible with data suffering a combinatorial explosion. However, equipartition conjecture selects criterion resulting from classification between hierarchical trees. Information entropy permits classifying compounds agreeing with principal component analyses. Periodic classification shows that pesticides in the same group present similar properties; those also in equal period, maximum resemblance. The advantage of the classification is to predict the retentions for molecules not included in the categorization. Classification extends to phenyl/sulphonylureas and the application will be to predict their retentions.
Reconstruction of the experimentally supported human protein interactome: what can we learn?
2013-01-01
Background Understanding the topology and dynamics of the human protein-protein interaction (PPI) network will significantly contribute to biomedical research, therefore its systematic reconstruction is required. Several meta-databases integrate source PPI datasets, but the protein node sets of their networks vary depending on the PPI data combined. Due to this inherent heterogeneity, the way in which the human PPI network expands via multiple dataset integration has not been comprehensively analyzed. We aim at assembling the human interactome in a global structured way and exploring it to gain insights of biological relevance. Results First, we defined the UniProtKB manually reviewed human “complete” proteome as the reference protein-node set and then we mined five major source PPI datasets for direct PPIs exclusively between the reference proteins. We updated the protein and publication identifiers and normalized all PPIs to the UniProt identifier level. The reconstructed interactome covers approximately 60% of the human proteome and has a scale-free structure. No apparent differentiating gene functional classification characteristics were identified for the unrepresented proteins. The source dataset integration augments the network mainly in PPIs. Polyubiquitin emerged as the highest-degree node, but the inclusion of most of its identified PPIs may be reconsidered. The high number (>300) of connections of the subsequent fifteen proteins correlates well with their essential biological role. According to the power-law network structure, the unrepresented proteins should mainly have up to four connections with equally poorly-connected interactors. Conclusions Reconstructing the human interactome based on the a priori definition of the protein nodes enabled us to identify the currently included part of the human “complete” proteome, and discuss the role of the proteins within the network topology with respect to their function. 
As the network expansion has to comply with the scale-free theory, we suggest that the core of the human interactome has essentially emerged. Thus, it could be employed in systems biology and biomedical research, despite the considerable number of currently unrepresented proteins. The latter are probably involved in specialized physiological conditions, justifying the scarcity of related PPI information, and their identification can assist in designing relevant functional experiments and targeted text mining algorithms. PMID:24088582
Regulation of IAP (Inhibitor of Apoptosis) Gene Expression by the p53 Tumor Suppressor Protein
2005-05-01
adenovirus, gene therapy, polymorphism ... averaged results of three independent experiments, with standard error. Right panel: Level of p53 in infected cells using the antibody Ab-6 (Calbiochem ... with highly purified mitochondria as described in (2). The arrow marks oligomerized BAK. The right panel depicts the purity of BMH Crosslinked Mito
21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.
Code of Federal Regulations, 2010 CFR
2010-04-01
... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...
21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.
Code of Federal Regulations, 2012 CFR
2012-04-01
... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...
21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.
Code of Federal Regulations, 2011 CFR
2011-04-01
... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...
21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.
Code of Federal Regulations, 2013 CFR
2013-04-01
... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...
21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.
Code of Federal Regulations, 2014 CFR
2014-04-01
... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...
Engwegen, Judith Y M N; Helgason, Helgi H; Cats, Annemieke; Harris, Nathan; Bonfrer, Johannes M G; Schellens, Jan H M; Beijnen, Jos H
2006-03-14
To detect new serum biomarkers for colorectal cancer (CRC) by serum protein profiling with surface-enhanced laser desorption ionisation-time of flight mass spectrometry (SELDI-TOF MS). Two independent serum sample sets were analysed separately with the ProteinChip technology (set A: 40 CRC + 49 healthy controls; set B: 37 CRC + 31 healthy controls), using chips with a weak cation exchange moiety and buffer pH 5. Discriminative power of differentially expressed proteins was assessed with a classification tree algorithm. Sensitivities and specificities of the generated classification trees were obtained by blindly applying data from set A to the generated trees from set B and vice versa. CRC serum protein profiles were also compared with those from breast, ovarian, prostate, and non-small cell lung cancer. Mass-to-charge ratios (m/z) 3.1×10^3, 3.3×10^3, 4.5×10^3, 6.6×10^3 and 28×10^3 were used as classifiers in the best-performing classification trees. Tree sensitivities and specificities were between 65% and 90%. Most of these discriminative m/z values were also different in the other tumour types investigated. M/z 3.3×10^3, the main classifier in most trees, was a doubly charged form of the 6.6×10^3-Da protein. The latter was identified as apolipoprotein C-I. M/z 3.1×10^3 was identified as an N-terminal fragment of albumin, and m/z 28×10^3 as apolipoprotein A-I. SELDI-TOF MS followed by classification tree pattern analysis is a suitable technique for finding new serum markers for CRC. Biomarkers can be identified and reproducibly detected in independent sample sets with high sensitivities and specificities. Although not specific for CRC, these biomarkers have a potential role in disease and treatment monitoring.
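The relationship between the 3.3×10^3 and 6.6×10^3 peaks follows from the usual arithmetic for protonated ions. The sketch below assumes ions of the form [M + zH]^z+ and uses the rounded mass from the abstract, so the numbers are illustrative rather than measured values; the function name is hypothetical.

```python
PROTON_MASS = 1.00728  # Da, approximate mass of a proton


def mass_to_charge(neutral_mass, charge):
    """m/z of a protein carrying `charge` extra protons (assumes [M + zH]^z+ ions)."""
    return (neutral_mass + charge * PROTON_MASS) / charge


singly_charged = mass_to_charge(6600.0, 1)  # close to the 6.6e3 peak
doubly_charged = mass_to_charge(6600.0, 2)  # close to the 3.3e3 peak
```

A doubly charged ion of a protein of about 6600 Da therefore appears near m/z 3300, which is consistent with the two classifier peaks belonging to the same protein.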
Classification of Phylogenetic Profiles for Protein Function Prediction: An SVM Approach
NASA Astrophysics Data System (ADS)
Kotaru, Appala Raju; Joshi, Ramesh C.
Predicting the function of an uncharacterized protein is a major challenge in the post-genomic era because of the complexity and scale of the problem. Knowledge of protein function is a crucial link in the development of new drugs, better crops, and even biochemicals such as biofuels. Recently, numerous high-throughput procedures have been devised to investigate the mechanisms underlying a protein's function, and the phylogenetic profile is one of them. A phylogenetic profile is a representation of a protein that encodes its evolutionary history. In this paper we proposed a method for classifying phylogenetic profiles with a supervised machine learning method, support vector machine (SVM) classification with a radial basis function kernel, to identify functionally linked proteins. We experimentally evaluated the performance of the classifier with the linear and polynomial kernels and compared the results with the existing tree kernel. In our study we used proteins of the budding yeast Saccharomyces cerevisiae genome. We generated the phylogenetic profiles of 2465 yeast genes and used the functional annotations that are available in the MIPS database. Our experiments show that the radial basis kernel performs similarly to the polynomial kernel in some functional classes, that both perform better than the linear and tree kernels, and that overall the radial basis kernel outperformed the polynomial, linear and tree kernels. In analyzing these results we show that it is feasible to use an SVM classifier with a radial basis function kernel to predict gene function from phylogenetic profiles.
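As an illustration of the kernel named in this abstract, the following sketch computes a radial basis function (RBF) kernel between phylogenetic profiles. It assumes profiles are binary presence/absence vectors over a fixed list of genomes; the example profiles and the value of `gamma` are made up, and the SVM training step itself is not shown.

```python
import math


def rbf_kernel(x, y, gamma=0.1):
    """RBF kernel exp(-gamma * ||x - y||^2) between two equal-length profiles."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(profiles, gamma=0.1):
    """Pairwise kernel values, the input an SVM solver would work with."""
    return [[rbf_kernel(p, q, gamma) for q in profiles] for p in profiles]


# Hypothetical profiles: 1 = homolog present in that genome, 0 = absent.
profile_a = [1, 1, 0, 1, 0]
profile_b = [1, 1, 0, 1, 0]  # same pattern as profile_a
profile_c = [0, 0, 1, 0, 1]  # complementary pattern

same_pattern = rbf_kernel(profile_a, profile_b)
opposite_pattern = rbf_kernel(profile_a, profile_c)
```

Proteins with identical presence/absence patterns get the maximum kernel value of 1, while dissimilar patterns decay towards 0 at a rate set by `gamma`.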
Knowledge Discovery in Spectral Data by Means of Complex Networks
Zanin, Massimiliano; Papo, David; Solís, José Luis González; Espinosa, Juan Carlos Martínez; Frausto-Reyes, Claudio; Anda, Pascual Palomares; Sevilla-Escoboza, Ricardo; Boccaletti, Stefano; Menasalvas, Ernestina; Sousa, Pedro
2013-01-01
In the last decade, complex networks have widely been applied to the study of many natural and man-made systems, and to the extraction of meaningful information from the interaction structures created by genes and proteins. Nevertheless, less attention has been devoted to metabonomics, due to the lack of a natural network representation of spectral data. Here we define a technique for reconstructing networks from spectral data sets, where nodes represent spectral bins, and pairs of them are connected when their intensities follow a pattern associated with a disease. The structural analysis of the resulting network can then be used to feed standard data-mining algorithms, for instance for the classification of new (unlabeled) subjects. Furthermore, we show how the structure of the network is resilient to the presence of external additive noise, and how it can be used to extract relevant knowledge about the development of the disease. PMID:24957895
Knowledge discovery in spectral data by means of complex networks.
Zanin, Massimiliano; Papo, David; Solís, José Luis González; Espinosa, Juan Carlos Martínez; Frausto-Reyes, Claudio; Anda, Pascual Palomares; Sevilla-Escoboza, Ricardo; Jaimes-Reategui, Rider; Boccaletti, Stefano; Menasalvas, Ernestina; Sousa, Pedro
2013-03-11
In the last decade, complex networks have widely been applied to the study of many natural and man-made systems, and to the extraction of meaningful information from the interaction structures created by genes and proteins. Nevertheless, less attention has been devoted to metabonomics, due to the lack of a natural network representation of spectral data. Here we define a technique for reconstructing networks from spectral data sets, where nodes represent spectral bins, and pairs of them are connected when their intensities follow a pattern associated with a disease. The structural analysis of the resulting network can then be used to feed standard data-mining algorithms, for instance for the classification of new (unlabeled) subjects. Furthermore, we show how the structure of the network is resilient to the presence of external additive noise, and how it can be used to extract relevant knowledge about the development of the disease.
Mallik, Saurav; Kundu, Sudip
2017-07-01
Is the order in which biomolecular subunits self-assemble into functional macromolecular complexes imprinted in their sequence-space? Here, we demonstrate that the temporal order of macromolecular complex self-assembly can be efficiently captured using the landscape of residue-level coevolutionary constraints. This predictive power of coevolutionary constraints is irrespective of the structural, functional, and phylogenetic classification of the complex and of the stoichiometry and quaternary arrangement of the constituent monomers. Combining this result with a number of structural attributes estimated from the crystal structure data, we find indications that stronger coevolutionary constraints at interfaces formed early in the assembly hierarchy probably promote coordinated fixation of mutations that leads to high-affinity binding with higher surface area, increased surface complementarity and elevated number of molecular contacts, compared to those that form late in the assembly. Proteins 2017; 85:1183-1189. © 2017 Wiley Periodicals, Inc.
The Plant Peptidome: An Expanding Repertoire of Structural Features and Biological Functions
Tavormina, Patrizia; De Coninck, Barbara; Nikonorova, Natalia; De Smet, Ive; Cammue, Bruno P.A.
2015-01-01
Peptides fulfill a plethora of functions in plant growth, development, and stress responses. They act as key components of cell-to-cell communication, interfere with signaling and response pathways, or display antimicrobial activity. Strikingly, both the diversity and amount of plant peptides have been largely underestimated. Most characterized plant peptides to date acting as small signaling peptides or antimicrobial peptides are derived from nonfunctional precursor proteins. However, evidence is emerging on peptides derived from a functional protein, directly translated from small open reading frames (without the involvement of a precursor) or even encoded by primary transcripts of microRNAs. These novel types of peptides further add to the complexity of the plant peptidome, even though their number is still limited and functional characterization as well as translational evidence are often controversial. Here, we provide a comprehensive overview of the reported types of plant peptides, including their described functional and structural properties. We propose a novel, unifying peptide classification system to emphasize the enormous diversity in peptide synthesis and consequent complexity of the still expanding knowledge on the plant peptidome. PMID:26276833
A novel, extremely alkaliphilic and cold-active esterase from Antarctic desert soil.
Hu, Xiao Ping; Heath, Caroline; Taylor, Mark Paul; Tuffin, Marla; Cowan, Don
2012-01-01
A novel, cold-active and highly alkaliphilic esterase was isolated from an Antarctic desert soil metagenomic library by functional screening. The 1,044 bp gene sequence contained several conserved regions common to lipases/esterases, but lacked clear classification based on sequence analysis alone. Moderate (<40%) amino acid sequence similarity to known esterases was apparent (the closest neighbour being a hypothetical protein from Chitinophaga pinensis), despite phylogenetic distance to many of the lipolytic "families". The enzyme functionally demonstrated activity towards shorter chain p-nitrophenyl esters with the optimal activity recorded towards p-nitrophenyl propionate (C3). The enzyme possessed an apparent T(opt) at 20°C and a pH optimum at pH 11. Esterases possessing such extreme alkaliphily are rare and so this enzyme represents an intriguing novel locus in protein sequence space. A metagenomic approach has been shown, in this case, to yield an enzyme with quite different sequential/structural properties to known lipases. It serves as an excellent candidate for analysis of the molecular mechanisms responsible for both cold and alkaline activity and novel structure-function relationships of esterase activity.
Structure-based CoMFA as a predictive model - CYP2C9 inhibitors as a test case.
Yasuo, Kazuya; Yamaotsu, Noriyuki; Gouda, Hiroaki; Tsujishita, Hideki; Hirono, Shuichi
2009-04-01
In this study, we tried to establish a general scheme to create a model that could predict the affinity of small compounds to their target proteins. This scheme consists of a search for ligand-binding sites on a protein, generation of bound conformations (poses) of ligands in each of the sites by docking, identification of the correct poses of each ligand by consensus scoring and MM-PBSA analysis, and construction of a CoMFA model with the obtained poses to predict the affinity of the ligands. By using a crystal structure of CYP2C9 and twenty known CYP inhibitors as a test case, we obtained a CoMFA model with good statistics, which suggested that the classification of the binding sites as well as the predicted bound poses of the ligands should be reasonable enough. The scheme described here would give a method to predict the affinity of small compounds with reasonable accuracy, which is expected to heighten the value of computational chemistry in the drug design process.
Structural Bridges through Fold Space.
Edwards, Hannah; Deane, Charlotte M
2015-09-01
Several protein structure classification schemes exist that partition the protein universe into structural units called folds. Yet these schemes do not discuss how these units sit relative to each other in a global structure space. In this paper we construct networks that describe such global relationships between folds in the form of structural bridges. We generate these networks using four different structural alignment methods across multiple score thresholds. The networks constructed using the different methods remain a similar distance apart regardless of the probability threshold defining a structural bridge. This suggests that at least some structural bridges are method specific and that any attempt to build a picture of structural space should not be reliant on a single structural superposition method. Despite these differences all representations agree on an organisation of fold space into five principal community structures: all-α, all-β sandwiches, all-β barrels, α/β and α + β. We project estimated fold ages onto the networks and find that not only are the pairings of unconnected folds associated with higher age differences than bridged folds, but this difference increases with the number of networks displaying an edge. We also examine different centrality measures for folds within the networks and how these relate to fold age. While these measures interpret the central core of fold space in varied ways they all identify the disposition of ancestral folds to fall within this core and that of the more recently evolved structures to provide the peripheral landscape. These findings suggest that evolutionary information is encoded along these structural bridges. Finally, we identify four highly central pivotal folds representing dominant topological features which act as key attractors within our landscapes.
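The idea of a network of structural bridges can be sketched as a thresholded similarity graph. The code below is a minimal illustration, assuming that a bridge is drawn between two folds whenever a pairwise alignment score reaches a chosen threshold; the fold names, the scores and the use of degree centrality are hypothetical choices and not the measures used in the paper.

```python
def build_bridge_network(pair_scores, threshold):
    """Adjacency sets for folds; an edge exists when the score reaches the threshold."""
    adjacency = {}
    for (fold_a, fold_b), score in pair_scores.items():
        adjacency.setdefault(fold_a, set())
        adjacency.setdefault(fold_b, set())
        if score >= threshold:
            adjacency[fold_a].add(fold_b)
            adjacency[fold_b].add(fold_a)
    return adjacency


def degree_centrality(adjacency):
    """Fraction of the other folds that each fold is bridged to."""
    n = len(adjacency)
    if n < 2:
        return {fold: 0.0 for fold in adjacency}
    return {fold: len(neighbours) / (n - 1) for fold, neighbours in adjacency.items()}


# Made-up alignment scores between four hypothetical folds.
example_scores = {
    ("fold_a", "fold_b"): 0.9,
    ("fold_b", "fold_c"): 0.8,
    ("fold_a", "fold_c"): 0.2,
    ("fold_c", "fold_d"): 0.7,
}
network = build_bridge_network(example_scores, threshold=0.5)
centrality = degree_centrality(network)
```

Rebuilding the network at several thresholds, or from scores produced by different alignment methods, gives the kind of comparison the abstract describes.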
Chen, Xiang; Velliste, Meel; Murphy, Robert F.
2010-01-01
Proteomics, the large scale identification and characterization of many or all proteins expressed in a given cell type, has become a major area of biological research. In addition to information on protein sequence, structure and expression levels, knowledge of a protein’s subcellular location is essential to a complete understanding of its functions. Currently subcellular location patterns are routinely determined by visual inspection of fluorescence microscope images. We review here research aimed at creating systems for automated, systematic determination of location. These employ numerical feature extraction from images, feature reduction to identify the most useful features, and various supervised learning (classification) and unsupervised learning (clustering) methods. These methods have been shown to perform significantly better than human interpretation of the same images. When coupled with technologies for tagging large numbers of proteins and high-throughput microscope systems, the computational methods reviewed here enable the new subfield of location proteomics. This subfield will make critical contributions in two related areas. First, it will provide structured, high-resolution information on location to enable Systems Biology efforts to simulate cell behavior from the gene level on up. Second, it will provide tools for Cytomics projects aimed at characterizing the behaviors of all cell types before, during and after the onset of various diseases. PMID:16752421
2010-01-01
Background The extended light-harvesting complex (LHC) protein superfamily is a centerpiece of eukaryotic photosynthesis, comprising the LHC family and several families involved in photoprotection, like the LHC-like and the photosystem II subunit S (PSBS). The evolution of this complex superfamily has long remained elusive, partially due to previously missing families. Results In this study we present a meticulous search for LHC-like sequences in public genome and expressed sequence tag databases covering twelve representative photosynthetic eukaryotes from the three primary lineages of plants (Plantae): glaucophytes, red algae and green plants (Viridiplantae). By introducing a coherent classification of the different protein families based on both hidden Markov model analyses and structural predictions, numerous new LHC-like sequences were identified and several new families were described, including the red lineage chlorophyll a/b-binding-like protein (RedCAP) family from red algae and diatoms. The test of alternative topologies of sequences of the highly conserved chlorophyll-binding core structure of LHC and PSBS proteins significantly supports the independent origins of LHC and PSBS families via two unrelated internal gene duplication events. This result was confirmed by the application of cluster likelihood mapping. Conclusions The independent evolution of LHC and PSBS families is supported by strong phylogenetic evidence. In addition, a possible origin of LHC and PSBS families from different homologous members of the stress-enhanced protein subfamily, a diverse and anciently paralogous group of two-helix proteins, seems likely. The new hypothesis for the evolution of the extended LHC protein superfamily proposed here is in agreement with the character evolution analysis that incorporates the distribution of families and subfamilies across taxonomic lineages. 
Intriguingly, stress-enhanced proteins, which are universally found in the genomes of green plants, red algae, glaucophytes and in diatoms with complex plastids, could represent an important and previously missing link in the evolution of the extended LHC protein superfamily. PMID:20673336
Identification and characterization of neutrophil extracellular trap shapes in flow cytometry
NASA Astrophysics Data System (ADS)
Ginley, Brandon; Emmons, Tiffany; Sasankan, Prabhu; Urban, Constantin; Segal, Brahm H.; Sarder, Pinaki
2017-03-01
Neutrophil extracellular trap (NET) formation is an alternate immunologic weapon used mainly by neutrophils. Chromatin backbones fused with proteins derived from granules are shot like projectiles onto foreign invaders. It is thought that this mechanism is highly anti-microbial, aids in preventing bacterial dissemination, is used to break down structures several sizes larger than neutrophils themselves, and may have several more uses yet unknown. NETs have been implicated in a wide array of systemic host immune defenses, including sepsis, autoimmune diseases, and cancer. Existing methods used to visually quantify NETotic versus non-NETotic shapes are extremely time-consuming and subject to user bias. These limitations are obstacles to developing NETs as prognostic biomarkers and therapeutic targets. We propose an automated pipeline for quantitatively detecting neutrophil and NET shapes captured using a flow cytometry-imaging system. Our method uses contrast limited adaptive histogram equalization to improve signal intensity in dimly illuminated NETs. From the contrast improved image, fixed value thresholding is applied to convert the image to binary. Feature extraction is performed on the resulting binary image, by calculating region properties of the resulting foreground structures. Classification of the resulting features is performed using Support Vector Machine. Our method classifies NETs from neutrophils without traps at 0.97/0.96 sensitivity/specificity on n = 387 images, and is 1500X faster than manual classification, per sample. Our method can be extended to rapidly analyze whole-slide immunofluorescence tissue images for NET classification, and has potential to streamline the quantification of NETs for patients with diseases associated with cancer and autoimmunity.
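Two of the pipeline steps named above, fixed-value thresholding and calculation of region properties, can be sketched in a few lines. This is a simplified illustration on a small grayscale grid, assuming 4-connected foreground regions and only two features (area and bounding box); the contrast equalization and the Support Vector Machine steps are not shown, and the threshold and pixel values are made up.

```python
def binarize(image, threshold):
    """Fixed-value thresholding: 1 where the pixel exceeds the threshold, else 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]


def region_properties(binary):
    """Area and bounding box of each 4-connected foreground region."""
    rows = len(binary)
    cols = len(binary[0]) if binary else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Flood fill from this unvisited foreground pixel.
            stack = [(r, c)]
            seen[r][c] = True
            pixels = []
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            regions.append({"area": len(pixels), "bbox": (min(ys), min(xs), max(ys), max(xs))})
    return regions


# Toy 4x5 grayscale image with one bright blob and one isolated bright pixel.
example_image = [
    [0, 0, 0, 0, 0],
    [0, 200, 210, 0, 0],
    [0, 190, 0, 0, 90],
    [0, 0, 0, 0, 180],
]
regions = region_properties(binarize(example_image, threshold=100))
```

Each region's features would then form one input vector for the classifier.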
The interactome of CCT complex - A computational analysis.
Narayanan, Aswathy; Pullepu, Dileep; Kabir, M Anaul
2016-10-01
The eukaryotic chaperonin, CCT (Chaperonin Containing TCP1 or TriC-TCP-1 Ring Complex) has been subjected to physical and genetic analyses in S. cerevisiae which can be extrapolated to human CCT (hCCT), owing to its structural and functional similarities with yeast CCT (yCCT). Studies on hCCT and its interactome acquire an additional dimension, as it has been implicated in several disease conditions like neurodegeneration and cancer. We attempt to study its stress response role in general, which will be reflected in the aspects of human diseases and yeast physiology, through computational analysis of the interactome. Towards consolidating and analysing the interactome data, we prepared and compared the unique CCT-interacting protein lists for S. cerevisiae and H. sapiens, and performed GO term classification and enrichment studies, which provide information on the diversity in the CCT interactome in terms of protein classes in the data set. Enrichment with disease-associated proteins and pathways highlights the medical importance of CCT. Different analyses converge, suggesting the significance of WD-repeat proteins, protein kinases and cytoskeletal proteins in the interactome. The prevalence of proteasomal subunits and ribosomal proteins suggests a possible cross-talk between protein-synthesis, folding and degradation machinery. A network of chaperones and chaperonins that function in combination can also be envisaged from the CCT interactome-Hsp70 interactome analysis. Copyright © 2016 Elsevier Ltd. All rights reserved.
Functional Proteomic Analysis of Human Nucleolus
Scherl, Alexander; Couté, Yohann; Déon, Catherine; Callé, Aleth; Kindbeiter, Karine; Sanchez, Jean-Charles; Greco, Anna; Hochstrasser, Denis; Diaz, Jean-Jacques
2002-01-01
The notion of a “plurifunctional” nucleolus is now well established. However, molecular mechanisms underlying the biological processes occurring within this nuclear domain remain only partially understood. As a first step in elucidating these mechanisms we have carried out a proteomic analysis to draw up a list of proteins present within nucleoli of HeLa cells. This analysis allowed the identification of 213 different nucleolar proteins. This catalog complements that of the 271 proteins obtained recently by others, giving a total of ∼350 different nucleolar proteins. Functional classification of these proteins allowed outlining several biological processes taking place within nucleoli. Bioinformatic analyses permitted the assignment of hypothetical functions for 43 proteins for which no functional information is available. Notably, a role in ribosome biogenesis was proposed for 31 proteins. More generally, this functional classification reinforces the plurifunctional nature of nucleoli and provides convincing evidence that nucleoli may play a central role in the control of gene expression. Finally, this analysis supports the recent demonstration of a coupling of transcription and translation in higher eukaryotes. PMID:12429849
Wang, Xinglong; Rak, Rafal; Restificar, Angelo; Nobata, Chikashi; Rupp, C J; Batista-Navarro, Riza Theresa B; Nawaz, Raheel; Ananiadou, Sophia
2011-10-03
The selection of relevant articles for curation, and linking those articles to experimental techniques confirming the findings became one of the primary subjects of the recent BioCreative III contest. The contest's Protein-Protein Interaction (PPI) task consisted of two sub-tasks: Article Classification Task (ACT) and Interaction Method Task (IMT). ACT aimed to automatically select relevant documents for PPI curation, whereas the goal of IMT was to recognise the methods used in experiments for identifying the interactions in full-text articles. We proposed and compared several classification-based methods for both tasks, employing rich contextual features as well as features extracted from external knowledge sources. For IMT, a new method that classifies pair-wise relations between every text phrase and candidate interaction method obtained promising results with an F1 score of 64.49%, as tested on the task's development dataset. We also explored ways to combine this new approach and more conventional, multi-label document classification methods. For ACT, our classifiers exploited automatically detected named entities and other linguistic information. The evaluation results on the BioCreative III PPI test datasets showed that our systems were very competitive: one of our IMT methods yielded the best performance among all participants, as measured by F1 score, Matthews correlation coefficient and AUC iP/R; whereas for ACT, our best classifier was ranked second as measured by AUC iP/R, and also competitive according to other metrics. Our novel approach that converts the multi-class, multi-label classification problem to a binary classification problem showed much promise in IMT. Nevertheless, on the test dataset the best performance was achieved by taking the union of the output of this method and that of a multi-class, multi-label document classifier, which indicates that the two types of systems complement each other in terms of recall. 
For ACT, our system exploited a rich set of features and also obtained encouraging results. We examined the features with respect to their contributions to the classification results, and concluded that contextual words surrounding named entities, as well as the MeSH headings associated with the documents were among the main contributors to the performance.
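Two of the evaluation measures mentioned in this abstract, the F1 score and the Matthews correlation coefficient, can be computed from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the counts are invented for illustration and are unrelated to the BioCreative III results.

```python
import math


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of raw counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical counts for one binary decision (relevant vs. not relevant).
example_f1 = f1_score(tp=8, fp=1, fn=2)
example_mcc = matthews_corrcoef(tp=8, tn=9, fp=1, fn=2)
```

Unlike F1, the Matthews coefficient also uses the true negatives, which is one reason the two measures can rank systems differently.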
Fernández-Suárez, Xosé M; Rigden, Daniel J; Galperin, Michael Y
2014-01-01
The 2014 Nucleic Acids Research Database Issue includes descriptions of 58 new molecular biology databases and recent updates to 123 databases previously featured in NAR or other journals. For convenience, the issue is now divided into eight sections that reflect major subject categories. Among the highlights of this issue are six databases of the transcription factor binding sites in various organisms and updates on such popular databases as CAZy, Database of Genomic Variants (DGV), dbGaP, DrugBank, KEGG, miRBase, Pfam, Reactome, SEED, TCDB and UniProt. There is a strong block of structural databases, which includes, among others, the new RNA Bricks database, updates on PDBe, PDBsum, ArchDB, Gene3D, ModBase, Nucleic Acid Database and the recently revived iPfam database. An update on the NCBI's MMDB describes VAST+, an improved tool for protein structure comparison. Two articles highlight the development of the Structural Classification of Proteins (SCOP) database: one describes SCOPe, which automates assignment of new structures to the existing SCOP hierarchy; the other one describes the first version of SCOP2, with its more flexible approach to classifying protein structures. This issue also includes a collection of articles on bacterial taxonomy and metagenomics, which includes updates on the List of Prokaryotic Names with Standing in Nomenclature (LPSN), Ribosomal Database Project (RDP), the Silva/LTP project and several new metagenomics resources. The NAR online Molecular Biology Database Collection, http://www.oxfordjournals.org/nar/database/c/, has been expanded to 1552 databases. The entire Database Issue is freely available online on the Nucleic Acids Research website (http://nar.oxfordjournals.org/).
Gao, She-Gan; Liu, Rui-Min; Zhao, Yun-Gang; Wang, Pei; Ward, Douglas G.; Wang, Guang-Chao; Guo, Xiang-Qian; Gu, Juan; Niu, Wan-Bin; Zhang, Tian; Martin, Ashley; Guo, Zhi-Peng; Feng, Xiao-Shan; Qi, Yi-Jun; Ma, Yuan-Fang
2016-01-01
Combining MS-based proteomic data with network data and the topological features of such networks would identify more clinically relevant molecules and meaningfully expand the repertoire of proteins derived from MS analysis. The integrative topological indexes representing 95.96% of the information of seven individual topological measures of node proteins were calculated within a protein-protein interaction (PPI) network, built using 244 differentially expressed proteins (DEPs) identified by iTRAQ 2D-LC-MS/MS. Compared with DEPs, differentially expressed genes (DEGs) and comprehensive features (CFs), structurally dominant nodes (SDNs) based on integrative topological index distribution produced comparable classification performance in three different clinical settings using five independent gene expression data sets. The signature molecules of the SDN-based classifier for distinction of early from late clinical TNM stages were enriched in biological traits of protein synthesis, intracellular localization and ribosome biogenesis, which suggests that ribosome biogenesis represents a promising therapeutic target for treating ESCC. In addition, ITGB1 expression selected exclusively by integrative topological measures correlated with clinical stages and prognosis, which was further validated with two independent cohorts of ESCC samples. Thus the integrative topological analysis of PPI networks proposed in this study provides an alternative approach to identify potential biomarkers and therapeutic targets from MS/MS data with functional insights in ESCC. PMID:26898710
21 CFR 866.5270 - C-reactive protein immunological test system.
Code of Federal Regulations, 2013 CFR
2013-04-01
... the C-reactive protein in serum and other body fluids. Measurement of C-reactive protein aids in evaluation of the amount of injury to body tissues. (b) Classification. Class II (performance standards). ....5270 Section 866.5270 Food and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN...
21 CFR 866.5270 - C-reactive protein immunological test system.
Code of Federal Regulations, 2011 CFR
2011-04-01
... the C-reactive protein in serum and other body fluids. Measurement of C-reactive protein aids in evaluation of the amount of injury to body tissues. (b) Classification. Class II (performance standards). ....5270 Section 866.5270 Food and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN...
21 CFR 866.5270 - C-reactive protein immunological test system.
Code of Federal Regulations, 2012 CFR
2012-04-01
... the C-reactive protein in serum and other body fluids. Measurement of C-reactive protein aids in evaluation of the amount of injury to body tissues. (b) Classification. Class II (performance standards). ....5270 Section 866.5270 Food and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN...
21 CFR 866.5270 - C-reactive protein immunological test system.
Code of Federal Regulations, 2014 CFR
2014-04-01
... the C-reactive protein in serum and other body fluids. Measurement of C-reactive protein aids in evaluation of the amount of injury to body tissues. (b) Classification. Class II (performance standards). ....5270 Section 866.5270 Food and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN...
Wise, Michael J
2003-10-29
The late embryogenesis abundant (LEA) proteins cover a number of loosely related groups of proteins, originally found in plants but now being found in non-plant species. Their precise function is unknown, though considerable evidence suggests that LEA proteins are involved in desiccation resistance. Using a number of statistically-based bioinformatics tools the classification of a large set of LEA proteins, covering all Groups, is reexamined together with some previous findings. Searches based on peptide composition return proteins with similar composition to different LEA Groups; keyword clustering is then applied to reveal keywords and phrases suggestive of the Groups' properties. Previous research has suggested that glycine is characteristic of LEA proteins, but it is only highly over-represented in Groups 1 and 2, while alanine, thought characteristic of Group 2, is over-represented in Groups 3, 4 and 6 but under-represented in Groups 1 and 2. However, for LEA Groups 1, 2 and 3 it is shown that glutamine is very significantly over-represented, while cysteine, phenylalanine, isoleucine, leucine and tryptophan are significantly under-represented. There is also evidence that the Group 4 LEA proteins are more appropriately redistributed to Group 2 and Group 3. Similarly, Group 5 is better found among the Group 3 LEA proteins. There is evidence that Group 2 and Group 3 LEA proteins, though distinct, might be related. This relationship is also evident in the overlapping sets of keywords for the two Groups, emphasising alpha-helical structure and, at a larger scale, filaments, all of which fits well with experimental evidence that proteins from both Groups are natively unstructured, but become structured under stress conditions. The keywords support localisation of LEA proteins both in the nucleus and associated with the cytoskeleton, and a mode of action similar to chaperones, perhaps the cold shock chaperones, via a role in DNA-binding.
In general, non-globular and low-complexity proteins, such as the LEA proteins, pose particular challenges in determining their functions and modes of action. Rather than masking off and ignoring low-complexity domains, novel tools and tool combinations are needed which are capable of analysing such proteins in their entirety.
Impact of Growth in the Universe of Subjects on Classification.
ERIC Educational Resources Information Center
Ranganathan, Shiyali Ramamritam
The development of the removal of rigidity in library classification is traced from the Enumerative Classification of DC (1876) through the Nearly-Faceted Classification of UDC (1896), the rigidly, though fully faceted version of CC (1933), the generalized faceted structure of version 2 of CC (1949), down to the Freely Faceted Classification of…
Nepomnyachiy, Sergey; Ben-Tal, Nir; Kolodny, Rachel
2017-01-01
Proteins share similar segments with one another. Such “reused parts”—which have been successfully incorporated into other proteins—are likely to offer an evolutionary advantage over de novo evolved segments, as most of the latter will not even have the capacity to fold. To systematically explore the evolutionary traces of segment “reuse” across proteins, we developed an automated methodology that identifies reused segments from protein alignments. We search for “themes”—segments of at least 35 residues of similar sequence and structure—reused within representative sets of 15,016 domains [Evolutionary Classification of Protein Domains (ECOD) database] or 20,398 chains [Protein Data Bank (PDB)]. We observe that theme reuse is highly prevalent and that reuse is more extensive when the length threshold for identifying a theme is lower. Structural domains, the best characterized form of reuse in proteins, are just one of many complex and intertwined evolutionary traces. Others include long themes shared among a few proteins, which encompass and overlap with shorter themes that recur in numerous proteins. The observed complexity is consistent with evolution by duplication and divergence, and some of the themes might include descendants of ancestral segments. The observed recursive footprints, where the same amino acid can simultaneously participate in several intertwined themes, could be a useful concept for protein design. Data are available at http://trachel-srv.cs.haifa.ac.il/rachel/ppi/themes/. PMID:29078314
33 CFR 67.01-15 - Classification of structures.
Code of Federal Regulations, 2011 CFR
2011-07-01
... 33 Navigation and Navigable Waters 1 2011-07-01 2011-07-01 false Classification of structures. 67.01-15 Section 67.01-15 Navigation and Navigable Waters COAST GUARD, DEPARTMENT OF HOMELAND SECURITY AIDS TO NAVIGATION AIDS TO NAVIGATION ON ARTIFICIAL ISLANDS AND FIXED STRUCTURES General Requirements...
Gordon, Larry M.; Nisthal, Alex; Lee, Andy B.; Eskandari, Sepehr; Ruchala, Piotr; Jung, Chun-Ling; Waring, Alan J.; Mobley, Patrick W.
2008-01-01
Given their high alanine and glycine levels, plaque formation, α-helix to β-sheet interconversion and fusogenicity, FP (i.e., the N-terminal fusion peptide of HIV-1 gp41; 23 residues) and amyloids were proposed as belonging to the same protein superfamily. Here, we further test whether FP may exhibit ‘amyloid-like’ characteristics, by contrasting its structural and functional properties with those of Aβ(26–42), a 17-residue peptide from the C-terminus of the amyloid-beta protein responsible for Alzheimer’s. FTIR spectroscopy, electron microscopy, light scattering and predicted amyloid structure aggregation (PASTA) indicated that aqueous FP and Aβ(26–42) formed similar networked β-sheet fibrils, although the FP fibril interactions were weaker. FP and Aβ(26–42) both lysed and aggregated human erythrocytes, with the hemolysis-onsets correlated with the conversion of α-helix to β-sheet for each peptide in liposomes. Congo red (CR), a marker of amyloid plaques in situ, similarly inhibited either FP- or Aβ(26–42)-induced hemolysis, and surface plasmon resonance indicated that this may be due to direct CR-peptide binding. These findings suggest that membrane-bound β-sheets of FP may contribute to the cytopathicity of HIV in vivo through an amyloid-type mechanism, and support the classification of HIV-1 FP as an ‘amyloid homolog’ (or ‘amylog’). PMID:18515070
NASA Astrophysics Data System (ADS)
Sarti, E.; Zamuner, S.; Cossio, P.; Laio, A.; Seno, F.; Trovato, A.
2013-12-01
In protein structure prediction it is of crucial importance, especially at the refinement stage, to score efficiently large sets of models by selecting the ones that are closest to the native state. We here present a new computational tool, BACHSCORE, that allows its users to rank different structural models of the same protein according to their quality, evaluated by using the BACH++ (Bayesian Analysis Conformation Hunt) scoring function. The original BACH statistical potential was already shown to discriminate with very good reliability the protein native state in large sets of misfolded models of the same protein. BACH++ features a novel upgrade in the solvation potential of the scoring function, now computed by adapting the LCPO (Linear Combination of Pairwise Orbitals) algorithm. This change further enhances the already good performance of the scoring function. BACHSCORE can be accessed directly through the web server: bachserver.pd.infn.it. Catalogue identifier: AEQD_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEQD_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: GNU General Public License version 3 No. of lines in distributed program, including test data, etc.: 130159 No. of bytes in distributed program, including test data, etc.: 24 687 455 Distribution format: tar.gz Programming language: C++. Computer: Any computer capable of running an executable produced by a g++ compiler (4.6.3 version). Operating system: Linux, Unix OS-es. RAM: 1 073 741 824 bytes Classification: 3. Nature of problem: Evaluate the quality of a protein structural model, taking into account the possible “a priori” knowledge of a reference primary sequence that may be different from the amino-acid sequence of the model; the native protein structure should be recognized as the best model. 
Solution method: The contact potential scores the occurrence of any given type of residue pair in 5 possible contact classes (α-helical contact, parallel β-sheet contact, anti-parallel β-sheet contact, side-chain contact, no contact). The solvation potential scores the occurrence of any residue type in 2 possible environments: buried and solvent exposed. Residue environment is assigned by adapting the LCPO algorithm. Residues present in the reference primary sequence and not present in the model structure contribute to the model score as solvent exposed and as non contacting all other residues. Restrictions: Input format file according to the Protein Data Bank standard Additional comments: Parameter values used in the scoring function can be found in the file /folder-to-bachscore/BACH/examples/bach_std.par. Running time: Roughly one minute to score one hundred structures on a desktop PC, depending on their size.
NASA Astrophysics Data System (ADS)
Rivas-Ubach, A.; Liu, Y.; Bianchi, T. S.; Tolic, N.; Jansson, C.; Paša-Tolić, L.
2017-12-01
The role of nutrients in organisms, especially primary producers, has been a topic of special interest in ecosystem research for understanding ecosystem structure and function. The majority of macro-elements in organisms, such as C, H, O, N and P, do not act as single elements but are components of organic compounds (lipids, peptides, carbohydrates, etc.), which are more directly related to the physiology of organisms and thus to ecosystem function. However, accurately deciphering the overall content of the main compound classes (lipids, proteins, carbohydrates,…) in organisms is still a major challenge. van Krevelen (vK) diagrams have been widely used to estimate the main compound categories present in environmental samples based on O:C vs H:C molecular ratios, but a stoichiometric classification based exclusively on O:C and H:C ratios is weak. Different compound classes overlap substantially in their O:C and H:C ratios, and other heteroatoms, such as N and P, should be considered to robustly distinguish the different classes. We propose a new compound classification for biological/environmental samples based on the C:H:O:N:P stoichiometric ratios of thousands of molecular formulas of characterized compounds from 6 different main categories: lipids, peptides, amino-sugars, carbohydrates, nucleotides and phytochemical compounds (oxy-aromatic compounds). This new multidimensional stoichiometric compound constraints classification (MSCC) can be applied to data obtained with high resolution mass spectrometry (HRMS), allowing an accurate overview of the relative abundances of the main compound categories present in organismal samples. The MSCC has been optimized for plants, but it could also be applied to different organisms and serve as a strong starting point to further investigate other complex environmental matrices (soils, aerosols, etc.).
The proposed MSCC advances environmental research, especially eco-metabolomics, ecophysiology and ecological stoichiometry studies, providing a new tool to understand the ecosystem structure and function at the molecular level.
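The stoichiometric ratios that van Krevelen diagrams and the MSCC rely on can be computed directly from a molecular formula. The following is a minimal sketch, assuming simple formulas without brackets or charges; the function names are hypothetical, and the actual MSCC category boundaries are not given in the abstract, so no classification thresholds are included.

```python
import re
from collections import Counter


def element_counts(formula):
    """Count atoms in a simple molecular formula such as 'C6H12O6'.

    Assumption: no brackets, charges, or isotope labels.
    """
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def stoichiometric_ratios(formula):
    """Return the H:C, O:C, N:C and P:C atomic ratios of a formula."""
    counts = element_counts(formula)
    carbon = counts["C"]
    if carbon == 0:
        raise ValueError("formula contains no carbon")
    # Missing elements count as zero, so their ratio is 0.0.
    return {f"{el}:C": counts[el] / carbon for el in ("H", "O", "N", "P")}


if __name__ == "__main__":
    print(stoichiometric_ratios("C6H12O6"))   # glucose
    print(stoichiometric_ratios("C3H7NO2"))   # alanine
```

A vK diagram would plot only the O:C and H:C values; the N:C and P:C values are the extra dimensions the abstract argues are needed to separate overlapping classes.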
Granular support vector machines with association rules mining for protein homology prediction.
Tang, Yuchun; Jin, Bo; Zhang, Yan-Qing
2005-01-01
Protein homology prediction between protein sequences is one of the critical problems in computational biology. Such a complex classification problem is common in medical or biological information processing applications. How to build a model with superior generalization capability from training samples is an essential issue for mining knowledge to accurately predict/classify unseen new samples and to effectively support human experts to make correct decisions. A new learning model called granular support vector machines (GSVM) is proposed based on our previous work. GSVM systematically and formally combines the principles from statistical learning theory and granular computing theory and thus provides an interesting new mechanism to address complex classification problems. It works by building a sequence of information granules and then building support vector machines (SVM) in some of these information granules on demand. A good granulation method to find suitable granules is crucial for modeling a GSVM with good performance. In this paper, we also propose an association rules-based granulation method. For the granules induced by association rules with high enough confidence and significant support, we leave them as they are because of their high "purity" and significant effect on simplifying the classification task. For every other granule, an SVM is modeled to discriminate the corresponding data. In this way, a complex classification problem is divided into multiple smaller problems so that the learning task is simplified. The proposed algorithm, here named GSVM-AR, is compared with SVM on the KDDCUP04 protein homology prediction data. The experimental results show that finding the splitting hyperplane is not a trivial task (the association rules must be selected carefully to avoid overfitting) and that GSVM-AR does show significant improvement compared to building one single SVM in the whole feature space.
Another advantage is that GSVM-AR is easy to implement, which makes it practical to use. More importantly and more interestingly, GSVM provides a new mechanism to address complex classification problems.
ACLAME: a CLAssification of Mobile genetic Elements, update 2010.
Leplae, Raphaël; Lima-Mendez, Gipsi; Toussaint, Ariane
2010-01-01
The ACLAME database is dedicated to the collection, analysis and classification of sequenced mobile genetic elements (MGEs, in particular phages and plasmids). In addition to providing information on MGE content, classifications are available at various levels of organization. At the gene/protein level, families group similar sequences that are expected to share the same function. Families of four or more proteins are manually assigned a functional annotation using the GeneOntology and the locally developed ontology MeGO dedicated to MGEs. At the genome level, evolutionary cohesive modules group sets of protein families shared among MGEs. At the population level, networks display the reticulate evolutionary relationships among MGEs. To increase the coverage of the phage sequence space, ACLAME version 0.4 incorporates 760 high-quality predicted prophages selected from the Prophinder database. Most of the data can be downloaded from the freely accessible ACLAME web site (http://aclame.ulb.ac.be). The BLAST interface for querying the database has been extended and numerous tools for in-depth analysis of the results have been added.
Implicit structured sequence learning: an fMRI study of the structural mere-exposure effect
Folia, Vasiliki; Petersson, Karl Magnus
2014-01-01
In this event-related fMRI study we investigated the effect of 5 days of implicit acquisition on preference classification by means of an artificial grammar learning (AGL) paradigm based on the structural mere-exposure effect and preference classification using a simple right-linear unification grammar. This allowed us to investigate implicit AGL in a proper learning design by including baseline measurements prior to grammar exposure. After 5 days of implicit acquisition, the fMRI results showed activations in a network of brain regions including the inferior frontal (centered on BA 44/45) and the medial prefrontal regions (centered on BA 8/32). Importantly, and central to this study, the inclusion of a naive preference fMRI baseline measurement allowed us to conclude that these fMRI findings were the intrinsic outcomes of the learning process itself and not a reflection of a preexisting functionality recruited during classification, independent of acquisition. Support for the implicit nature of the knowledge utilized during preference classification on day 5 comes from the fact that the basal ganglia, associated with implicit procedural learning, were activated during classification, while the medial temporal lobe system, associated with explicit declarative memory, was consistently deactivated. Thus, preference classification in combination with structural mere-exposure can be used to investigate structural sequence processing (syntax) in unsupervised AGL paradigms with proper learning designs. PMID:24550865
The property distance index PD predicts peptides that cross-react with IgE antibodies
Ivanciuc, Ovidiu; Midoro-Horiuti, Terumi; Schein, Catherine H.; Xie, Liping; Hillman, Gilbert R.; Goldblum, Randall M.; Braun, Werner
2009-01-01
Similarities in the sequence and structure of allergens can explain clinically observed cross-reactivities. Distinguishing sequences that bind IgE in patient sera can be used to identify potentially allergenic protein sequences and aid in the design of hypo-allergenic proteins. The property distance index PD, incorporated in our Structural Database of Allergenic Proteins (SDAP, http://fermi.utmb.edu/SDAP/), may identify potentially cross-reactive segments of proteins, based on their similarity to known IgE epitopes. We sought to obtain experimental validation of the PD index as a quantitative predictor of IgE cross-reactivity, by designing peptide variants with predetermined PD scores relative to three linear IgE epitopes of Jun a 1, the dominant allergen from mountain cedar pollen. For each of the three epitopes, 60 peptides were designed with increasing PD values (decreasing physicochemical similarity) to the starting sequence. The peptides synthesized on a derivatized cellulose membrane were probed with sera from patients who were allergic to Jun a 1, and the experimental data were interpreted with a PD classification method. Peptides with low PD values relative to a given epitope were more likely to bind IgE from the sera than were those with PD values larger than 6. Control sequences, with PD values between 18 and 20 to all the three epitopes, did not bind patient IgE, thus validating our procedure for identifying negative control peptides. The PD index is a statistically validated method to detect discrete regions of proteins that have a high probability of cross-reacting with IgE from allergic patients. PMID:18950868
Semantic Structures of One-Step Word Problems Involving Multiplication or Division.
ERIC Educational Resources Information Center
Schmidt, Siegbert; Weiser, Werner
1995-01-01
Proposes a four-category classification of semantic structures of one-step word problems involving multiplication and division: forming the n-th multiple of measures, combinatorial multiplication, composition of operators, and multiplication by formula. This classification is compatible with semantic structures of addition and subtraction word…
Federal Register 2010, 2011, 2012, 2013, 2014
2011-11-23
... positive nitrogen balance and protein metabolism, resulting in increases in protein synthesis and lean body... nitrogen balance and androgenic activity based on weight changes of the ventral prostate of prostanozol...
NASA Astrophysics Data System (ADS)
Omenzetter, Piotr; de Lautour, Oliver R.
2010-04-01
Developed for studying long, periodic records of various measured quantities, time series analysis methods are inherently suited to Structural Health Monitoring (SHM) applications and offer interesting possibilities for them. However, their use in SHM can still be regarded as an emerging application and deserves more study. In this research, Autoregressive (AR) models were used to fit experimental acceleration time histories from two experimental structural systems, a 3-storey bookshelf-type laboratory structure and the ASCE Phase II SHM Benchmark Structure, in healthy and several damaged states. The coefficients of the AR models were chosen as damage sensitive features. Preliminary visual inspection of the large, multidimensional sets of AR coefficients to check the presence of clusters corresponding to different damage severities was achieved using Sammon mapping, an efficient nonlinear data compression technique. Systematic classification of damage into states based on the analysis of the AR coefficients was achieved using two supervised classification techniques, Nearest Neighbor Classification (NNC) and Learning Vector Quantization (LVQ), and one unsupervised technique, Self-organizing Maps (SOM). This paper discusses the performance of AR coefficients as damage sensitive features and compares the efficiency of the three classification techniques using experimental data.
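The idea of using AR coefficients as damage sensitive features and classifying them by nearest neighbor can be illustrated with a small sketch. It is not the authors' implementation: the model order (2), the least-squares fit, the reference coefficient values, and all function names are assumptions chosen to keep the example short, and the simulated series is noise-free test data rather than measured acceleration.

```python
def fit_ar2(series):
    """Least-squares fit of x[t] = a1*x[t-1] + a2*x[t-2]; returns (a1, a2)."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(2, len(series)):
        y, u, v = series[t], series[t - 1], series[t - 2]
        s11 += u * u
        s12 += u * v
        s22 += v * v
        b1 += u * y
        b2 += v * y
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    if det == 0.0:
        raise ValueError("series does not determine an AR(2) model")
    a1 = (b1 * s22 - b2 * s12) / det
    a2 = (b2 * s11 - b1 * s12) / det
    return a1, a2


def nearest_neighbor(features, labeled_references):
    """Return the label of the reference vector closest to `features`."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return min(labeled_references, key=lambda ref: sq_dist(features, ref[0]))[1]


def simulate_ar2(a1, a2, n, x0=1.0, x1=0.5):
    """Noise-free AR(2) series, used here only to exercise the sketch."""
    xs = [x0, x1]
    while len(xs) < n:
        xs.append(a1 * xs[-1] + a2 * xs[-2])
    return xs


if __name__ == "__main__":
    # Hypothetical reference features for two structural states.
    references = [((0.5, -0.3), "healthy"), ((1.2, -0.8), "damaged")]
    record = simulate_ar2(0.52, -0.31, 30)
    print(nearest_neighbor(fit_ar2(record), references))
```

In practice the AR order would be higher and chosen by a model selection criterion, and the references would be coefficient vectors fitted to records from known structural states.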
Noninvasive diagnosis of intraamniotic infection: proteomic biomarkers in vaginal fluid.
Hitti, Jane; Lapidus, Jodi A; Lu, Xinfang; Reddy, Ashok P; Jacob, Thomas; Dasari, Surendra; Eschenbach, David A; Gravett, Michael G; Nagalla, Srinivasa R
2010-07-01
We analyzed the vaginal fluid proteome to identify biomarkers of intraamniotic infection among women in preterm labor. Proteome analysis was performed on vaginal fluid specimens from women with preterm labor, using multidimensional liquid chromatography, tandem mass spectrometry, and label-free quantification. Enzyme immunoassays were used to quantify candidate proteins. Classification accuracy for intraamniotic infection (positive amniotic fluid bacterial culture and/or interleukin-6 >2 ng/mL) was evaluated using receiver-operator characteristic curves obtained by logistic regression. Of 170 subjects, 30 (18%) had intraamniotic infection. Vaginal fluid proteome analysis revealed 338 unique proteins. Label-free quantification identified 15 proteins differentially expressed in intraamniotic infection, including acute-phase reactants, immune modulators, high-abundance amniotic fluid proteins and extracellular matrix-signaling factors; these findings were confirmed by enzyme immunoassay. A multi-analyte algorithm showed accurate classification of intraamniotic infection. Vaginal fluid proteome analyses identified proteins capable of discriminating between patients with and without intraamniotic infection. Copyright (c) 2010 Mosby, Inc. All rights reserved.
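Classification accuracy summarized by a receiver-operator characteristic curve is often reported as the area under the curve (AUC). The sketch below computes the AUC from two lists of classifier scores using the pairwise (Mann-Whitney) formulation. The function name and the example scores are hypothetical; in the study the scores would come from the fitted logistic regression, which is not reproduced here.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case, with ties counted as half.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Made-up scores for illustration only.
    infected = [0.9, 0.8, 0.4]
    uninfected = [0.7, 0.3, 0.2, 0.1]
    print(roc_auc(infected, uninfected))
```

The quadratic loop is adequate for a cohort of this size (170 subjects); a rank-based version would scale better for large data sets.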
Pang, Erli; Wu, Xiaomei; Lin, Kui
2016-06-01
Protein evolution plays an important role in the evolution of each genome. Because proteins are functional molecules, most of their parts or sites are, in general, under different selective constraints, particularly purifying selection. Most previous studies on protein evolution considered individual proteins in their entirety or compared protein-coding sequences with non-coding sequences. Less attention has been paid to the evolution of different parts within each protein of a given genome. To this end, based on PfamA annotation of all human proteins, each protein sequence can be split into two parts: domains and unassigned regions. Using this rationale, single nucleotide polymorphisms (SNPs) in protein-coding sequences from the 1000 Genomes Project were mapped according to two classifications: SNPs occurring within protein domains and those within unassigned regions. With these classifications, we found that the density of synonymous SNPs within domains is significantly greater than that of synonymous SNPs within unassigned regions, whereas the density of non-synonymous SNPs shows the opposite pattern. We also found signatures of purifying selection on both the domains and the unassigned regions. Furthermore, the selective strength on domains is significantly greater than that on unassigned regions. In addition, among all of the human protein sequences, there are 117 PfamA domains in which no SNPs are found. Our results highlight an important aspect of protein domains and may contribute to our understanding of protein evolution.
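The comparison of SNP density inside domains with SNP density in unassigned regions can be sketched for a single protein as follows. This is an illustrative simplification, not the study's pipeline: the function name, the interval convention (1-based, inclusive residue coordinates), and the example numbers are assumptions, and the same function would be applied separately to synonymous and non-synonymous SNPs.

```python
def snp_densities(protein_length, domains, snp_positions):
    """Return (domain_density, unassigned_density) in SNPs per residue.

    domains: list of (start, end) residue intervals, 1-based and inclusive;
             overlapping intervals are merged by collecting residues in a set.
    snp_positions: 1-based residue positions carrying a SNP.
    """
    domain_residues = set()
    for start, end in domains:
        domain_residues.update(range(start, end + 1))
    n_domain = len(domain_residues)
    n_unassigned = protein_length - n_domain
    in_domain = sum(1 for pos in snp_positions if pos in domain_residues)
    in_unassigned = len(snp_positions) - in_domain
    # A region with no residues is reported with density 0.0.
    domain_density = in_domain / n_domain if n_domain else 0.0
    unassigned_density = in_unassigned / n_unassigned if n_unassigned else 0.0
    return domain_density, unassigned_density


if __name__ == "__main__":
    # Hypothetical protein: 100 residues, one domain covering residues 10-39.
    print(snp_densities(100, [(10, 39)], [12, 20, 50, 90, 95]))
```

A genome-wide analysis would sum SNP counts and region lengths over all proteins before dividing, and would then test the difference between the two densities statistically.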
GenoMycDB: a database for comparative analysis of mycobacterial genes and genomes.
Catanho, Marcos; Mascarenhas, Daniel; Degrave, Wim; Miranda, Antonio Basílio de
2006-03-31
Several databases and computational tools have been created with the aim of organizing, integrating and analyzing the wealth of information generated by large-scale sequencing projects of mycobacterial genomes and those of other organisms. However, with very few exceptions, these databases and tools do not allow for massive and/or dynamic comparison of these data. GenoMycDB (http://www.dbbm.fiocruz.br/GenoMycDB) is a relational database built for large-scale comparative analyses of completely sequenced mycobacterial genomes, based on their predicted protein content. Its central structure is composed of the results obtained after pair-wise sequence alignments among all the predicted proteins coded by the genomes of six mycobacteria: Mycobacterium tuberculosis (strains H37Rv and CDC1551), M. bovis AF2122/97, M. avium subsp. paratuberculosis K10, M. leprae TN, and M. smegmatis MC2 155. The database stores the computed similarity parameters of every aligned pair, providing for each protein sequence the predicted subcellular localization, the assigned cluster of orthologous groups, the features of the corresponding gene, and links to several important databases. Tables containing pairs or groups of potential homologs between selected species/strains can be produced dynamically by user-defined criteria, based on one or multiple sequence similarity parameters. In addition, searches can be restricted according to the predicted subcellular localization of the protein, the DNA strand of the corresponding gene and/or the description of the protein. Massive data search and/or retrieval are available, and different ways of exporting the result are offered. GenoMycDB provides an on-line resource for the functional classification of mycobacterial proteins as well as for the analysis of genome structure, organization, and evolution.
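Selecting potential homologs by user-defined thresholds on several similarity parameters is a natural fit for a relational query. The sketch below uses an in-memory SQLite database to show the idea. The table name, column names, identifiers, and threshold values are all hypothetical and are not taken from the GenoMycDB schema.

```python
import sqlite3


def build_demo_db(rows):
    """Create an in-memory table of pairwise alignment results.

    Hypothetical schema: one row per aligned protein pair.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE aligned_pairs ("
        "query_id TEXT, subject_id TEXT, "
        "identity_pct REAL, coverage_pct REAL, evalue REAL)"
    )
    conn.executemany("INSERT INTO aligned_pairs VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def potential_homologs(conn, min_identity, min_coverage, max_evalue):
    """Return (query_id, subject_id) pairs passing all three thresholds."""
    cursor = conn.execute(
        "SELECT query_id, subject_id FROM aligned_pairs "
        "WHERE identity_pct >= ? AND coverage_pct >= ? AND evalue <= ? "
        "ORDER BY query_id, subject_id",
        (min_identity, min_coverage, max_evalue),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    # Placeholder identifiers and values, for illustration only.
    demo_rows = [
        ("p1", "q1", 85.0, 95.0, 1e-50),
        ("p2", "q2", 30.0, 90.0, 1e-5),
        ("p3", "q3", 70.0, 40.0, 1e-20),
    ]
    conn = build_demo_db(demo_rows)
    print(potential_homologs(conn, 50.0, 80.0, 1e-10))
```

Further filters described in the abstract, such as predicted subcellular localization or DNA strand, would be additional columns and WHERE conditions in the same style.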