Cui, Xuefeng; Lu, Zhiwu; Wang, Sheng; Jing-Yan Wang, Jim; Gao, Xin
2016-06-15
Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information. We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration. We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods. Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx : xin.gao@kaust.edu.sa Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Computationally mapping sequence space to understand evolutionary protein engineering.
Armstrong, Kathryn A; Tidor, Bruce
2008-01-01
Evolutionary protein engineering has been dramatically successful, producing a wide variety of new proteins with altered stability, binding affinity, and enzymatic activity. However, the success of such procedures is often unreliable, and the impact of the choice of protein, engineering goal, and evolutionary procedure is not well understood. We have created a framework for understanding aspects of the protein engineering process by computationally mapping regions of feasible sequence space for three small proteins using structure-based design protocols. We then tested the ability of different evolutionary search strategies to explore these sequence spaces. The results point to a non-intuitive relationship between the error-prone PCR mutation rate and the number of rounds of replication. The evolutionary relationships among feasible sequences reveal hub-like sequences that serve as particularly fruitful starting sequences for evolutionary search. Moreover, genetic recombination procedures were examined, and tradeoffs relating sequence diversity and search efficiency were identified. This framework allows us to consider the impact of protein structure on the allowed sequence space and therefore on the challenges that each protein presents to error-prone PCR and genetic recombination procedures.
Naranjo, Yandi; Pons, Miquel; Konrat, Robert
2012-01-01
The number of existing protein sequences spans a very small fraction of sequence space. Natural proteins have overcome a strong negative selective pressure to avoid the formation of insoluble aggregates. Stably folded globular proteins and intrinsically disordered proteins (IDPs) use alternative solutions to the aggregation problem. While in globular proteins folding minimizes the access to aggregation prone regions, IDPs on average display large exposed contact areas. Here, we introduce the concept of average meta-structure correlation maps to analyze sequence space. Using this novel conceptual view we show that representative ensembles of folded and ID proteins show distinct characteristics and respond differently to sequence randomization. By studying the way evolutionary constraints act on IDPs to disable a negative function (aggregation) we might gain insight into the mechanisms by which function-enabling information is encoded in IDPs.
Dynamics of domain coverage of the protein sequence universe.
Rekapalli, Bhanu; Wuichet, Kristin; Peterson, Gregory D; Zhulin, Igor B
2012-11-16
The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its "dark matter". Here we suggest that true size of "dark matter" is much larger than stated by current definitions. We propose an approach to reducing the size of "dark matter" by identifying and subtracting regions in protein sequences that are not likely to contain any domain. Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of "dark matter"; however, its absolute size increases substantially with the growth of sequence data.
Use of designed sequences in protein structure recognition.
Kumar, Gayatri; Mudgal, Richa; Srinivasan, Narayanaswamy; Sandhya, Sankaran
2018-05-09
Knowledge of the protein structure is a pre-requisite for improved understanding of molecular function. The gap in the sequence-structure space has increased in the post-genomic era. Grouping related protein sequences into families can aid in narrowing the gap. In the Pfam database, structure description is provided for part or full-length proteins of 7726 families. For the remaining 52% of the families, information on 3-D structure is not yet available. We use the computationally designed sequences that are intermediately related to two protein domain families, which are already known to share the same fold. These strategically designed sequences enable detection of distant relationships and here, we have employed them for the purpose of structure recognition of protein families of yet unknown structure. We first measured the success rate of our approach using a dataset of protein families of known fold and achieved a success rate of 88%. Next, for 1392 families of yet unknown structure, we made structural assignments for part/full length of the proteins. Fold association for 423 domains of unknown function (DUFs) are provided as a step towards functional annotation. The results indicate that knowledge-based filling of gaps in protein sequence space is a lucrative approach for structure recognition. Such sequences assist in traversal through protein sequence space and effectively function as 'linkers', where natural linkers between distant proteins are unavailable. This article was reviewed by Oliviero Carugo, Christine Orengo and Srikrishna Subramanian.
Dynamics of domain coverage of the protein sequence universe
2012-01-01
Background The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”. Results Here we suggest that true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain. Conclusions Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of “dark matter”; however, its absolute size increases substantially with the growth of sequence data. PMID:23157439
2014-01-01
Background Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons allow detecting close homologs for a protein of interest to transfer functional information from the homologs to the given protein. Sequence-based comparison cannot detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. Structure-based comparisons can detect remote homologs but most methods for doing so are too expensive to apply at a large scale over structural databases of proteins. Recently, fragment-based structural representations have been proposed that allow fast detection of remote homologs with reasonable accuracy. These representations have also been used to obtain linearly-reducible maps of protein structure space. It has been shown, as additionally supported from analysis in this paper that such maps preserve functional co-localization of the protein structure space. Methods Inspired by a recent application of the Latent Dirichlet Allocation (LDA) model for conducting structural comparisons of proteins, we propose higher-order LDA-obtained topic-based representations of protein structures to provide an alternative route for remote homology detection and organization of the protein structure space in few dimensions. Various techniques based on natural language processing are proposed and employed to aid the analysis of topics in the protein structure domain. Results We show that a topic-based representation is just as effective as a fragment-based one at automated detection of remote homologs and organization of protein structure space. We conduct a detailed analysis of the information content in the topic-based representation, showing that topics have semantic meaning. The fragment-based and topic-based representations are also shown to allow prediction of superfamily membership. Conclusions This work opens exciting venues in designing novel representations to extract information about protein structures, as well as organizing and mining protein structure space with mature text mining tools. PMID:25080993
Sequence space and the ongoing expansion of the protein universe.
Povolotskaya, Inna S; Kondrashov, Fyodor A
2010-06-17
The need to maintain the structural and functional integrity of an evolving protein severely restricts the repertoire of acceptable amino-acid substitutions. However, it is not known whether these restrictions impose a global limit on how far homologous protein sequences can diverge from each other. Here we explore the limits of protein evolution using sequence divergence data. We formulate a computational approach to study the rate of divergence of distant protein sequences and measure this rate for ancient proteins, those that were present in the last universal common ancestor. We show that ancient proteins are still diverging from each other, indicating an ongoing expansion of the protein sequence universe. The slow rate of this divergence is imposed by the sparseness of functional protein sequences in sequence space and the ruggedness of the protein fitness landscape: approximately 98 per cent of sites cannot accept an amino-acid substitution at any given moment but a vast majority of all sites may eventually be permitted to evolve when other, compensatory, changes occur. Thus, approximately 3.5 x 10(9) yr has not been enough to reach the limit of divergent evolution of proteins, and for most proteins the limit of sequence similarity imposed by common function may not exceed that of random sequences.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Murphy, Grant S.; Mills, Jeffrey L.; Miley, Michael J.
2015-10-15
Protein design tests our understanding of protein stability and structure. Successful design methods should allow the exploration of sequence space not found in nature. However, when redesigning naturally occurring protein structures, most fixed backbone design algorithms return amino acid sequences that share strong sequence identity with wild-type sequences, especially in the protein core. This behavior places a restriction on functional space that can be explored and is not consistent with observations from nature, where sequences of low identity have similar structures. Here, we allow backbone flexibility during design to mutate every position in the core (38 residues) of a four-helixmore » bundle protein. Only small perturbations to the backbone, 12 {angstrom}, were needed to entirely mutate the core. The redesigned protein, DRNN, is exceptionally stable (melting point >140C). An NMR and X-ray crystal structure show that the side chains and backbone were accurately modeled (all-atom RMSD = 1.3 {angstrom}).« less
Evolutionary Dynamics on Protein Bi-stability Landscapes can Potentially Resolve Adaptive Conflicts
Sikosek, Tobias; Bornberg-Bauer, Erich; Chan, Hue Sun
2012-01-01
Experimental studies have shown that some proteins exist in two alternative native-state conformations. It has been proposed that such bi-stable proteins can potentially function as evolutionary bridges at the interface between two neutral networks of protein sequences that fold uniquely into the two different native conformations. Under adaptive conflict scenarios, bi-stable proteins may be of particular advantage if they simultaneously provide two beneficial biological functions. However, computational models that simulate protein structure evolution do not yet recognize the importance of bi-stability. Here we use a biophysical model to analyze sequence space to identify bi-stable or multi-stable proteins with two or more equally stable native-state structures. The inclusion of such proteins enhances phenotype connectivity between neutral networks in sequence space. Consideration of the sequence space neighborhood of bridge proteins revealed that bi-stability decreases gradually with each mutation that takes the sequence further away from an exactly bi-stable protein. With relaxed selection pressures, we found that bi-stable proteins in our model are highly successful under simulated adaptive conflict. Inspired by these model predictions, we developed a method to identify real proteins in the PDB with bridge-like properties, and have verified a clear bi-stability gradient for a series of mutants studied by Alexander et al. (Proc Nat Acad Sci USA 2009, 106:21149–21154) that connect two sequences that fold uniquely into two different native structures via a bridge-like intermediate mutant sequence. Based on these findings, new testable predictions for future studies on protein bi-stability and evolution are discussed. PMID:23028272
Quantiprot - a Python package for quantitative analysis of protein sequences.
Konopka, Bogumił M; Marciniak, Marta; Dyrka, Witold
2017-07-17
The field of protein sequence analysis is dominated by tools rooted in substitution matrices and alignments. A complementary approach is provided by methods of quantitative characterization. A major advantage of the approach is that quantitative properties defines a multidimensional solution space, where sequences can be related to each other and differences can be meaningfully interpreted. Quantiprot is a software package in Python, which provides a simple and consistent interface to multiple methods for quantitative characterization of protein sequences. The package can be used to calculate dozens of characteristics directly from sequences or using physico-chemical properties of amino acids. Besides basic measures, Quantiprot performs quantitative analysis of recurrence and determinism in the sequence, calculates distribution of n-grams and computes the Zipf's law coefficient. We propose three main fields of application of the Quantiprot package. First, quantitative characteristics can be used in alignment-free similarity searches, and in clustering of large and/or divergent sequence sets. Second, a feature space defined by quantitative properties can be used in comparative studies of protein families and organisms. Third, the feature space can be used for evaluating generative models, where large number of sequences generated by the model can be compared to actually observed sequences.
Physical Model of the Genotype-to-Phenotype Map of Proteins
NASA Astrophysics Data System (ADS)
Tlusty, Tsvi; Libchaber, Albert; Eckmann, Jean-Pierre
2017-04-01
How DNA is mapped to functional proteins is a basic question of living matter. We introduce and study a physical model of protein evolution which suggests a mechanical basis for this map. Many proteins rely on large-scale motion to function. We therefore treat protein as learning amorphous matter that evolves towards such a mechanical function: Genes are binary sequences that encode the connectivity of the amino acid network that makes a protein. The gene is evolved until the network forms a shear band across the protein, which allows for long-range, soft modes required for protein function. The evolution reduces the high-dimensional sequence space to a low-dimensional space of mechanical modes, in accord with the observed dimensional reduction between genotype and phenotype of proteins. Spectral analysis of the space of 1 06 solutions shows a strong correspondence between localization around the shear band of both mechanical modes and the sequence structure. Specifically, our model shows how mutations are correlated among amino acids whose interactions determine the functional mode.
Rattei, Thomas; Tischler, Patrick; Götz, Stefan; Jehl, Marc-André; Hoser, Jonathan; Arnold, Roland; Conesa, Ana; Mewes, Hans-Werner
2010-01-01
The prediction of protein function as well as the reconstruction of evolutionary genesis employing sequence comparison at large is still the most powerful tool in sequence analysis. Due to the exponential growth of the number of known protein sequences and the subsequent quadratic growth of the similarity matrix, the computation of the Similarity Matrix of Proteins (SIMAP) becomes a computational intensive task. The SIMAP database provides a comprehensive and up-to-date pre-calculation of the protein sequence similarity matrix, sequence-based features and sequence clusters. As of September 2009, SIMAP covers 48 million proteins and more than 23 million non-redundant sequences. Novel features of SIMAP include the expansion of the sequence space by including databases such as ENSEMBL as well as the integration of metagenomes based on their consistent processing and annotation. Furthermore, protein function predictions by Blast2GO are pre-calculated for all sequences in SIMAP and the data access and query functions have been improved. SIMAP assists biologists to query the up-to-date sequence space systematically and facilitates large-scale downstream projects in computational biology. Access to SIMAP is freely provided through the web portal for individuals (http://mips.gsf.de/simap/) and for programmatic access through DAS (http://webclu.bio.wzw.tum.de/das/) and Web-Service (http://mips.gsf.de/webservices/services/SimapService2.0?wsdl).
Adaptation in protein fitness landscapes is facilitated by indirect paths
Wu, Nicholas C; Dai, Lei; Olson, C Anders; Lloyd-Smith, James O; Sun, Ren
2016-01-01
The structure of fitness landscapes is critical for understanding adaptive protein evolution. Previous empirical studies on fitness landscapes were confined to either the neighborhood around the wild type sequence, involving mostly single and double mutants, or a combinatorially complete subgraph involving only two amino acids at each site. In reality, the dimensionality of protein sequence space is higher (20L) and there may be higher-order interactions among more than two sites. Here we experimentally characterized the fitness landscape of four sites in protein GB1, containing 204 = 160,000 variants. We found that while reciprocal sign epistasis blocked many direct paths of adaptation, such evolutionary traps could be circumvented by indirect paths through genotype space involving gain and subsequent loss of mutations. These indirect paths alleviate the constraint on adaptive protein evolution, suggesting that the heretofore neglected dimensions of sequence space may change our views on how proteins evolve. DOI: http://dx.doi.org/10.7554/eLife.16965.001 PMID:27391790
Identifying and reducing error in cluster-expansion approximations of protein energies.
Hahn, Seungsoo; Ashenberg, Orr; Grigoryan, Gevorg; Keating, Amy E
2010-12-01
Protein design involves searching a vast space for sequences that are compatible with a defined structure. This can pose significant computational challenges. Cluster expansion is a technique that can accelerate the evaluation of protein energies by generating a simple functional relationship between sequence and energy. The method consists of several steps. First, for a given protein structure, a training set of sequences with known energies is generated. Next, this training set is used to expand energy as a function of clusters consisting of single residues, residue pairs, and higher order terms, if required. The accuracy of the sequence-based expansion is monitored and improved using cross-validation testing and iterative inclusion of additional clusters. As a trade-off for evaluation speed, the cluster-expansion approximation causes prediction errors, which can be reduced by including more training sequences, including higher order terms in the expansion, and/or reducing the sequence space described by the cluster expansion. This article analyzes the sources of error and introduces a method whereby accuracy can be improved by judiciously reducing the described sequence space. The method is applied to describe the sequence-stability relationship for several protein structures: coiled-coil dimers and trimers, a PDZ domain, and T4 lysozyme as examples with computationally derived energies, and SH3 domains in amphiphysin-1 and endophilin-1 as examples where the expanded pseudo-energies are obtained from experiments. Our open-source software package Cluster Expansion Version 1.0 allows users to expand their own energy function of interest and thereby apply cluster expansion to custom problems in protein design. © 2010 Wiley Periodicals, Inc.
Fujimori, Shigeo; Hirai, Naoya; Ohashi, Hiroyuki; Masuoka, Kazuyo; Nishikimi, Akihiko; Fukui, Yoshinori; Washio, Takanori; Oshikubo, Tomohiro; Yamashita, Tatsuhiro; Miyamoto-Sato, Etsuko
2012-01-01
Next-generation sequencing (NGS) has been applied to various kinds of omics studies, resulting in many biological and medical discoveries. However, high-throughput protein-protein interactome datasets derived from detection by sequencing are scarce, because protein-protein interaction analysis requires many cell manipulations to examine the interactions. The low reliability of the high-throughput data is also a problem. Here, we describe a cell-free display technology combined with NGS that can improve both the coverage and reliability of interactome datasets. The completely cell-free method gives a high-throughput and a large detection space, testing the interactions without using clones. The quantitative information provided by NGS reduces the number of false positives. The method is suitable for the in vitro detection of proteins that interact not only with the bait protein, but also with DNA, RNA and chemical compounds. Thus, it could become a universal approach for exploring the large space of protein sequences and interactome networks. PMID:23056904
Structural genomics: keeping up with expanding knowledge of the protein universe.
Grabowski, Marek; Joachimiak, Andrzej; Otwinowski, Zbyszek; Minor, Wladek
2007-06-01
Structural characterization of the protein universe is the main mission of Structural Genomics (SG) programs. However, progress in gene sequencing technology, set in motion in the 1990s, has resulted in rapid expansion of protein sequence space--a twelvefold increase in the past seven years. For the SG field, this creates new challenges and necessitates a re-assessment of its strategies. Nevertheless, despite the growth of sequence space, at present nearly half of the content of the Swiss-Prot database and over 40% of Pfam protein families can be structurally modeled based on structures determined so far, with SG projects making an increasingly significant contribution. The SG contribution of new Pfam structures nearly doubled from 27.2% in 2003 to 51.6% in 2006.
Structural genomics: keeping up with expanding knowledge of the protein universe
Grabowski, Marek; Joachimiak, Andrzej; Otwinowski, Zbyszek; Minor, Wladek
2010-01-01
Structural characterization of the protein universe is the main mission of Structural Genomics (SG) programs. However, progress in gene sequencing technology, set in motion in the 1990s, has resulted in rapid expansion of protein sequence space — a twelvefold increase in the past seven years. For the SG field, this creates new challenges and necessitates a reassessment of its strategies. Nevertheless, despite the growth of sequence space, at present nearly half of the content of the Swiss-Prot database and over 40% of Pfam protein families can be structurally modeled based on structures determined so far, with SG projects making an increasingly significant contribution. The SG contribution of new Pfam structures nearly doubled from 27.2% in 2003 to 51.6% in 2006. PMID:17587562
Protein Science by DNA Sequencing: How Advances in Molecular Biology Are Accelerating Biochemistry.
Higgins, Sean A; Savage, David F
2018-01-09
A fundamental goal of protein biochemistry is to determine the sequence-function relationship, but the vastness of sequence space makes comprehensive evaluation of this landscape difficult. However, advances in DNA synthesis and sequencing now allow researchers to assess the functional impact of every single mutation in many proteins, but challenges remain in library construction and the development of general assays applicable to a diverse range of protein functions. This Perspective briefly outlines the technical innovations in DNA manipulation that allow massively parallel protein biochemistry and then summarizes the methods currently available for library construction and the functional assays of protein variants. Areas in need of future innovation are highlighted with a particular focus on assay development and the use of computational analysis with machine learning to effectively traverse the sequence-function landscape. Finally, applications in the fundamentals of protein biochemistry, disease prediction, and protein engineering are presented.
Bioinformatic prediction and in vivo validation of residue-residue interactions in human proteins
NASA Astrophysics Data System (ADS)
Jordan, Daniel; Davis, Erica; Katsanis, Nicholas; Sunyaev, Shamil
2014-03-01
Identifying residue-residue interactions in protein molecules is important for understanding both protein structure and function in the context of evolutionary dynamics and medical genetics. Such interactions can be difficult to predict using existing empirical or physical potentials, especially when residues are far from each other in sequence space. Using a multiple sequence alignment of 46 diverse vertebrate species we explore the space of allowed sequences for orthologous protein families. Amino acid changes that are known to damage protein function allow us to identify specific changes that are likely to have interacting partners. We fit the parameters of the continuous-time Markov process used in the alignment to conclude that these interactions are primarily pairwise, rather than higher order. Candidates for sites under pairwise epistasis are predicted, which can then be tested by experiment. We report the results of an initial round of in vivo experiments in a zebrafish model that verify the presence of multiple pairwise interactions predicted by our model. These experimentally validated interactions are novel, distant in sequence, and are not readily explained by known biochemical or biophysical features.
``Sequence space soup'' of proteins and copolymers
NASA Astrophysics Data System (ADS)
Chan, Hue Sun; Dill, Ken A.
1991-09-01
To study the protein folding problem, we use exhaustive computer enumeration to explore ``sequence space soup,'' an imaginary solution containing the ``native'' conformations (i.e., of lowest free energy) under folding conditions, of every possible copolymer sequence. The model is of short self-avoiding chains of hydrophobic (H) and polar (P) monomers configured on the two-dimensional square lattice. By exhaustive enumeration, we identify all native structures for every possible sequence. We find that random sequences of H/P copolymers will bear striking resemblance to known proteins: Most sequences under folding conditions will be approximately as compact as known proteins, will have considerable amounts of secondary structure, and it is most probable that an arbitrary sequence will fold to a number of lowest free energy conformations that is of order one. In these respects, this simple model shows that proteinlike behavior should arise simply in copolymers in which one monomer type is highly solvent averse. It suggests that the structures and uniquenesses of native proteins are not consequences of having 20 different monomer types, or of unique properties of amino acid monomers with regard to special packing or interactions, and thus that simple copolymers might be designable to collapse to proteinlike structures and properties. A good strategy for designing a sequence to have a minimum possible number of native states is to strategically insert many P monomers. Thus known proteins may be marginally stable due to a balance: More H residues stabilize the desired native state, but more P residues prevent simultaneous stabilization of undesired native states.
Exploration of the relationship between topology and designability of conformations
NASA Astrophysics Data System (ADS)
Leelananda, Sumudu P.; Towfic, Fadi; Jernigan, Robert L.; Kloczkowski, Andrzej
2011-06-01
Protein structures are evolutionarily more conserved than sequences, and sequences with very low sequence identity frequently share the same fold. This leads to the concept of protein designability. Some folds are more designable and lots of sequences can assume that fold. Elucidating the relationship between protein sequence and the three-dimensional (3D) structure that the sequence folds into is an important problem in computational structural biology. Lattice models have been utilized in numerous studies to model protein folds and predict the designability of certain folds. In this study, all possible compact conformations within a set of two-dimensional and 3D lattice spaces are explored. Complementary interaction graphs are then generated for each conformation and are described using a set of graph features. The full HP sequence space for each lattice model is generated and contact energies are calculated by threading each sequence onto all the possible conformations. Unique conformation giving minimum energy is identified for each sequence and the number of sequences folding to each conformation (designability) is obtained. Machine learning algorithms are used to predict the designability of each conformation. We find that the highly designable structures can be distinguished from other non-designable conformations based on certain graphical geometric features of the interactions. This finding confirms the fact that the topology of a conformation is an important determinant of the extent of its designability and suggests that the interactions themselves are important for determining the designability.
Chan, Yvonne H.; Venev, Sergey V.; Zeldovich, Konstantin B.; Matthews, C. Robert
2017-01-01
Sequence divergence of orthologous proteins enables adaptation to environmental stresses and promotes evolution of novel functions. Limits on evolution imposed by constraints on sequence and structure were explored using a model TIM barrel protein, indole-3-glycerol phosphate synthase (IGPS). Fitness effects of point mutations in three phylogenetically divergent IGPS proteins during adaptation to temperature stress were probed by auxotrophic complementation of yeast with prokaryotic, thermophilic IGPS. Analysis of beneficial mutations pointed to an unexpected, long-range allosteric pathway towards the active site of the protein. Significant correlations between the fitness landscapes of distant orthologues implicate both sequence and structure as primary forces in defining the TIM barrel fitness landscape and suggest that fitness landscapes can be translocated in sequence space. Exploration of fitness landscapes in the context of a protein fold provides a strategy for elucidating the sequence-structure-fitness relationships in other common motifs. PMID:28262665
Ali, Safdar; Majid, Abdul; Khan, Asifullah
2014-04-01
Development of an accurate and reliable intelligent decision-making method for the construction of cancer diagnosis system is one of the fast growing research areas of health sciences. Such decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein foldings, interactions, structures, and sequence-order effects. Especially, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine amino acids offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. Our analysis demonstrates that ensemble-RF, ensemble-SVM and ensemble-KNN are more effective than their individual counterparts. The proposed 'IDM-PhyChm-Ens' method has shown improved performance compared to existing techniques.
Bastien, Olivier; Ortet, Philippe; Roy, Sylvaine; Maréchal, Eric
2005-03-10
Popular methods to reconstruct molecular phylogenies are based on multiple sequence alignments, in which addition or removal of data may change the resulting tree topology. We have sought a representation of homologous proteins that would conserve the information of pair-wise sequence alignments, respect probabilistic properties of Z-scores (Monte Carlo methods applied to pair-wise comparisons) and be the basis for a novel method of consistent and stable phylogenetic reconstruction. We have built up a spatial representation of protein sequences using concepts from particle physics (configuration space) and respecting a frame of constraints deduced from pair-wise alignment score properties in information theory. The obtained configuration space of homologous proteins (CSHP) allows the representation of real and shuffled sequences, and thereupon an expression of the TULIP theorem for Z-score probabilities. Based on the CSHP, we propose a phylogeny reconstruction using Z-scores. Deduced trees, called TULIP trees, are consistent with multiple-alignment based trees. Furthermore, the TULIP tree reconstruction method provides a solution for some previously reported incongruent results, such as the apicomplexan enolase phylogeny. The CSHP is a unified model that conserves mutual information between proteins in the way physical models conserve energy. Applications include the reconstruction of evolutionary consistent and robust trees, the topology of which is based on a spatial representation that is not reordered after addition or removal of sequences. The CSHP and its assigned phylogenetic topology, provide a powerful and easily updated representation for massive pair-wise genome comparisons based on Z-score computations.
Neutrality and evolvability of designed protein sequences
NASA Astrophysics Data System (ADS)
Bhattacherjee, Arnab; Biswas, Parbati
2010-07-01
The effect of foldability on protein’s evolvability is analyzed by a two-prong approach consisting of a self-consistent mean-field theory and Monte Carlo simulations. Theory and simulation models representing protein sequences with binary patterning of amino acid residues compatible with a particular foldability criteria are used. This generalized foldability criterion is derived using the high temperature cumulant expansion approximating the free energy of folding. The effect of cumulative point mutations on these designed proteins is studied under neutral condition. The robustness, protein’s ability to tolerate random point mutations is determined with a selective pressure of stability (ΔΔG) for the theory designed sequences, which are found to be more robust than that of Monte Carlo and mean-field-biased Monte Carlo generated sequences. The results show that this foldability criterion selects viable protein sequences more effectively compared to the Monte Carlo method, which has a marked effect on how the selective pressure shapes the evolutionary sequence space. These observations may impact de novo sequence design and its applications in protein engineering.
El-Assaad, Atlal; Dawy, Zaher; Nemer, Georges; Hajj, Hazem; Kobeissy, Firas H
2017-01-01
Degradomics is a novel discipline that involves determination of the proteases/substrate fragmentation profile, called the substrate degradome, and has been recently applied in different disciplines. A major application of degradomics is its utility in the field of biomarkers where the breakdown products (BDPs) of different protease have been investigated. Among the major proteases assessed, calpain and caspase proteases have been associated with the execution phases of the pro-apoptotic and pro-necrotic cell death, generating caspase/calpain-specific cleaved fragments. The distinction between calpain and caspase protein fragments has been applied to distinguish injury mechanisms. Advanced proteomics technology has been used to identify these BDPs experimentally. However, it has been a challenge to identify these BDPs with high precision and efficiency, especially if we are targeting a number of proteins at one time. In this chapter, we present a novel bioinfromatic detection method that identifies BDPs accurately and efficiently with validation against experimental data. This method aims at predicting the consensus sequence occurrences and their variants in a large set of experimentally detected protein sequences based on state-of-the-art sequence matching and alignment algorithms. After detection, the method generates all the potential cleaved fragments by a specific protease. This space and time-efficient algorithm is flexible to handle the different orientations that the consensus sequence and the protein sequence can take before cleaving. It is O(mn) in space complexity and O(Nmn) in time complexity, with N number of protein sequences, m length of the consensus sequence, and n length of each protein sequence. Ultimately, this knowledge will subsequently feed into the development of a novel tool for researchers to detect diverse types of selected BDPs as putative disease markers, contributing to the diagnosis and treatment of related disorders.
Neumann, Sindy; Hartmann, Holger; Martin-Galiano, Antonio J; Fuchs, Angelika; Frishman, Dmitrij
2012-03-01
Structural bioinformatics of membrane proteins is still in its infancy, and the picture of their fold space is only beginning to emerge. Because only a handful of three-dimensional structures are available, sequence comparison and structure prediction remain the main tools for investigating sequence-structure relationships in membrane protein families. Here we present a comprehensive analysis of the structural families corresponding to α-helical membrane proteins with at least three transmembrane helices. The new version of our CAMPS database (CAMPS 2.0) covers nearly 1300 eukaryotic, prokaryotic, and viral genomes. Using an advanced classification procedure, which is based on high-order hidden Markov models and considers both sequence similarity as well as the number of transmembrane helices and loop lengths, we identified 1353 structurally homogeneous clusters roughly corresponding to membrane protein folds. Only 53 clusters are associated with experimentally determined three-dimensional structures, and for these clusters CAMPS is in reasonable agreement with structure-based classification approaches such as SCOP and CATH. We therefore estimate that ∼1300 structures would need to be determined to provide a sufficient structural coverage of polytopic membrane proteins. CAMPS 2.0 is available at http://webclu.bio.wzw.tum.de/CAMPS2.0/. Copyright © 2011 Wiley Periodicals, Inc.
Jaschob, Daniel; Davis, Trisha N; Riffle, Michael
2014-07-23
As high throughput sequencing continues to grow more commonplace, the need to disseminate the resulting data via web applications continues to grow. Particularly, there is a need to disseminate multiple versions of related gene and protein sequences simultaneously--whether they represent alleles present in a single species, variations of the same gene among different strains, or homologs among separate species. Often this is accomplished by displaying all versions of the sequence at once in a manner that is not intuitive or space-efficient and does not facilitate human understanding of the data. Web-based applications needing to disseminate multiple versions of sequences would benefit from a drop-in module designed to effectively disseminate these data. SnipViz is a client-side software tool designed to disseminate multiple versions of related gene and protein sequences on web sites. SnipViz has a space-efficient, interactive, and dynamic interface for navigating, analyzing and visualizing sequence data. It is written using standard World Wide Web technologies (HTML, Javascript, and CSS) and is compatible with most web browsers. SnipViz is designed as a modular client-side web component and may be incorporated into virtually any web site and be implemented without any programming. SnipViz is a drop-in client-side module for web sites designed to efficiently visualize and disseminate gene and protein sequences. SnipViz is open source and is freely available at https://github.com/yeastrc/snipviz.
Exploring the repeat protein universe through computational protein design
Brunette, TJ; Parmeggiani, Fabio; Huang, Po-Ssu; ...
2015-12-16
A central question in protein evolution is the extent to which naturally occurring proteins sample the space of folded structures accessible to the polypeptide chain. Repeat proteins composed of multiple tandem copies of a modular structure unit are widespread in nature and have critical roles in molecular recognition, signalling, and other essential biological processes. Naturally occurring repeat proteins have been re-engineered for molecular recognition and modular scaffolding applications. In this paper, we use computational protein design to investigate the space of folded structures that can be generated by tandem repeating a simple helix–loop–helix–loop structural motif. Eighty-three designs with sequences unrelatedmore » to known repeat proteins were experimentally characterized. Of these, 53 are monomeric and stable at 95 °C, and 43 have solution X-ray scattering spectra consistent with the design models. Crystal structures of 15 designs spanning a broad range of curvatures are in close agreement with the design models with root mean square deviations ranging from 0.7 to 2.5 Å. Finally, our results show that existing repeat proteins occupy only a small fraction of the possible repeat protein sequence and structure space and that it is possible to design novel repeat proteins with precisely specified geometries, opening up a wide array of new possibilities for biomolecular engineering.« less
Gerlt, John A; Bouvier, Jason T; Davidson, Daniel B; Imker, Heidi J; Sadkhin, Boris; Slater, David R; Whalen, Katie L
2015-08-01
The Enzyme Function Initiative, an NIH/NIGMS-supported Large-Scale Collaborative Project (EFI; U54GM093342; http://enzymefunction.org/), is focused on devising and disseminating bioinformatics and computational tools as well as experimental strategies for the prediction and assignment of functions (in vitro activities and in vivo physiological/metabolic roles) to uncharacterized enzymes discovered in genome projects. Protein sequence similarity networks (SSNs) are visually powerful tools for analyzing sequence relationships in protein families (H.J. Atkinson, J.H. Morris, T.E. Ferrin, and P.C. Babbitt, PLoS One 2009, 4, e4345). However, the members of the biological/biomedical community have not had access to the capability to generate SSNs for their "favorite" protein families. In this article we announce the EFI-EST (Enzyme Function Initiative-Enzyme Similarity Tool) web tool (http://efi.igb.illinois.edu/efi-est/) that is available without cost for the automated generation of SSNs by the community. The tool can create SSNs for the "closest neighbors" of a user-supplied protein sequence from the UniProt database (Option A) or of members of any user-supplied Pfam and/or InterPro family (Option B). We provide an introduction to SSNs, a description of EFI-EST, and a demonstration of the use of EFI-EST to explore sequence-function space in the OMP decarboxylase superfamily (PF00215). This article is designed as a tutorial that will allow members of the community to use the EFI-EST web tool for exploring sequence/function space in protein families. Copyright © 2015 Elsevier B.V. All rights reserved.
PinaColada: peptide-inhibitor ant colony ad-hoc design algorithm.
Zaidman, Daniel; Wolfson, Haim J
2016-08-01
Design of protein-protein interaction (PPI) inhibitors is a major challenge in Structural Bioinformatics. Peptides, especially short ones (5-15 amino acid long), are natural candidates for inhibition of protein-protein complexes due to several attractive features such as high structural compatibility with the protein binding site (mimicking the surface of one of the proteins), small size and the ability to form strong hotspot binding connections with the protein surface. Efficient rational peptide design is still a major challenge in computer aided drug design, due to the huge space of possible sequences, which is exponential in the length of the peptide, and the high flexibility of peptide conformations. In this article we present PinaColada, a novel computational method for the design of peptide inhibitors for protein-protein interactions. We employ a version of the ant colony optimization heuristic, which is used to explore the exponential space ([Formula: see text]) of length n peptide sequences, in combination with our fast robotics motivated PepCrawler algorithm, which explores the conformational space for each candidate sequence. PinaColada is being run in parallel, on a DELL PowerEdge 2.8 GHZ computer with 20 cores and 256 GB memory, and takes up to 24 h to design a peptide of 5-15 amino acids length. An online server available at: http://bioinfo3d.cs.tau.ac.il/PinaColada/. danielza@post.tau.ac.il; wolfson@tau.ac.il. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Sequence co-evolution gives 3D contacts and structures of protein complexes
Hopf, Thomas A; Schärfe, Charlotta P I; Rodrigues, João P G L M; Green, Anna G; Kohlbacher, Oliver; Sander, Chris; Bonvin, Alexandre M J J; Marks, Debora S
2014-01-01
Protein–protein interactions are fundamental to many biological processes. Experimental screens have identified tens of thousands of interactions, and structural biology has provided detailed functional insight for select 3D protein complexes. An alternative rich source of information about protein interactions is the evolutionary sequence record. Building on earlier work, we show that analysis of correlated evolutionary sequence changes across proteins identifies residues that are close in space with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We evaluate prediction performance in blinded tests on 76 complexes of known 3D structure, predict protein–protein contacts in 32 complexes of unknown structure, and demonstrate how evolutionary couplings can be used to distinguish between interacting and non-interacting protein pairs in a large complex. With the current growth of sequences, we expect that the method can be generalized to genome-wide elucidation of protein–protein interaction networks and used for interaction predictions at residue resolution. DOI: http://dx.doi.org/10.7554/eLife.03430.001 PMID:25255213
RNA regulators responding to ribosomal protein S15 are frequent in sequence space
Slinger, Betty L.; Meyer, Michelle M.
2016-01-01
There are several natural examples of distinct RNA structures that interact with the same ligand to regulate the expression of homologous genes in different organisms. One essential question regarding this phenomenon is whether such RNA regulators are the result of convergent or divergent evolution. Are the RNAs derived from some common ancestor and diverged to the point where we cannot identify the similarity, or have multiple solutions to the same biological problem arisen independently? A key variable in assessing these alternatives is how frequently such regulators arise within sequence space. Ribosomal protein S15 is autogenously regulated via an RNA regulator in many bacterial species; four apparently distinct regulators have been functionally validated in different bacterial phyla. Here, we explore how frequently such regulators arise within a partially randomized sequence population. We find many RNAs that interact specifically with ribosomal protein S15 from Geobacillus kaustophilus with biologically relevant dissociation constants. Furthermore, of the six sequences we characterize, four show regulatory activity in an Escherichia coli reporter assay. Subsequent footprinting and mutagenesis analysis indicates that protein binding proximal to regulatory features such as the Shine–Dalgarno sequence is sufficient to enable regulation, suggesting that regulation in response to S15 is relatively easily acquired. PMID:27580716
Validating a Coarse-Grained Potential Energy Function through Protein Loop Modelling
MacDonald, James T.; Kelley, Lawrence A.; Freemont, Paul S.
2013-01-01
Coarse-grained (CG) methods for sampling protein conformational space have the potential to increase computational efficiency by reducing the degrees of freedom. The gain in computational efficiency of CG methods often comes at the expense of non-protein like local conformational features. This could cause problems when transitioning to full atom models in a hierarchical framework. Here, a CG potential energy function was validated by applying it to the problem of loop prediction. A novel method to sample the conformational space of backbone atoms was benchmarked using a standard test set consisting of 351 distinct loops. This method used a sequence-independent CG potential energy function representing the protein using -carbon positions only and sampling conformations with a Monte Carlo simulated annealing based protocol. Backbone atoms were added using a method previously described and then gradient minimised in the Rosetta force field. Despite the CG potential energy function being sequence-independent, the method performed similarly to methods that explicitly use either fragments of known protein backbones with similar sequences or residue-specific /-maps to restrict the search space. The method was also able to predict with sub-Angstrom accuracy two out of seven loops from recently solved crystal structures of proteins with low sequence and structure similarity to previously deposited structures in the PDB. The ability to sample realistic loop conformations directly from a potential energy function enables the incorporation of additional geometric restraints and the use of more advanced sampling methods in a way that is not possible to do easily with fragment replacement methods and also enable multi-scale simulations for protein design and protein structure prediction. These restraints could be derived from experimental data or could be design restraints in the case of computational protein design. C++ source code is available for download from http://www.sbg.bio.ic.ac.uk/phyre2/PD2/. PMID:23824634
Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria
Farasat, Iman; Kushwaha, Manish; Collens, Jason; Easterbrook, Michael; Guido, Matthew; Salis, Howard M
2014-01-01
Developing predictive models of multi-protein genetic systems to understand and optimize their behavior remains a combinatorial challenge, particularly when measurement throughput is limited. We developed a computational approach to build predictive models and identify optimal sequences and expression levels, while circumventing combinatorial explosion. Maximally informative genetic system variants were first designed by the RBS Library Calculator, an algorithm to design sequences for efficiently searching a multi-protein expression space across a > 10,000-fold range with tailored search parameters and well-predicted translation rates. We validated the algorithm's predictions by characterizing 646 genetic system variants, encoded in plasmids and genomes, expressed in six gram-positive and gram-negative bacterial hosts. We then combined the search algorithm with system-level kinetic modeling, requiring the construction and characterization of 73 variants to build a sequence-expression-activity map (SEAMAP) for a biosynthesis pathway. Using model predictions, we designed and characterized 47 additional pathway variants to navigate its activity space, find optimal expression regions with desired activity response curves, and relieve rate-limiting steps in metabolism. Creating sequence-expression-activity maps accelerates the optimization of many protein systems and allows previous measurements to quantitatively inform future designs. PMID:24952589
Domain similarity based orthology detection.
Bitard-Feildel, Tristan; Kemena, Carsten; Greenwood, Jenny M; Bornberg-Bauer, Erich
2015-05-13
Orthologous protein detection software mostly uses pairwise comparisons of amino-acid sequences to assert whether two proteins are orthologous or not. Accordingly, when the number of sequences for comparison increases, the number of comparisons to compute grows in a quadratic order. A current challenge of bioinformatic research, especially when taking into account the increasing number of sequenced organisms available, is to make this ever-growing number of comparisons computationally feasible in a reasonable amount of time. We propose to speed up the detection of orthologous proteins by using strings of domains to characterize the proteins. We present two new protein similarity measures, a cosine and a maximal weight matching score based on domain content similarity, and new software, named porthoDom. The qualities of the cosine and the maximal weight matching similarity measures are compared against curated datasets. The measures show that domain content similarities are able to correctly group proteins into their families. Accordingly, the cosine similarity measure is used inside porthoDom, the wrapper developed for proteinortho. porthoDom makes use of domain content similarity measures to group proteins together before searching for orthologs. By using domains instead of amino acid sequences, the reduction of the search space decreases the computational complexity of an all-against-all sequence comparison. We demonstrate that representing and comparing proteins as strings of discrete domains, i.e. as a concatenation of their unique identifiers, allows a drastic simplification of search space. porthoDom has the advantage of speeding up orthology detection while maintaining a degree of accuracy similar to proteinortho. The implementation of porthoDom is released using python and C++ languages and is available under the GNU GPL licence 3 at http://www.bornberglab.org/pages/porthoda .
Hidden long evolutionary memory in a model biochemical network
NASA Astrophysics Data System (ADS)
Ali, Md. Zulfikar; Wingreen, Ned S.; Mukhopadhyay, Ranjan
2018-04-01
We introduce a minimal model for the evolution of functional protein-interaction networks using a sequence-based mutational algorithm, and apply the model to study neutral drift in networks that yield oscillatory dynamics. Starting with a functional core module, random evolutionary drift increases network complexity even in the absence of specific selective pressures. Surprisingly, we uncover a hidden order in sequence space that gives rise to long-term evolutionary memory, implying strong constraints on network evolution due to the topology of accessible sequence space.
Wu, Fang; Li, Yikun; Chang, Shaojie; Zhou, Zhaocai; Wang, Fang; Song, Xiaomin; Lin, Yujuan; Gong, Weimin
2002-12-01
A 16 kDa protein SPE16 was purified from the seeds of Pachyrrhizus erosus. Its N-terminal amino-acid sequence showed significant sequence homology to pathogenesis-related proteins from the PR-10 family. An activity assay indicated that SPE16 possesses ribonuclease activity as do some other PR-10 proteins. SPE16 crystals were obtained by the hanging-drop vapour-diffusion method. The space group is P2(1)2(1)2(1), with unit-cell parameters a = 53.36, b = 63.70, c = 72.96 A.
Relationships between residue Voronoi volume and sequence conservation in proteins.
Liu, Jen-Wei; Cheng, Chih-Wen; Lin, Yu-Feng; Chen, Shao-Yu; Hwang, Jenn-Kang; Yen, Shih-Chung
2018-02-01
Functional and biophysical constraints can cause different levels of sequence conservation in proteins. Previously, structural properties, e.g., relative solvent accessibility (RSA) and packing density of the weighted contact number (WCN), have been found to be related to protein sequence conservation (CS). The Voronoi volume has recently been recognized as a new structural property of the local protein structural environment reflecting CS. However, for surface residues, it is sensitive to water molecules surrounding the protein structure. Herein, we present a simple structural determinant termed the relative space of Voronoi volume (RSV); it uses the Voronoi volume and the van der Waals volume of particular residues to quantify the local structural environment. RSV (range, 0-1) is defined as (Voronoi volume-van der Waals volume)/Voronoi volume of the target residue. The concept of RSV describes the extent of available space for every protein residue. RSV and Voronoi profiles with and without water molecules (RSVw, RSV, VOw, and VO) were compared for 554 non-homologous proteins. RSV (without water) showed better Pearson's correlations with CS than did RSVw, VO, or VOw values. The mean correlation coefficient between RSV and CS was 0.51, which is comparable to the correlation between RSA and CS (0.49) and that between WCN and CS (0.56). RSV is a robust structural descriptor with and without water molecules and can quantitatively reflect evolutionary information in a single protein structure. Therefore, it may represent a practical structural determinant to study protein sequence, structure, and function relationships. Copyright © 2017 Elsevier B.V. All rights reserved.
Protein space: a natural method for realizing the nature of protein universe.
Yu, Chenglong; Deng, Mo; Cheng, Shiu-Yuen; Yau, Shek-Chung; He, Rong L; Yau, Stephen S-T
2013-02-07
Current methods cannot tell us what the nature of the protein universe is concretely. They are based on different models of amino acid substitution and multiple sequence alignment which is an NP-hard problem and requires manual intervention. Protein structural analysis also gives a direction for mapping the protein universe. Unfortunately, now only a minuscule fraction of proteins' 3-dimensional structures are known. Furthermore, the phylogenetic tree representations are not unique for any existing tree construction methods. Here we develop a novel method to realize the nature of protein universe. We show the protein universe can be realized as a protein space in 60-dimensional Euclidean space using a distance based on a normalized distribution of amino acids. Every protein is in one-to-one correspondence with a point in protein space, where proteins with similar properties stay close together. Thus the distance between two points in protein space represents the biological distance of the corresponding two proteins. We also propose a natural graphical representation for inferring phylogenies. The representation is natural and unique based on the biological distances of proteins in protein space. This will solve the fundamental question of how proteins are distributed in the protein universe. Copyright © 2012 Elsevier Ltd. All rights reserved.
Karfeld-Sulzer, Lindsay S.; Waters, Emily A.; Davis, Nicolynn E.; Meade, Thomas J.; Barron, Annelise E.
2010-01-01
Magnetic Resonance Imaging (MRI) is a noninvasive imaging modality with high spatial and temporal resolution. Contrast agents (CAs) are frequently used to increase the contrast between tissues of interest. To increase the effectiveness of MR agents, small molecule CAs have been attached to macromolecules. We have created a family of biodegradable, macromolecular CAs based on protein polymers, allowing control over the CA properties. The protein polymers are monodisperse, random coil, and contain evenly spaced lysines that serve as reactive sites for Gd(III) chelates. The exact sequence and length of the protein can be specified, enabling controlled variation in lysine spacing and molecular weight. Relaxivity could be modulated by changing protein polymer length and lysine spacing. Relaxivities of up to ∼14 mM-1s-1 per Gd(III) and ∼461 mM-1s-1 per conjugate were observed. These CAs are biodegradable by incubation with plasmin, such that they can be easily excreted after use. They do not reduce cell viability, a prerequisite for future in vivo studies. The protein polymer CAs can be customized for different clinical diagnostic applications, including biomaterial tracking, as a balanced agent with high relaxivity and appropriate molar mass. PMID:20420441
Pease, Anthony; Sullivan, Stacey; Olby, Natasha; Galano, Heather; Cerda-Gonzalez, Sophia; Robertson, Ian D; Gavin, Patrick; Thrall, Donald
2006-01-01
Three case history reports are presented to illustrate the value of the single-shot turbo spin-echo pulse sequence for assessment of the subarachnoid space. The use of the single-shot turbo spin-echo pulse sequence, which is a heavily T2-weighted sequence, allows for a rapid, noninvasive evaluation of the subarachnoid space by using the high signal from cerebrospinal fluid. This sequence can be completed in seconds rather than the several minutes required for a T2-fast spin-echo sequence. Unlike the standard T2-fast spin-echo sequence, a single-shot turbo spin-echo pulse sequence also provides qualitative information about the protein and the cellular content of the cerebrospinal fluid, such as in patients with inflammatory debris or hemorrhage in the cerebrospinal fluid. Although the resolution of the single-shot turbo spin-echo pulse sequence images is relatively poor compared with more conventional sequences, the qualitative information about the subarachnoid space and cerebrospinal fluid and the rapid acquisition time, make it a useful sequence to include in standard protocols of spinal magnetic resonance imaging.
Gerlt, John A
2017-08-22
The exponentially increasing number of protein and nucleic acid sequences provides opportunities to discover novel enzymes, metabolic pathways, and metabolites/natural products, thereby adding to our knowledge of biochemistry and biology. The challenge has evolved from generating sequence information to mining the databases to integrating and leveraging the available information, i.e., the availability of "genomic enzymology" web tools. Web tools that allow identification of biosynthetic gene clusters are widely used by the natural products/synthetic biology community, thereby facilitating the discovery of novel natural products and the enzymes responsible for their biosynthesis. However, many novel enzymes with interesting mechanisms participate in uncharacterized small-molecule metabolic pathways; their discovery and functional characterization also can be accomplished by leveraging information in protein and nucleic acid databases. This Perspective focuses on two genomic enzymology web tools that assist the discovery novel metabolic pathways: (1) Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) for generating sequence similarity networks to visualize and analyze sequence-function space in protein families and (2) Enzyme Function Initiative-Genome Neighborhood Tool (EFI-GNT) for generating genome neighborhood networks to visualize and analyze the genome context in microbial and fungal genomes. Both tools have been adapted to other applications to facilitate target selection for enzyme discovery and functional characterization. As the natural products community has demonstrated, the enzymology community needs to embrace the essential role of web tools that allow the protein and genome sequence databases to be leveraged for novel insights into enzymological problems.
Automated design evolution of stereochemically randomized protein foldamers
NASA Astrophysics Data System (ADS)
Ranbhor, Ranjit; Kumar, Anil; Patel, Kirti; Ramakrishnan, Vibin; Durani, Susheel
2018-05-01
Diversification of chain stereochemistry opens up the possibilities of an ‘in principle’ increase in the design space of proteins. This huge increase in the sequence and consequent structural variation is aimed at the generation of smart materials. To diversify protein structure stereochemically, we introduced L- and D-α-amino acids as the design alphabet. With a sequence design algorithm, we explored the usage of specific variables such as chirality and the sequence of this alphabet in independent steps. With molecular dynamics, we folded stereochemically diverse homopolypeptides and evaluated their ‘fitness’ for possible design as protein-like foldamers. We propose a fitness function to prune the most optimal fold among 1000 structures simulated with an automated repetitive simulated annealing molecular dynamics (AR-SAMD) approach. The highly scored poly-leucine fold with sequence lengths of 24 and 30 amino acids were later sequence-optimized using a Dead End Elimination cum Monte Carlo based optimization tool. This paper demonstrates a novel approach for the de novo design of protein-like foldamers.
A Generative Angular Model of Protein Structure Evolution
Golden, Michael; García-Portugués, Eduardo; Sørensen, Michael; Mardia, Kanti V.; Hamelryck, Thomas; Hein, Jotun
2017-01-01
Abstract Recently described stochastic models of protein evolution have demonstrated that the inclusion of structural information in addition to amino acid sequences leads to a more reliable estimation of evolutionary parameters. We present a generative, evolutionary model of protein structure and sequence that is valid on a local length scale. The model concerns the local dependencies between sequence and structure evolution in a pair of homologous proteins. The evolutionary trajectory between the two structures in the protein pair is treated as a random walk in dihedral angle space, which is modeled using a novel angular diffusion process on the two-dimensional torus. Coupling sequence and structure evolution in our model allows for modeling both “smooth” conformational changes and “catastrophic” conformational jumps, conditioned on the amino acid changes. The model has interpretable parameters and is comparatively more realistic than previous stochastic models, providing new insights into the relationship between sequence and structure evolution. For example, using the trained model we were able to identify an apparent sequence–structure evolutionary motif present in a large number of homologous protein pairs. The generative nature of our model enables us to evaluate its validity and its ability to simulate aspects of protein evolution conditioned on an amino acid sequence, a related amino acid sequence, a related structure or any combination thereof. PMID:28453724
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-07-01
UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.
Scop3D: three-dimensional visualization of sequence conservation.
Vermeire, Tessa; Vermaere, Stijn; Schepens, Bert; Saelens, Xavier; Van Gucht, Steven; Martens, Lennart; Vandermarliere, Elien
2015-04-01
The integration of a protein's structure with its known sequence variation provides insight on how that protein evolves, for instance in terms of (changing) function or immunogenicity. Yet, collating the corresponding sequence variants into a multiple sequence alignment, calculating each position's conservation, and mapping this information back onto a relevant structure is not straightforward. We therefore built the Sequence Conservation on Protein 3D structure (scop3D) tool to perform these tasks automatically. The output consists of two modified PDB files in which the B-values for each position are replaced by the percentage sequence conservation, or the information entropy for each position, respectively. Furthermore, text files with absolute and relative amino acid occurrences for each position are also provided, along with snapshots of the protein from six distinct directions in space. The visualization provided by scop3D can for instance be used as an aid in vaccine development or to identify antigenic hotspots, which we here demonstrate based on an analysis of the fusion proteins of human respiratory syncytial virus and mumps virus. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Using cellular automata to generate image representation for biological sequences.
Xiao, X; Shao, S; Ding, Y; Huang, Z; Chen, X; Chou, K-C
2005-02-01
A novel approach to visualize biological sequences is developed based on cellular automata (Wolfram, S. Nature 1984, 311, 419-424), a set of discrete dynamical systems in which space and time are discrete. By transforming the symbolic sequence codes into the digital codes, and using some optimal space-time evolvement rules of cellular automata, a biological sequence can be represented by a unique image, the so-called cellular automata image. Many important features, which are originally hidden in a long and complicated biological sequence, can be clearly revealed thru its cellular automata image. With biological sequences entering into databanks rapidly increasing in the post-genomic era, it is anticipated that the cellular automata image will become a very useful vehicle for investigation into their key features, identification of their function, as well as revelation of their "fingerprint". It is anticipated that by using the concept of the pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246-255), the cellular automata image approach can also be used to improve the quality of predicting protein attributes, such as structural class and subcellular location.
Song, Jiangning; Yuan, Zheng; Tan, Hao; Huber, Thomas; Burrage, Kevin
2007-12-01
Disulfide bonds are primary covalent crosslinks between two cysteine residues in proteins that play critical roles in stabilizing the protein structures and are commonly found in extracy-toplasmatic or secreted proteins. In protein folding prediction, the localization of disulfide bonds can greatly reduce the search in conformational space. Therefore, there is a great need to develop computational methods capable of accurately predicting disulfide connectivity patterns in proteins that could have potentially important applications. We have developed a novel method to predict disulfide connectivity patterns from protein primary sequence, using a support vector regression (SVR) approach based on multiple sequence feature vectors and predicted secondary structure by the PSIPRED program. The results indicate that our method could achieve a prediction accuracy of 74.4% and 77.9%, respectively, when averaged on proteins with two to five disulfide bridges using 4-fold cross-validation, measured on the protein and cysteine pair on a well-defined non-homologous dataset. We assessed the effects of different sequence encoding schemes on the prediction performance of disulfide connectivity. It has been shown that the sequence encoding scheme based on multiple sequence feature vectors coupled with predicted secondary structure can significantly improve the prediction accuracy, thus enabling our method to outperform most of other currently available predictors. Our work provides a complementary approach to the current algorithms that should be useful in computationally assigning disulfide connectivity patterns and helps in the annotation of protein sequences generated by large-scale whole-genome projects. The prediction web server and Supplementary Material are accessible at http://foo.maths.uq.edu.au/~huber/disulfide
Protein 3D Structure Computed from Evolutionary Sequence Variation
Sheridan, Robert; Hopf, Thomas A.; Pagnani, Andrea; Zecchina, Riccardo; Sander, Chris
2011-01-01
The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues., including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes. PMID:22163331
Ma, Yue; Tuskan, Gerald A.
2018-01-01
The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here) from the protein distribution densities in the LD space defined by ln(L) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct with each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level. PMID:29686995
Smith, Colin A; Kortemme, Tanja
2011-01-01
Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s) are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface), interactions between and within parts of the structure (e.g. domains) can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-01-01
Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il PMID:18586742
GATA: A graphic alignment tool for comparative sequenceanalysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nix, David A.; Eisen, Michael B.
2005-01-01
Several problems exist with current methods used to align DNA sequences for comparative sequence analysis. Most dynamic programming algorithms assume that conserved sequence elements are collinear. This assumption appears valid when comparing orthologous protein coding sequences. Functional constraints on proteins provide strong selective pressure against sequence inversions, and minimize sequence duplications and feature shuffling. For non-coding sequences this collinearity assumption is often invalid. For example, enhancers contain clusters of transcription factor binding sites that change in number, orientation, and spacing during evolution yet the enhancer retains its activity. Dotplot analysis is often used to estimate non-coding sequence relatedness. Yet dotmore » plots do not actually align sequences and thus cannot account well for base insertions or deletions. Moreover, they lack an adequate statistical framework for comparing sequence relatedness and are limited to pairwise comparisons. Lastly, dot plots and dynamic programming text outputs fail to provide an intuitive means for visualizing DNA alignments.« less
The Universal Protein Resource (UniProt): an expanding universe of protein information.
Wu, Cathy H; Apweiler, Rolf; Bairoch, Amos; Natale, Darren A; Barker, Winona C; Boeckmann, Brigitte; Ferro, Serenella; Gasteiger, Elisabeth; Huang, Hongzhan; Lopez, Rodrigo; Magrane, Michele; Martin, Maria J; Mazumder, Raja; O'Donovan, Claire; Redaschi, Nicole; Suzek, Baris
2006-01-01
The Universal Protein Resource (UniProt) provides a central resource on protein sequences and functional annotation with three database components, each addressing a key need in protein bioinformatics. The UniProt Knowledgebase (UniProtKB), comprising the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section, is the preeminent storehouse of protein annotation. The extensive cross-references, functional and feature annotations and literature-based evidence attribution enable scientists to analyse proteins and query across databases. The UniProt Reference Clusters (UniRef) speed similarity searches via sequence space compression by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical. Finally, the UniProt Archive (UniParc) stores all publicly available protein sequences, containing the history of sequence data with links to the source databases. UniProt databases continue to grow in size and in availability of information. Recent and upcoming changes to database contents, formats, controlled vocabularies and services are described. New download availability includes all major releases of UniProtKB, sequence collections by taxonomic division and complete proteomes. A bibliography mapping service has been added, and an ID mapping service will be available soon. UniProt databases can be accessed online at http://www.uniprot.org or downloaded at ftp://ftp.uniprot.org/pub/databases/.
Dutta, Sanjib; Koide, Akiko; Koide, Shohei
2008-01-01
Stability evaluation of many mutants can lead to a better understanding of the sequence determinants of a structural motif and of factors governing protein stability and protein evolution. The traditional biophysical analysis of protein stability is low throughput, limiting our ability to widely explore the sequence space in a quantitative manner. In this study, we have developed a high-throughput library screening method for quantifying stability changes, which is based on protein fragment reconstitution and yeast surface display. Our method exploits the thermodynamic linkage between protein stability and fragment reconstitution and the ability of the yeast surface display technique to quantitatively evaluate protein-protein interactions. The method was applied to a fibronectin type III (FN3) domain. Characterization of fragment reconstitution was facilitated by the co-expression of two FN3 fragments, thus establishing a "yeast surface two-hybrid" method. Importantly, our method does not rely on competition between clones and thus eliminates a common limitation of high-throughput selection methods in which the most stable variants are predominantly recovered. Thus, it allows for the isolation of sequences that exhibits a desired level of stability. We identified over one hundred unique sequences for a β-bulge motif, which was significantly more informative than natural sequences of the FN3 family in revealing the sequence determinants for the β-bulge. Our method provides a powerful means to rapidly assess stability of many variants, to systematically assess contribution of different factors to protein stability and to enhance protein stability. PMID:18674545
Kono, H; Saven, J G
2001-02-23
Combinatorial experiments provide new ways to probe the determinants of protein folding and to identify novel folding amino acid sequences. These types of experiments, however, are complicated both by enormous conformational complexity and by large numbers of possible sequences. Therefore, a quantitative computational theory would be helpful in designing and interpreting these types of experiment. Here, we present and apply a statistically based, computational approach for identifying the properties of sequences compatible with a given main-chain structure. Protein side-chain conformations are included in an atom-based fashion. Calculations are performed for a variety of similar backbone structures to identify sequence properties that are robust with respect to minor changes in main-chain structure. Rather than specific sequences, the method yields the likelihood of each of the amino acids at preselected positions in a given protein structure. The theory may be used to quantify the characteristics of sequence space for a chosen structure without explicitly tabulating sequences. To account for hydrophobic effects, we introduce an environmental energy that it is consistent with other simple hydrophobicity scales and show that it is effective for side-chain modeling. We apply the method to calculate the identity probabilities of selected positions of the immunoglobulin light chain-binding domain of protein L, for which many variant folding sequences are available. The calculations compare favorably with the experimentally observed identity probabilities.
Stegemann, Björn; Klebe, Gerhard
2012-02-01
Small molecules are recognized in protein-binding pockets through surface-exposed physicochemical properties. To optimize binding, they have to adopt a conformation corresponding to a local energy minimum within the formed protein-ligand complex. However, their conformational flexibility makes them competent to bind not only to homologous proteins of the same family but also to proteins of remote similarity with respect to the shape of the binding pockets and folding pattern. Considering drug action, such observations can give rise to unexpected and undesired cross reactivity. In this study, datasets of six different cofactors (ADP, ATP, NAD(P)(H), FAD, and acetyl CoA, sharing an adenosine diphosphate moiety as common substructure), observed in multiple crystal structures of protein-cofactor complexes exhibiting sequence identity below 25%, have been analyzed for the conformational properties of the bound ligands, the distribution of physicochemical properties in the accommodating protein-binding pockets, and the local folding patterns next to the cofactor-binding site. State-of-the-art clustering techniques have been applied to group the different protein-cofactor complexes in the different spaces. Interestingly, clustering in cavity (Cavbase) and fold space (DALI) reveals virtually the same data structuring. Remarkable relationships can be found among the different spaces. They provide information on how conformations are conserved across the host proteins and which distinct local cavity and fold motifs recognize the different portions of the cofactors. In those cases, where different cofactors are found to be accommodated in a similar fashion to the same fold motifs, only a commonly shared substructure of the cofactors is used for the recognition process. Copyright © 2011 Wiley Periodicals, Inc.
Flexible DNA binding of the BTB/POZ-domain protein FBI-1.
Pessler, Frank; Hernandez, Nouria
2003-08-01
POZ-domain transcription factors are characterized by the presence of a protein-protein interaction domain called the POZ or BTB domain at their N terminus and zinc fingers at their C terminus. Despite the large number of POZ-domain transcription factors that have been identified to date and the significant insights that have been gained into their cellular functions, relatively little is known about their DNA binding properties. FBI-1 is a BTB/POZ-domain protein that has been shown to modulate HIV-1 Tat trans-activation and to repress transcription of some cellular genes. We have used various viral and cellular FBI-1 binding sites to characterize the interaction of a POZ-domain protein with DNA in detail. We find that FBI-1 binds to inverted sequence repeats downstream of the HIV-1 transcription start site. Remarkably, it binds efficiently to probes carrying these repeats in various orientations and spacings with no particular rotational alignment, indicating that its interaction with DNA is highly flexible. Indeed, FBI-1 binding sites in the adenovirus 2 major late promoter, the c-fos gene, and the c-myc P1 and P2 promoters reveal variously spaced direct, inverted, and everted sequence repeats with the consensus sequence G(A/G)GGG(T/C)(C/T)(T/C)(C/T) for each repeat.
Conformational Entropy of Intrinsically Disordered Proteins from Amino Acid Triads
Baruah, Anupaul; Rani, Pooja; Biswas, Parbati
2015-01-01
This work quantitatively characterizes intrinsic disorder in proteins in terms of sequence composition and backbone conformational entropy. Analysis of the normalized relative composition of the amino acid triads highlights a distinct boundary between globular and disordered proteins. The conformational entropy is calculated from the dihedral angles of the middle amino acid in the amino acid triad for the conformational ensemble of the globular, partially and completely disordered proteins relative to the non-redundant database. Both Monte Carlo (MC) and Molecular Dynamics (MD) simulations are used to characterize the conformational ensemble of the representative proteins of each group. The results show that the globular proteins span approximately half of the allowed conformational states in the Ramachandran space, while the amino acid triads in disordered proteins sample the entire range of the allowed dihedral angle space following Flory’s isolated-pair hypothesis. Therefore, only the sequence information in terms of the relative amino acid triad composition may be sufficient to predict protein disorder and the backbone conformational entropy, even in the absence of well-defined structure. The predicted entropies are found to agree with those calculated using mutual information expansion and the histogram method. PMID:26138206
Principles of protein folding--a perspective from simple exact models.
Dill, K. A.; Bromberg, S.; Yue, K.; Fiebig, K. M.; Yee, D. P.; Thomas, P. D.; Chan, H. S.
1995-01-01
General principles of protein structure, stability, and folding kinetics have recently been explored in computer simulations of simple exact lattice models. These models represent protein chains at a rudimentary level, but they involve few parameters, approximations, or implicit biases, and they allow complete explorations of conformational and sequence spaces. Such simulations have resulted in testable predictions that are sometimes unanticipated: The folding code is mainly binary and delocalized throughout the amino acid sequence. The secondary and tertiary structures of a protein are specified mainly by the sequence of polar and nonpolar monomers. More specific interactions may refine the structure, rather than dominate the folding code. Simple exact models can account for the properties that characterize protein folding: two-state cooperativity, secondary and tertiary structures, and multistage folding kinetics--fast hydrophobic collapse followed by slower annealing. These studies suggest the possibility of creating "foldable" chain molecules other than proteins. The encoding of a unique compact chain conformation may not require amino acids; it may require only the ability to synthesize specific monomer sequences in which at least one monomer type is solvent-averse. PMID:7613459
Mahajan, Gaurang; Mande, Shekhar C
2017-04-04
A comprehensive map of the human-M. tuberculosis (MTB) protein interactome would help fill the gaps in our understanding of the disease, and computational prediction can aid and complement experimental studies towards this end. Several sequence-based in silico approaches tap the existing data on experimentally validated protein-protein interactions (PPIs); these PPIs serve as templates from which novel interactions between pathogen and host are inferred. Such comparative approaches typically make use of local sequence alignment, which, in the absence of structural details about the interfaces mediating the template interactions, could lead to incorrect inferences, particularly when multi-domain proteins are involved. We propose leveraging the domain-domain interaction (DDI) information in PDB complexes to score and prioritize candidate PPIs between host and pathogen proteomes based on targeted sequence-level comparisons. Our method picks out a small set of human-MTB protein pairs as candidates for physical interactions, and the use of functional meta-data suggests that some of them could contribute to the in vivo molecular cross-talk between pathogen and host that regulates the course of the infection. Further, we present numerical data for Pfam domain families that highlights interaction specificity on the domain level. Not every instance of a pair of domains, for which interaction evidence has been found in a few instances (i.e. structures), is likely to functionally interact. Our sorting approach scores candidates according to how "distant" they are in sequence space from known examples of DDIs (templates). Thus, it provides a natural way to deal with the heterogeneity in domain-level interactions. Our method represents a more informed application of local alignment to the sequence-based search for potential human-microbial interactions that uses available PPI data as a prior. Our approach is somewhat limited in its sensitivity by the restricted size and diversity of the template dataset, but, given the rapid accumulation of solved protein complex structures, its scope and utility are expected to keep steadily improving.
Guo, Hao-Bo; Ma, Yue; Tuskan, Gerald A.; ...
2018-01-01
The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here)more » from the protein distribution densities in the LD space defined by ln( L ) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct with each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Guo, Hao-Bo; Ma, Yue; Tuskan, Gerald A.
The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here)more » from the protein distribution densities in the LD space defined by ln( L ) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct with each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level.« less
Moreno, I M; Malpica, J M; Díaz-Pendón, J A; Moriones, E; Fraile, A; García-Arenal, F
2004-01-05
The genetic structure of the population of Watermelon mosaic virus (WMV) in Spain was analysed by the biological and molecular characterisation of isolates sampled from its main host plant, melon. The population was a highly homogeneous one, built of a single pathotype, and comprising isolates closely related genetically. There was indication of temporal replacement of genotypes, but not of spatial structure of the population. Analyses of nucleotide sequences in three genomic regions, that is, in the cistrons for the P1, cylindrical inclusion (CI) and capsid (CP) proteins, showed lower similar values of nucleotide diversity for the P1 than for the CI or CP cistrons. The CI protein and the CP were under tighter evolutionary constraints than the P1 protein. Also, for the CI and CP cistrons, but not for the P1 cistron, two groups of sequences, defining two genetic strains, were apparent. Thus, different genomic regions of WMV show different evolutionary dynamics. Interestingly, for the CI and CP cistrons, sequences were clustered into two regions of the sequence space, defining the two strains above, and no intermediary sequences were identified. Recombinant isolates were found, accounting for at least 7% of the population. These recombinants presented two interesting features: (i) crossover points were detected between the analysed regions in the CI and CP cistrons, but not between those in the P1 and CI cistrons, (ii) crossover points were not observed within the analysed coding regions for the P1, CI or CP proteins. This indicates strong selection against isolates with recombinant proteins, even when originated from closely related strains. Hence, data indicate that genotypes of WMV, generated by mutation or recombination, outside of acceptable, discrete, regions in the evolutionary space, are eliminated from the virus population by negative selection.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Delfosse, Vanessa; Hugonnet, Jean-Emmanuel; Sougakoff, Wladimir
The crystallization of a hypothetical penicillin-binding protein from the archaeon P. abyssi in space group C2 by hanging-drop vapour diffusion is reported. The genome of the hyperthermophilic archaeon Pyrococcus abyssi contains a gene (pab0087) encoding a penicillin-binding protein (PBP) homologue. This sequence consists of 447 residues and shows significant sequence similarity to low-molecular-weight PBPs and class C β-lactamases. The Pab0087 protein was overexpressed, purified and crystallized. Diffraction data from two different crystal forms were collected to 2.7 and 2.0 Å resolution. Both crystals belong to space group C2, with unit-cell parameters a = 160.59, b = 135.74, c = 113.02more » Å, β = 117.36° and a = 166.97, b = 131.25, c = 189.39 Å, β = 113.81°, respectively. The asymmetric unit contains four and eight molecules, respectively, with fourfold non-crystallographic symmetry.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shevtsov, M. B.; Streeter, S. D.; Thresh, S.-J.
2015-02-01
The structure of the new class of controller proteins (exemplified by C.Csp231I) in complex with its 21 bp DNA-recognition sequence is presented, and the molecular basis of sequence recognition in this class of proteins is discussed. An unusual extended spacer between the dimer binding sites suggests a novel interaction between the two C-protein dimers. In a wide variety of bacterial restriction–modification systems, a regulatory ‘controller’ protein (or C-protein) is required for effective transcription of its own gene and for transcription of the endonuclease gene found on the same operon. We have recently turned our attention to a new class ofmore » controller proteins (exemplified by C.Csp231I) that have quite novel features, including a much larger DNA-binding site with an 18 bp (∼60 Å) spacer between the two palindromic DNA-binding sequences and a very different recognition sequence from the canonical GACT/AGTC. Using X-ray crystallography, the structure of the protein in complex with its 21 bp DNA-recognition sequence was solved to 1.8 Å resolution, and the molecular basis of sequence recognition in this class of proteins was elucidated. An unusual aspect of the promoter sequence is the extended spacer between the dimer binding sites, suggesting a novel interaction between the two C-protein dimers when bound to both recognition sites correctly spaced on the DNA. A U-bend model is proposed for this tetrameric complex, based on the results of gel-mobility assays, hydrodynamic analysis and the observation of key contacts at the interface between dimers in the crystal.« less
An estimated 5% of new protein structures solved today represent a new Pfam family
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mistry, Jaina; Kloppmann, Edda; Rost, Burkhard
2013-11-01
This study uses the Pfam database to show that the sequence redundancy of protein structures deposited in the PDB is increasing. The possible reasons behind this trend are discussed. High-resolution structural knowledge is key to understanding how proteins function at the molecular level. The number of entries in the Protein Data Bank (PDB), the repository of all publicly available protein structures, continues to increase, with more than 8000 structures released in 2012 alone. The authors of this article have studied how structural coverage of the protein-sequence space has changed over time by monitoring the number of Pfam families that acquiredmore » their first representative structure each year from 1976 to 2012. Twenty years ago, for every 100 new PDB entries released, an estimated 20 Pfam families acquired their first structure. By 2012, this decreased to only about five families per 100 structures. The reasons behind the slower pace at which previously uncharacterized families are being structurally covered were investigated. It was found that although more than 50% of current Pfam families are still without a structural representative, this set is enriched in families that are small, functionally uncharacterized or rich in problem features such as intrinsically disordered and transmembrane regions. While these are important constraints, the reasons why it may not yet be time to give up the pursuit of a targeted but more comprehensive structural coverage of the protein-sequence space are discussed.« less
Zhou, Jiyun; Xu, Ruifeng; He, Yulan; Lu, Qin; Wang, Hongpeng; Kong, Bing
2016-01-01
Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most of the existing computational approaches employed only the sequence context of the target residue for its prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove the redundancies in the feature space. Finally, a predictor (PDNAsite) was developed through the integration of the support vector machines (SVM) classifier and ensemble learning. Results on the PDNA-62 and the PDNA-224 datasets demonstrate that features extracted from spatial context provide more information than those from sequence context and the combination of them gives more performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that the interactions between binding sites next to each other are important for protein-DNA recognition and their binding ability. The comparison between our proposed PDNAsite method and the existing methods indicate that PDNAsite outperforms most of the existing methods and is a useful tool for DNA-binding site identification. A web-server of our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is made available for free public accessible to the biological research community. PMID:27282833
SEQATOMS: a web tool for identifying missing regions in PDB in sequence context.
Brandt, Bernd W; Heringa, Jaap; Leunissen, Jack A M
2008-07-01
With over 46 000 proteins, the Protein Data Bank (PDB) is the most important database with structural information of biological macromolecules. PDB files contain sequence and coordinate information. Residues present in the sequence can be absent from the coordinate section, which means their position in space is unknown. Similarity searches are routinely carried out against sequences taken from PDB SEQRES. However, there no distinction is made between residues that have a known or unknown position in the 3D protein structure. We present a FASTA sequence database that is produced by combining the sequence and coordinate information. All residues absent from the PDB coordinate section are masked with lower-case letters, thereby providing a view of these residues in the context of the entire protein sequence, which facilitates inspecting 'missing' regions. We also provide a masked version of the CATH domain database. A user-friendly BLAST interface is available for similarity searching. In contrast to standard (stand-alone) BLAST output, which only contains upper-case letters, our output retains the lower-case letters of the masked regions. Thus, our server can be used to perform BLAST searching case-sensitively. Here, we have applied it to the study of missing regions in their sequence context. SEQATOMS is available at http://www.bioinformatics.nl/tools/seqatoms/.
Hayat, Maqsood; Tahir, Muhammad
2015-08-01
Membrane protein is a central component of the cell that manages intra and extracellular processes. Membrane proteins execute a diversity of functions that are vital for the survival of organisms. The topology of transmembrane proteins describes the number of transmembrane (TM) helix segments and its orientation. However, owing to the lack of its recognized structures, the identification of TM helix and its topology through experimental methods is laborious with low throughput. In order to identify TM helix segments reliably, accurately, and effectively from topogenic sequences, we propose the PSOFuzzySVM-TMH model. In this model, evolutionary based information position specific scoring matrix and discrete based information 6-letter exchange group are used to formulate transmembrane protein sequences. The noisy and extraneous attributes are eradicated using an optimization selection technique, particle swarm optimization, from both feature spaces. Finally, the selected feature spaces are combined in order to form ensemble feature space. Fuzzy-support vector Machine is utilized as a classification algorithm. Two benchmark datasets, including low and high resolution datasets, are used. At various levels, the performance of the PSOFuzzySVM-TMH model is assessed through 10-fold cross validation test. The empirical results reveal that the proposed framework PSOFuzzySVM-TMH outperforms in terms of classification performance in the examined datasets. It is ascertained that the proposed model might be a useful and high throughput tool for academia and research community for further structure and functional studies on transmembrane proteins.
Automated sequence-specific protein NMR assignment using the memetic algorithm MATCH.
Volk, Jochen; Herrmann, Torsten; Wüthrich, Kurt
2008-07-01
MATCH (Memetic Algorithm and Combinatorial Optimization Heuristics) is a new memetic algorithm for automated sequence-specific polypeptide backbone NMR assignment of proteins. MATCH employs local optimization for tracing partial sequence-specific assignments within a global, population-based search environment, where the simultaneous application of local and global optimization heuristics guarantees high efficiency and robustness. MATCH thus makes combined use of the two predominant concepts in use for automated NMR assignment of proteins. Dynamic transition and inherent mutation are new techniques that enable automatic adaptation to variable quality of the experimental input data. The concept of dynamic transition is incorporated in all major building blocks of the algorithm, where it enables switching between local and global optimization heuristics at any time during the assignment process. Inherent mutation restricts the intrinsically required randomness of the evolutionary algorithm to those regions of the conformation space that are compatible with the experimental input data. Using intact and artificially deteriorated APSY-NMR input data of proteins, MATCH performed sequence-specific resonance assignment with high efficiency and robustness.
A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3
Dietmann, Sabine; Park, Jong; Notredame, Cedric; Heger, Andreas; Lappe, Michael; Holm, Liisa
2001-01-01
The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank (PDB). The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities. Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to: (i) supersecondary structural motifs (attractors in fold space), (ii) the topology of globular domains (fold types), (iii) remote homologues (functional families) and (iv) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new. In September 2000, the Dali classification contained 10 531 PDB entries comprising 17 101 chains, which were partitioned into five attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99 582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures several-fold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and a comprehensive library of explicit multiple alignments of distantly related protein families. PMID:11125048
Du, Yushen; Wu, Nicholas C.; Jiang, Lin; Zhang, Tianhao; Gong, Danyang; Shu, Sara; Wu, Ting-Ting
2016-01-01
ABSTRACT Identification and annotation of functional residues are fundamental questions in protein sequence analysis. Sequence and structure conservation provides valuable information to tackle these questions. It is, however, limited by the incomplete sampling of sequence space in natural evolution. Moreover, proteins often have multiple functions, with overlapping sequences that present challenges to accurate annotation of the exact functions of individual residues by conservation-based methods. Using the influenza A virus PB1 protein as an example, we developed a method to systematically identify and annotate functional residues. We used saturation mutagenesis and high-throughput sequencing to measure the replication capacity of single nucleotide mutations across the entire PB1 protein. After predicting protein stability upon mutations, we identified functional PB1 residues that are essential for viral replication. To further annotate the functional residues important to the canonical or noncanonical functions of viral RNA-dependent RNA polymerase (vRdRp), we performed a homologous-structure analysis with 16 different vRdRp structures. We achieved high sensitivity in annotating the known canonical polymerase functional residues. Moreover, we identified a cluster of noncanonical functional residues located in the loop region of the PB1 β-ribbon. We further demonstrated that these residues were important for PB1 protein nuclear import through the interaction with Ran-binding protein 5. In summary, we developed a systematic and sensitive method to identify and annotate functional residues that are not restrained by sequence conservation. Importantly, this method is generally applicable to other proteins about which homologous-structure information is available. PMID:27803181
Automated design of degenerate codon libraries.
Mena, Marco A; Daugherty, Patrick S
2005-12-01
Degenerate codon libraries are frequently used in protein engineering and evolution studies but are often limited to targeting a small number of positions to adequately limit the search space. To mitigate this, codon degeneracy can be limited using heuristics or previous knowledge of the targeted positions. To automate design of libraries given a set of amino acid sequences, an algorithm (LibDesign) was developed that generates a set of possible degenerate codon libraries, their resulting size, and their score relative to a user-defined scoring function. A gene library of a specified size can then be constructed that is representative of the given amino acid distribution or that includes specific sequences or combinations thereof. LibDesign provides a new tool for automated design of high-quality protein libraries that more effectively harness existing sequence-structure information derived from multiple sequence alignment or computational protein design data.
2017-01-01
The exponentially increasing number of protein and nucleic acid sequences provides opportunities to discover novel enzymes, metabolic pathways, and metabolites/natural products, thereby adding to our knowledge of biochemistry and biology. The challenge has evolved from generating sequence information to mining the databases to integrating and leveraging the available information, i.e., the availability of “genomic enzymology” web tools. Web tools that allow identification of biosynthetic gene clusters are widely used by the natural products/synthetic biology community, thereby facilitating the discovery of novel natural products and the enzymes responsible for their biosynthesis. However, many novel enzymes with interesting mechanisms participate in uncharacterized small-molecule metabolic pathways; their discovery and functional characterization also can be accomplished by leveraging information in protein and nucleic acid databases. This Perspective focuses on two genomic enzymology web tools that assist the discovery novel metabolic pathways: (1) Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) for generating sequence similarity networks to visualize and analyze sequence–function space in protein families and (2) Enzyme Function Initiative-Genome Neighborhood Tool (EFI-GNT) for generating genome neighborhood networks to visualize and analyze the genome context in microbial and fungal genomes. Both tools have been adapted to other applications to facilitate target selection for enzyme discovery and functional characterization. As the natural products community has demonstrated, the enzymology community needs to embrace the essential role of web tools that allow the protein and genome sequence databases to be leveraged for novel insights into enzymological problems. PMID:28826221
Characterizing protein conformations by correlation analysis of coarse-grained contact matrices.
Lindsay, Richard J; Siess, Jan; Lohry, David P; McGee, Trevor S; Ritchie, Jordan S; Johnson, Quentin R; Shen, Tongye
2018-01-14
We have developed a method to capture the essential conformational dynamics of folded biopolymers using statistical analysis of coarse-grained segment-segment contacts. Previously, the residue-residue contact analysis of simulation trajectories was successfully applied to the detection of conformational switching motions in biomolecular complexes. However, the application to large protein systems (larger than 1000 amino acid residues) is challenging using the description of residue contacts. Also, the residue-based method cannot be used to compare proteins with different sequences. To expand the scope of the method, we have tested several coarse-graining schemes that group a collection of consecutive residues into a segment. The definition of these segments may be derived from structural and sequence information, while the interaction strength of the coarse-grained segment-segment contacts is a function of the residue-residue contacts. We then perform covariance calculations on these coarse-grained contact matrices. We monitored how well the principal components of the contact matrices is preserved using various rendering functions. The new method was demonstrated to assist the reduction of the degrees of freedom for describing the conformation space, and it potentially allows for the analysis of a system that is approximately tenfold larger compared with the corresponding residue contact-based method. This method can also render a family of similar proteins into the same conformational space, and thus can be used to compare the structures of proteins with different sequences.
Characterizing protein conformations by correlation analysis of coarse-grained contact matrices
NASA Astrophysics Data System (ADS)
Lindsay, Richard J.; Siess, Jan; Lohry, David P.; McGee, Trevor S.; Ritchie, Jordan S.; Johnson, Quentin R.; Shen, Tongye
2018-01-01
We have developed a method to capture the essential conformational dynamics of folded biopolymers using statistical analysis of coarse-grained segment-segment contacts. Previously, the residue-residue contact analysis of simulation trajectories was successfully applied to the detection of conformational switching motions in biomolecular complexes. However, the application to large protein systems (larger than 1000 amino acid residues) is challenging using the description of residue contacts. Also, the residue-based method cannot be used to compare proteins with different sequences. To expand the scope of the method, we have tested several coarse-graining schemes that group a collection of consecutive residues into a segment. The definition of these segments may be derived from structural and sequence information, while the interaction strength of the coarse-grained segment-segment contacts is a function of the residue-residue contacts. We then perform covariance calculations on these coarse-grained contact matrices. We monitored how well the principal components of the contact matrices is preserved using various rendering functions. The new method was demonstrated to assist the reduction of the degrees of freedom for describing the conformation space, and it potentially allows for the analysis of a system that is approximately tenfold larger compared with the corresponding residue contact-based method. This method can also render a family of similar proteins into the same conformational space, and thus can be used to compare the structures of proteins with different sequences.
From Ramachandran Maps to Tertiary Structures of Proteins.
DasGupta, Debarati; Kaushik, Rahul; Jayaram, B
2015-08-27
Sequence to structure of proteins is an unsolved problem. A possible coarse grained resolution to this entails specification of all the torsional (Φ, Ψ) angles along the backbone of the polypeptide chain. The Ramachandran map quite elegantly depicts the allowed conformational (Φ, Ψ) space of proteins which is still very large for the purposes of accurate structure generation. We have divided the allowed (Φ, Ψ) space in Ramachandran maps into 27 distinct conformations sufficient to regenerate a structure to within 5 Å from the native, at least for small proteins, thus reducing the structure prediction problem to a specification of an alphanumeric string, i.e., the amino acid sequence together with one of the 27 conformations preferred by each amino acid residue. This still theoretically results in 27(n) conformations for a protein comprising "n" amino acids. We then investigated the spatial correlations at the two-residue (dipeptide) and three-residue (tripeptide) levels in what may be described as higher order Ramachandran maps, with the premise that the allowed conformational space starts to shrink as we introduce neighborhood effects. We found, for instance, for a tripeptide which potentially can exist in any of the 27(3) "allowed" conformations, three-fourths of these conformations are redundant to the 95% confidence level, suggesting sequence context dependent preferred conformations. We then created a look-up table of preferred conformations at the tripeptide level and correlated them with energetically favorable conformations. We found in particular that Boltzmann probabilities calculated from van der Waals energies for each conformation of tripeptides correlate well with the observed populations in the structural database (the average correlation coefficient is ∼0.8). An alpha-numeric string and hence the tertiary structure can be generated for any sequence from the look-up table within minutes on a single processor and to a higher level of accuracy if secondary structure can be specified. We tested the methodology on 100 small proteins, and in 90% of the cases, a structure within 5 Å is recovered. We thus believe that the method presented here provides the missing link between Ramachandran maps and tertiary structures of proteins. A Web server to convert a tertiary structure to an alphanumeric string and to predict the tertiary structure from the sequence of a protein using the above methodology is created and made freely accessible at http://www.scfbio-iitd.res.in/software/proteomics/rm2ts.jsp.
De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy
Parmeggiani, Fabio; Velasco, D. Alejandro Fernandez; Höcker, Birte; Baker, David
2015-01-01
Despite efforts for over 25 years, de novo protein design has not succeeded in achieving the TIM-barrel fold. Here we describe the computational design of 4-fold symmetrical (β/α)8-barrels guided by geometrical and chemical principles. Experimental characterization of 33 designs revealed the importance of sidechain-backbone hydrogen bonding for defining the strand register between repeat units. The X-ray crystal structure of a designed thermostable 184-residue protein is nearly identical with the designed TIM-barrel model. PSI-BLAST searches do not identify sequence similarities to known TIM-barrel proteins, and sensitive profile-profile searches indicate that the design sequence is distant from other naturally occurring TIM-barrel superfamilies, suggesting that Nature has only sampled a subset of the sequence space available to the TIM-barrel fold. The ability to de novo design TIM-barrels opens new possibilities for custom-made enzymes. PMID:26595462
Exploration of sequence space as the basis of viral RNA genome segmentation.
Moreno, Elena; Ojosnegros, Samuel; García-Arriaza, Juan; Escarmís, Cristina; Domingo, Esteban; Perales, Celia
2014-05-06
The mechanisms of viral RNA genome segmentation are unknown. On extensive passage of foot-and-mouth disease virus in baby hamster kidney-21 cells, the virus accumulated multiple point mutations and underwent a transition akin to genome segmentation. The standard single RNA genome molecule was replaced by genomes harboring internal in-frame deletions affecting the L- or capsid-coding region. These genomes were infectious and killed cells by complementation. Here we show that the point mutations in the nonstructural protein-coding region (P2, P3) that accumulated in the standard genome before segmentation increased the relative fitness of the segmented version relative to the standard genome. Fitness increase was documented by intracellular expression of virus-coded proteins and infectious progeny production by RNAs with the internal deletions placed in the sequence context of the parental and evolved genome. The complementation activity involved several viral proteins, one of them being the leader proteinase L. Thus, a history of genetic drift with accumulation of point mutations was needed to allow a major variation in the structure of a viral genome. Thus, exploration of sequence space by a viral genome (in this case an unsegmented RNA) can reach a point of the space in which a totally different genome structure (in this case, a segmented RNA) is favored over the form that performed the exploration.
An Evolution-Based Approach to De Novo Protein Design and Case Study on Mycobacterium tuberculosis
Brender, Jeffrey R.; Czajka, Jeff; Marsh, David; Gray, Felicia; Cierpicki, Tomasz; Zhang, Yang
2013-01-01
Computational protein design is a reverse procedure of protein folding and structure prediction, where constructing structures from evolutionarily related proteins has been demonstrated to be the most reliable method for protein 3-dimensional structure prediction. Following this spirit, we developed a novel method to design new protein sequences based on evolutionarily related protein families. For a given target structure, a set of proteins having similar fold are identified from the PDB library by structural alignments. A structural profile is then constructed from the protein templates and used to guide the conformational search of amino acid sequence space, where physicochemical packing is accommodated by single-sequence based solvation, torsion angle, and secondary structure predictions. The method was tested on a computational folding experiment based on a large set of 87 protein structures covering different fold classes, which showed that the evolution-based design significantly enhances the foldability and biological functionality of the designed sequences compared to the traditional physics-based force field methods. Without using homologous proteins, the designed sequences can be folded with an average root-mean-square-deviation of 2.1 Å to the target. As a case study, the method is extended to redesign all 243 structurally resolved proteins in the pathogenic bacteria Mycobacterium tuberculosis, which is the second leading cause of death from infectious disease. On a smaller scale, five sequences were randomly selected from the design pool and subjected to experimental validation. The results showed that all the designed proteins are soluble with distinct secondary structure and three have well ordered tertiary structure, as demonstrated by circular dichroism and NMR spectroscopy. Together, these results demonstrate a new avenue in computational protein design that uses knowledge of evolutionary conservation from protein structural families to engineer new protein molecules of improved fold stability and biological functionality. PMID:24204234
Newman, Laura E; Schiavon, Cara; Kahn, Richard A
2016-01-01
We describe the construction and uses of a series of plasmids for directing expression to varied levels of exogenous proteins targeted to the mitochondrial matrix or intermembrane space. We found that the level of protein expression achieved, the kinetics of expression and mitochondrial import, and half-life after import can each vary with the protein examined. These factors should be considered when directing localization of an exogenous protein to mitochondria for rescue, proteomics, or other approaches. We describe the construction of a collection of plasmids for varied expression of proteins targeted to the mitochondrial matrix or intermembrane space, using previously defined targeting sequences and strength CMV promoters. The limited size of these compartments makes them particularly vulnerable to artifacts from over-expression. We found that different proteins display different kinetics of expression and import that should be considered when analyzing results from this approach. Finally, this collection of plasmids has been deposited in the Addgene plasmid repository to facilitate the ready access and use of these tools.
Adaptive compressive learning for prediction of protein-protein interactions from primary sequence.
Zhang, Ya-Nan; Pan, Xiao-Yong; Huang, Yan; Shen, Hong-Bin
2011-08-21
Protein-protein interactions (PPIs) play an important role in biological processes. Although much effort has been devoted to the identification of novel PPIs by integrating experimental biological knowledge, there are still many difficulties because of lacking enough protein structural and functional information. It is highly desired to develop methods based only on amino acid sequences for predicting PPIs. However, sequence-based predictors are often struggling with the high-dimensionality causing over-fitting and high computational complexity problems, as well as the redundancy of sequential feature vectors. In this paper, a novel computational approach based on compressed sensing theory is proposed to predict yeast Saccharomyces cerevisiae PPIs from primary sequence and has achieved promising results. The key advantage of the proposed compressed sensing algorithm is that it can compress the original high-dimensional protein sequential feature vector into a much lower but more condensed space taking the sparsity property of the original signal into account. What makes compressed sensing much more attractive in protein sequence analysis is its compressed signal can be reconstructed from far fewer measurements than what is usually considered necessary in traditional Nyquist sampling theory. Experimental results demonstrate that proposed compressed sensing method is powerful for analyzing noisy biological data and reducing redundancy in feature vectors. The proposed method represents a new strategy of dealing with high-dimensional protein discrete model and has great potentiality to be extended to deal with many other complicated biological systems. Copyright © 2011 Elsevier Ltd. All rights reserved.
1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life
Mukherjee, Supratim; Seshadri, Rekha; Varghese, Neha J.; ...
2017-06-12
We present 1,003 reference genomes that were sequenced as part of the Genomic Encyclopedia of Bacteria and Archaea (GEBA) initiative, selected to maximize sequence coverage of phylogenetic space. These genomes double the number of existing type strains and expand their overall phylogenetic diversity by 25%. Comparative analyses with previously available finished and draft genomes reveal a 10.5% increase in novel protein families as a function of phylogenetic diversity. The GEBA genomes recruit 25 million previously unassigned metagenomic proteins from 4,650 samples, improving their phylogenetic and functional interpretation. We identify numerous biosynthetic clusters and experimentally validate a divergent phenazine cluster withmore » potential new chemical structure and antimicrobial activity. This Resource is the largest single release of reference genomes to date. Bacterial and archaeal isolate sequence space is still far from saturated, and future endeavors in this direction will continue to be a valuable resource for scientific discovery.« less
1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mukherjee, Supratim; Seshadri, Rekha; Varghese, Neha J.
We present 1,003 reference genomes that were sequenced as part of the Genomic Encyclopedia of Bacteria and Archaea (GEBA) initiative, selected to maximize sequence coverage of phylogenetic space. These genomes double the number of existing type strains and expand their overall phylogenetic diversity by 25%. Comparative analyses with previously available finished and draft genomes reveal a 10.5% increase in novel protein families as a function of phylogenetic diversity. The GEBA genomes recruit 25 million previously unassigned metagenomic proteins from 4,650 samples, improving their phylogenetic and functional interpretation. We identify numerous biosynthetic clusters and experimentally validate a divergent phenazine cluster withmore » potential new chemical structure and antimicrobial activity. This Resource is the largest single release of reference genomes to date. Bacterial and archaeal isolate sequence space is still far from saturated, and future endeavors in this direction will continue to be a valuable resource for scientific discovery.« less
Centromeres: long intergenic spaces with adaptive features.
Kanizay, Lisa; Dawe, R Kelly
2009-08-01
Centromeres are composed of inner kinetochore proteins, which are largely conserved across species, and repetitive DNA, which shows comparatively little sequence conservation. Due to this fundamental paradox the formation and maintenance of centromeres remains largely a mystery. However, it has become increasingly clear that a long-standing balance between epigenetic and genetic control governs the interactions of centromeric DNA and inner kinetochore proteins. The comparison of classical neocentromeres in plants, which are entirely genetic in their mode of operation, and clinical neocentromeres, which are sequence-independent, illustrates the conflict between genetics and epigenetics in regions that control their own transmission to progeny. Tandem repeat arrays present in centromeres may have an origin in meiotic drive or other selfish patterns of evolution, as is the case for the CENP-B box and CENP-B protein in human. In grasses retrotransposons have invaded centromeres to the point of complete domination, consequently breaking genetic regulation at these centromeres. The accumulation of tandem repeats and transposons causes centromeres to expand in size, effectively pushing genes to the sides and opening the centromere to ever fewer constraints on the DNA sequence. On genetic maps centromeres appear as long intergenic spaces that evolve rapidly and apparently without regard to host fitness.
Exploring Fold Space Preferences of New-born and Ancient Protein Superfamilies
Edwards, Hannah; Abeln, Sanne; Deane, Charlotte M.
2013-01-01
The evolution of proteins is one of the fundamental processes that has delivered the diversity and complexity of life we see around ourselves today. While we tend to define protein evolution in terms of sequence level mutations, insertions and deletions, it is hard to translate these processes to a more complete picture incorporating a polypeptide's structure and function. By considering how protein structures change over time we can gain an entirely new appreciation of their long-term evolutionary dynamics. In this work we seek to identify how populations of proteins at different stages of evolution explore their possible structure space. We use an annotation of superfamily age to this space and explore the relationship between these ages and a diverse set of properties pertaining to a superfamily's sequence, structure and function. We note several marked differences between the populations of newly evolved and ancient structures, such as in their length distributions, secondary structure content and tertiary packing arrangements. In particular, many of these differences suggest a less elaborate structure for newly evolved superfamilies when compared with their ancient counterparts. We show that the structural preferences we report are not a residual effect of a more fundamental relationship with function. Furthermore, we demonstrate the robustness of our results, using significant variation in the algorithm used to estimate the ages. We present these age estimates as a useful tool to analyse protein populations. In particularly, we apply this in a comparison of domains containing greek key or jelly roll motifs. PMID:24244135
Modularity of Protein Folds as a Tool for Template-Free Modeling of Structures.
Vallat, Brinda; Madrid-Aliste, Carlos; Fiser, Andras
2015-08-01
Predicting the three-dimensional structure of proteins from their amino acid sequences remains a challenging problem in molecular biology. While the current structural coverage of proteins is almost exclusively provided by template-based techniques, the modeling of the rest of the protein sequences increasingly require template-free methods. However, template-free modeling methods are much less reliable and are usually applicable for smaller proteins, leaving much space for improvement. We present here a novel computational method that uses a library of supersecondary structure fragments, known as Smotifs, to model protein structures. The library of Smotifs has saturated over time, providing a theoretical foundation for efficient modeling. The method relies on weak sequence signals from remotely related protein structures to create a library of Smotif fragments specific to the target protein sequence. This Smotif library is exploited in a fragment assembly protocol to sample decoys, which are assessed by a composite scoring function. Since the Smotif fragments are larger in size compared to the ones used in other fragment-based methods, the proposed modeling algorithm, SmotifTF, can employ an exhaustive sampling during decoy assembly. SmotifTF successfully predicts the overall fold of the target proteins in about 50% of the test cases and performs competitively when compared to other state of the art prediction methods, especially when sequence signal to remote homologs is diminishing. Smotif-based modeling is complementary to current prediction methods and provides a promising direction in addressing the structure prediction problem, especially when targeting larger proteins for modeling.
Hao, Xiaohu; Zhang, Guijun; Zhou, Xiaogen
2018-04-01
Computing conformations which are essential to associate structural and functional information with gene sequences, is challenging due to the high dimensionality and rugged energy surface of the protein conformational space. Consequently, the dimension of the protein conformational space should be reduced to a proper level, and an effective exploring algorithm should be proposed. In this paper, a plug-in method for guiding exploration in conformational feature space with Lipschitz underestimation (LUE) for ab-initio protein structure prediction is proposed. The conformational space is converted into ultrafast shape recognition (USR) feature space firstly. Based on the USR feature space, the conformational space can be further converted into Underestimation space according to Lipschitz estimation theory for guiding exploration. As a consequence of the use of underestimation model, the tight lower bound estimate information can be used for exploration guidance, the invalid sampling areas can be eliminated in advance, and the number of energy function evaluations can be reduced. The proposed method provides a novel technique to solve the exploring problem of protein conformational space. LUE is applied to differential evolution (DE) algorithm, and metropolis Monte Carlo(MMC) algorithm which is available in the Rosetta; When LUE is applied to DE and MMC, it will be screened by the underestimation method prior to energy calculation and selection. Further, LUE is compared with DE and MMC by testing on 15 small-to-medium structurally diverse proteins. Test results show that near-native protein structures with higher accuracy can be obtained more rapidly and efficiently with the use of LUE. Copyright © 2018 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Cheng, Ryan; Morcos, Faruck; Levine, Herbert; Onuchic, Jose
2014-03-01
An important challenge in biology is to distinguish the subset of residues that allow bacterial two-component signaling (TCS) proteins to preferentially interact with their correct TCS partner such that they can bind and transfer signal. Detailed knowledge of this information would allow one to search sequence-space for mutations that can systematically tune the signal transmission between TCS partners as well as re-encode a TCS protein to preferentially transfer signals to a non-partner. Motivated by the notion that this detailed information is found in sequence data, we explore the mutual sequence co-evolution between signaling partners to infer how mutations can positively or negatively alter their interaction. Using Direct Coupling Analysis (DCA) for determining evolutionarily conserved interprotein interactions, we apply a DCA-based metric to quantify mutational changes in the interaction between TCS proteins and demonstrate that it accurately correlates with experimental mutagenesis studies probing the mutational change in the in vitro phosphotransfer. Our methodology serves as a potential framework for the rational design of TCS systems as well as a framework for the system-level study of protein-protein interactions in sequence-rich systems. This research has been supported by the NSF INSPIRE award MCB-1241332 and by the CTBP sponsored by the NSF (Grant PHY-1308264).
Clausen, Rudy; Ma, Buyong; Nussinov, Ruth; Shehu, Amarda
2015-01-01
An important goal in molecular biology is to understand functional changes upon single-point mutations in proteins. Doing so through a detailed characterization of structure spaces and underlying energy landscapes is desirable but continues to challenge methods based on Molecular Dynamics. In this paper we propose a novel algorithm, SIfTER, which is based instead on stochastic optimization to circumvent the computational challenge of exploring the breadth of a protein’s structure space. SIfTER is a data-driven evolutionary algorithm, leveraging experimentally-available structures of wildtype and variant sequences of a protein to define a reduced search space from where to efficiently draw samples corresponding to novel structures not directly observed in the wet laboratory. The main advantage of SIfTER is its ability to rapidly generate conformational ensembles, thus allowing mapping and juxtaposing landscapes of variant sequences and relating observed differences to functional changes. We apply SIfTER to variant sequences of the H-Ras catalytic domain, due to the prominent role of the Ras protein in signaling pathways that control cell proliferation, its well-studied conformational switching, and abundance of documented mutations in several human tumors. Many Ras mutations are oncogenic, but detailed energy landscapes have not been reported until now. Analysis of SIfTER-computed energy landscapes for the wildtype and two oncogenic variants, G12V and Q61L, suggests that these mutations cause constitutive activation through two different mechanisms. G12V directly affects binding specificity while leaving the energy landscape largely unchanged, whereas Q61L has pronounced, starker effects on the landscape. An implementation of SIfTER is made available at http://www.cs.gmu.edu/~ashehu/?q=OurTools. We believe SIfTER is useful to the community to answer the question of how sequence mutations affect the function of a protein, when there is an abundance of experimental structures that can be exploited to reconstruct an energy landscape that would be computationally impractical to do via Molecular Dynamics. PMID:26325505
Rotondi, Kenneth S; Gierasch, Lila M
2003-07-08
The experiments described here explore the role of local sequence in the folding of cellular retinoic acid binding protein I (CRABP I). This is a 136-residue, 10-stranded, antiparallel beta-barrel protein with seven beta-hairpins and is a member of the intracellular lipid binding protein (iLBP) family. The relative roles of local and global sequence information in governing the folding of this class of proteins are not well-understood. In question is whether the beta-turns are locally defined by short-range interactions within their sequences, and are thus able to play an active role in reducing the conformational space available to the folding chain, or whether the turns are passive, relying upon global forces to form. Short (six- and seven-residue) peptides corresponding to the seven CRABP I turns were analyzed by circular dichroism and NMR for their tendencies to take up the conformations they adopt in the context of the native protein. The results indicate that two of the peptides, encompassing turns III and IV in CRABP I, have a strong intrinsic bias to form native turns. Intriguingly, these turns are on linked hairpins in CRABP I and represent the best-conserved turns in the iLBP family. These results suggest that local sequence may play an important role in narrowing the conformational ensemble of CRABP I during folding.
Structural analysis of a set of proteins resulting from a bacterial genomics project.
Badger, J; Sauder, J M; Adams, J M; Antonysamy, S; Bain, K; Bergseid, M G; Buchanan, S G; Buchanan, M D; Batiyenko, Y; Christopher, J A; Emtage, S; Eroshkina, A; Feil, I; Furlong, E B; Gajiwala, K S; Gao, X; He, D; Hendle, J; Huber, A; Hoda, K; Kearins, P; Kissinger, C; Laubert, B; Lewis, H A; Lin, J; Loomis, K; Lorimer, D; Louie, G; Maletic, M; Marsh, C D; Miller, I; Molinari, J; Muller-Dieckmann, H J; Newman, J M; Noland, B W; Pagarigan, B; Park, F; Peat, T S; Post, K W; Radojicic, S; Ramos, A; Romero, R; Rutter, M E; Sanderson, W E; Schwinn, K D; Tresser, J; Winhoven, J; Wright, T A; Wu, L; Xu, J; Harris, T J R
2005-09-01
The targets of the Structural GenomiX (SGX) bacterial genomics project were proteins conserved in multiple prokaryotic organisms with no obvious sequence homolog in the Protein Data Bank of known structures. The outcome of this work was 80 structures, covering 60 unique sequences and 49 different genes. Experimental phase determination from proteins incorporating Se-Met was carried out for 45 structures with most of the remainder solved by molecular replacement using members of the experimentally phased set as search models. An automated tool was developed to deposit these structures in the Protein Data Bank, along with the associated X-ray diffraction data (including refined experimental phases) and experimentally confirmed sequences. BLAST comparisons of the SGX structures with structures that had appeared in the Protein Data Bank over the intervening 3.5 years since the SGX target list had been compiled identified homologs for 49 of the 60 unique sequences represented by the SGX structures. This result indicates that, for bacterial structures that are relatively easy to express, purify, and crystallize, the structural coverage of gene space is proceeding rapidly. More distant sequence-structure relationships between the SGX and PDB structures were investigated using PDB-BLAST and Combinatorial Extension (CE). Only one structure, SufD, has a truly unique topology compared to all folds in the PDB. Copyright 2005 Wiley-Liss, Inc.
NASA Astrophysics Data System (ADS)
El-Assaad, Atlal; Dawy, Zaher; Nemer, Georges; Kobeissy, Firas
2017-01-01
The crucial biological role of proteases has been visible with the development of degradomics discipline involved in the determination of the proteases/substrates resulting in breakdown-products (BDPs) that can be utilized as putative biomarkers associated with different biological-clinical significance. In the field of cancer biology, matrix metalloproteinases (MMPs) have shown to result in MMPs-generated protein BDPs that are indicative of malignant growth in cancer, while in the field of neural injury, calpain-2 and caspase-3 proteases generate BDPs fragments that are indicative of different neural cell death mechanisms in different injury scenarios. Advanced proteomic techniques have shown a remarkable progress in identifying these BDPs experimentally. In this work, we present a bioinformatics-based prediction method that identifies protease-associated BDPs with high precision and efficiency. The method utilizes state-of-the-art sequence matching and alignment algorithms. It starts by locating consensus sequence occurrences and their variants in any set of protein substrates, generating all fragments resulting from cleavage. The complexity exists in space O(mn) as well as in O(Nmn) time, where N, m, and n are the number of protein sequences, length of the consensus sequence, and length per protein sequence, respectively. Finally, the proposed methodology is validated against βII-spectrin protein, a brain injury validated biomarker.
Schmidt Am Busch, Marcel; Sedano, Audrey; Simonson, Thomas
2010-05-05
Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases. WE EXPLORE THIS STRATEGY FOR FOUR SCOP FAMILIES: Small Kunitz-type inhibitors (SKIs), Interleukin-8 chemokines, PDZ domains, and large Caspase catalytic subunits, represented by 43 structures. An automated procedure is used to redesign the 43 proteins. We use the experimental backbones as fixed templates in the folded state and a molecular mechanics model to compute the interaction energies between sidechain and backbone groups. Calculations are done with the Proteins@Home volunteer computing platform. A heuristic algorithm is used to scan the sequence and conformational space, yielding 200,000-300,000 sequences per backbone template. The results confirm and generalize our earlier study of SH2 and SH3 domains. The designed sequences ressemble moderately-distant, natural homologues of the initial templates; e.g., the SUPERFAMILY, profile Hidden-Markov Model library recognizes 85% of the low-energy sequences as native-like. Conversely, Position Specific Scoring Matrices derived from the sequences can be used to detect natural homologues within the SwissProt database: 60% of known PDZ domains are detected and around 90% of known SKIs and chemokines. Energy components and inter-residue correlations are analyzed and ways to improve the method are discussed. For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.
Higgins, Matthew K; Carrington, Mark
2014-01-01
Trypanosoma and Plasmodium species are unicellular, eukaryotic pathogens that have evolved the capacity to survive and proliferate within a human host, causing sleeping sickness and malaria, respectively. They have very different survival strategies. African trypanosomes divide in blood and extracellular spaces, whereas Plasmodium species invade and proliferate within host cells. Interaction with host macromolecules is central to establishment and maintenance of an infection by both parasites. Proteins that mediate these interactions are under selection pressure to bind host ligands without compromising immune avoidance strategies. In both parasites, the expansion of genes encoding a small number of protein folds has established large protein families. This has permitted both diversification to form novel ligand binding sites and variation in sequence that contributes to avoidance of immune recognition. In this review we consider two such parasite surface protein families, one from each species. In each case, known structures demonstrate how extensive sequence variation around a conserved molecular architecture provides an adaptable protein scaffold that the parasites can mobilise to mediate interactions with their hosts. PMID:24442723
A Survey of Protein Structures from Archaeal Viruses
Dellas, Nikki; Lawrence, C. Martin; Young, Mark J.
2013-01-01
Viruses that infect the third domain of life, Archaea, are a newly emerging field of interest. To date, all characterized archaeal viruses infect archaea that thrive in extreme conditions, such as halophilic, hyperthermophilic, and methanogenic environments. Viruses in general, especially those replicating in extreme environments, contain highly mosaic genomes with open reading frames (ORFs) whose sequences are often dissimilar to all other known ORFs. It has been estimated that approximately 85% of virally encoded ORFs do not match known sequences in the nucleic acid databases, and this percentage is even higher for archaeal viruses (typically 90%–100%). This statistic suggests that either virus genomes represent a larger segment of sequence space and/or that viruses encode genes of novel fold and/or function. Because the overall three-dimensional fold of a protein evolves more slowly than its sequence, efforts have been geared toward structural characterization of proteins encoded by archaeal viruses in order to gain insight into their potential functions. In this short review, we provide multiple examples where structural characterization of archaeal viral proteins has indeed provided significant functional and evolutionary insight. PMID:25371334
Salt bridges: geometrically specific, designable interactions.
Donald, Jason E; Kulp, Daniel W; DeGrado, William F
2011-03-01
Salt bridges occur frequently in proteins, providing conformational specificity and contributing to molecular recognition and catalysis. We present a comprehensive analysis of these interactions in protein structures by surveying a large database of protein structures. Salt bridges between Asp or Glu and His, Arg, or Lys display extremely well-defined geometric preferences. Several previously observed preferences are confirmed, and others that were previously unrecognized are discovered. Salt bridges are explored for their preferences for different separations in sequence and in space, geometric preferences within proteins and at protein-protein interfaces, co-operativity in networked salt bridges, inclusion within metal-binding sites, preference for acidic electrons, apparent conformational side chain entropy reduction on formation, and degree of burial. Salt bridges occur far more frequently between residues at close than distant sequence separations, but, at close distances, there remain strong preferences for salt bridges at specific separations. Specific types of complex salt bridges, involving three or more members, are also discovered. As we observe a strong relationship between the propensity to form a salt bridge and the placement of salt-bridging residues in protein sequences, we discuss the role that salt bridges might play in kinetically influencing protein folding and thermodynamically stabilizing the native conformation. We also develop a quantitative method to select appropriate crystal structure resolution and B-factor cutoffs. Detailed knowledge of these geometric and sequence dependences should aid de novo design and prediction algorithms. Copyright © 2010 Wiley-Liss, Inc.
Sirius PSB: a generic system for analysis of biological sequences.
Koh, Chuan Hock; Lin, Sharene; Jedd, Gregory; Wong, Limsoon
2009-12-01
Computational tools are essential components of modern biological research. For example, BLAST searches can be used to identify related proteins based on sequence homology, or when a new genome is sequenced, prediction models can be used to annotate functional sites such as transcription start sites, translation initiation sites and polyadenylation sites and to predict protein localization. Here we present Sirius Prediction Systems Builder (PSB), a new computational tool for sequence analysis, classification and searching. Sirius PSB has four main operations: (1) Building a classifier, (2) Deploying a classifier, (3) Search for proteins similar to query proteins, (4) Preliminary and post-prediction analysis. Sirius PSB supports all these operations via a simple and interactive graphical user interface. Besides being a convenient tool, Sirius PSB has also introduced two novelties in sequence analysis. Firstly, genetic algorithm is used to identify interesting features in the feature space. Secondly, instead of the conventional method of searching for similar proteins via sequence similarity, we introduced searching via features' similarity. To demonstrate the capabilities of Sirius PSB, we have built two prediction models - one for the recognition of Arabidopsis polyadenylation sites and another for the subcellular localization of proteins. Both systems are competitive against current state-of-the-art models based on evaluation of public datasets. More notably, the time and effort required to build each model is greatly reduced with the assistance of Sirius PSB. Furthermore, we show that under certain conditions when BLAST is unable to find related proteins, Sirius PSB can identify functionally related proteins based on their biophysical similarities. Sirius PSB and its related supplements are available at: http://compbio.ddns.comp.nus.edu.sg/~sirius.
Extraordinary Structured Noncoding RNAs Revealed by Bacterial Metagenome Analysis
Weinberg, Zasha; Perreault, Jonathan; Meyer, Michelle M.; Breaker, Ronald R.
2012-01-01
Estimates of the total number of bacterial species1-3 suggest that existing DNA sequence databases carry only a tiny fraction of the total amount of DNA sequence space represented by this division of life. Indeed, environmental DNA samples have been shown to encode many previously unknown classes of proteins4 and RNAs5. Bioinformatics searches6-10 of genomic DNA from bacteria commonly identify novel noncoding RNAs (ncRNAs)10-12 such as riboswitches13,14. In rare instances, RNAs that exhibit more extensive sequence and structural conservation across a wide range of bacteria are encountered15,16. Given that large structured RNAs are known to carry out complex biochemical functions such as protein synthesis and RNA processing reactions, identifying more RNAs of great size and intricate structure is likely to reveal additional biochemical functions that can be achieved by RNA. We applied an updated computational pipeline17 to discover ncRNAs that rival the known large ribozymes in size and structural complexity or that are among the most abundant RNAs in bacteria that encode them. These RNAs would have been difficult or impossible to detect without examining environmental DNA sequences, suggesting that numerous RNAs with extraordinary size, structural complexity, or other exceptional characteristics remain to be discovered in unexplored sequence space. PMID:19956260
DOE Office of Scientific and Technical Information (OSTI.GOV)
Qureshi, Insaf A.; Sethi, Dhruv K.; Salunke, Dinakar M., E-mail: dinakar@nii.res.in
2006-09-01
A 24 kDa protein was purified from the seeds of L. sativus by ammonium sulfate fractionation and ion-exchange chromatography. Crystals were obtained by the hanging-drop vapour-diffusion method. A 24 kDa protein was purified from the seeds of Lathyrus sativus by ammonium sulfate fractionation and ion-exchange chromatography. The N-terminal amino-acid sequence showed significant homology with the 2S albumin class of seed storage proteins. The protein showed 85% sequence homology with the seed albumin of Pisum sativum within the 40 N-terminal residues. Crystals were obtained by the hanging-drop vapour-diffusion method. The crystals belonged to space group P2{sub 1}2{sub 1}2{sub 1}, with unit-cellmore » parameters a = 43.5, b = 82.7, c = 153.4 Å.« less
Gold nanoparticle capture within protein crystal scaffolds.
Kowalski, Ann E; Huber, Thaddaus R; Ni, Thomas W; Hartje, Luke F; Appel, Karina L; Yost, Jarad W; Ackerson, Christopher J; Snow, Christopher D
2016-07-07
DNA assemblies have been used to organize inorganic nanoparticles into 3D arrays, with emergent properties arising as a result of nanoparticle spacing and geometry. We report here the use of engineered protein crystals as an alternative approach to biologically mediated assembly of inorganic nanoparticles. The protein crystal's 13 nm diameter pores result in an 80% solvent content and display hexahistidine sequences on their interior. The hexahistidine sequence captures Au25(glutathione)∼17 (nitrilotriacetic acid)∼1 nanoclusters throughout a chemically crosslinked crystal via the coordination of Ni(ii) to both the cluster and the protein. Nanoparticle loading was validated by confocal microscopy and elemental analysis. The nanoparticles may be released from the crystal by exposure to EDTA, which chelates the Ni(ii) and breaks the specific protein/nanoparticle interaction. The integrity of the protein crystals after crosslinking and nanoparticle capture was confirmed by single crystal X-ray crystallography.
Hierarchy and extremes in selections from pools of randomized proteins
Boyer, Sébastien; Biswas, Dipanwita; Kumar Soshee, Ananda; Scaramozzino, Natale; Nizak, Clément; Rivoire, Olivier
2016-01-01
Variation and selection are the core principles of Darwinian evolution, but quantitatively relating the diversity of a population to its capacity to respond to selection is challenging. Here, we examine this problem at a molecular level in the context of populations of partially randomized proteins selected for binding to well-defined targets. We built several minimal protein libraries, screened them in vitro by phage display, and analyzed their response to selection by high-throughput sequencing. A statistical analysis of the results reveals two main findings. First, libraries with the same sequence diversity but built around different “frameworks” typically have vastly different responses; second, the distribution of responses of the best binders in a library follows a simple scaling law. We show how an elementary probabilistic model based on extreme value theory rationalizes the latter finding. Our results have implications for designing synthetic protein libraries, estimating the density of functional biomolecules in sequence space, characterizing diversity in natural populations, and experimentally investigating evolvability (i.e., the potential for future evolution). PMID:26969726
Hierarchy and extremes in selections from pools of randomized proteins.
Boyer, Sébastien; Biswas, Dipanwita; Kumar Soshee, Ananda; Scaramozzino, Natale; Nizak, Clément; Rivoire, Olivier
2016-03-29
Variation and selection are the core principles of Darwinian evolution, but quantitatively relating the diversity of a population to its capacity to respond to selection is challenging. Here, we examine this problem at a molecular level in the context of populations of partially randomized proteins selected for binding to well-defined targets. We built several minimal protein libraries, screened them in vitro by phage display, and analyzed their response to selection by high-throughput sequencing. A statistical analysis of the results reveals two main findings. First, libraries with the same sequence diversity but built around different "frameworks" typically have vastly different responses; second, the distribution of responses of the best binders in a library follows a simple scaling law. We show how an elementary probabilistic model based on extreme value theory rationalizes the latter finding. Our results have implications for designing synthetic protein libraries, estimating the density of functional biomolecules in sequence space, characterizing diversity in natural populations, and experimentally investigating evolvability (i.e., the potential for future evolution).
Evolution of Enzyme Superfamilies: Comprehensive Exploration of Sequence-Function Relationships.
Baier, F; Copp, J N; Tokuriki, N
2016-11-22
The sequence and functional diversity of enzyme superfamilies have expanded through billions of years of evolution from a common ancestor. Understanding how protein sequence and functional "space" have expanded, at both the evolutionary and molecular level, is central to biochemistry, molecular biology, and evolutionary biology. Integrative approaches that examine protein sequence, structure, and function have begun to provide comprehensive views of the functional diversity and evolutionary relationships within enzyme superfamilies. In this review, we outline the recent advances in our understanding of enzyme evolution and superfamily functional diversity. We describe the tools that have been used to comprehensively analyze sequence relationships and to characterize sequence and function relationships. We also highlight recent large-scale experimental approaches that systematically determine the activity profiles across enzyme superfamilies. We identify several intriguing insights from this recent body of work. First, promiscuous activities are prevalent among extant enzymes. Second, many divergent proteins retain "function connectivity" via enzyme promiscuity, which can be used to probe the evolutionary potential and history of enzyme superfamilies. Finally, we discuss open questions regarding the intricacies of enzyme divergence, as well as potential research directions that will deepen our understanding of enzyme superfamily evolution.
2014-01-01
Background Protein sequence similarities to any types of non-globular segments (coiled coils, low complexity regions, transmembrane regions, long loops, etc. where either positional sequence conservation is the result of a very simple, physically induced pattern or rather integral sequence properties are critical) are pertinent sources for mistaken homologies. Regretfully, these considerations regularly escape attention in large-scale annotation studies since, often, there is no substitute to manual handling of these cases. Quantitative criteria are required to suppress events of function annotation transfer as a result of false homology assignments. Results The sequence homology concept is based on the similarity comparison between the structural elements, the basic building blocks for conferring the overall fold of a protein. We propose to dissect the total similarity score into fold-critical and other, remaining contributions and suggest that, for a valid homology statement, the fold-relevant score contribution should at least be significant on its own. As part of the article, we provide the DissectHMMER software program for dissecting HMMER2/3 scores into segment-specific contributions. We show that DissectHMMER reproduces HMMER2/3 scores with sufficient accuracy and that it is useful in automated decisions about homology for instructive sequence examples. To generalize the dissection concept for cases without 3D structural information, we find that a dissection based on alignment quality is an appropriate surrogate. The approach was applied to a large-scale study of SMART and PFAM domains in the space of seed sequences and in the space of UniProt/SwissProt. Conclusions Sequence similarity core dissection with regard to fold-critical and other contributions systematically suppresses false hits and, additionally, recovers previously obscured homology relationships such as the one between aquaporins and formate/nitrite transporters that, so far, was only supported by structure comparison. PMID:24890864
An improved stochastic fractal search algorithm for 3D protein structure prediction.
Zhou, Changjun; Sun, Chuan; Wang, Bin; Wang, Xiaojun
2018-05-03
Protein structure prediction (PSP) is a significant area for biological information research, disease treatment, and drug development and so on. In this paper, three-dimensional structures of proteins are predicted based on the known amino acid sequences, and the structure prediction problem is transformed into a typical NP problem by an AB off-lattice model. This work applies a novel improved Stochastic Fractal Search algorithm (ISFS) to solve the problem. The Stochastic Fractal Search algorithm (SFS) is an effective evolutionary algorithm that performs well in exploring the search space but falls into local minimums sometimes. In order to avoid the weakness, Lvy flight and internal feedback information are introduced in ISFS. In the experimental process, simulations are conducted by ISFS algorithm on Fibonacci sequences and real peptide sequences. Experimental results prove that the ISFS performs more efficiently and robust in terms of finding the global minimum and avoiding getting stuck in local minimums.
Intriguing olfactory proteins from the yellow fever mosquito, Aedes aegypti
NASA Astrophysics Data System (ADS)
Ishida, Yuko; Chen, Angela M.; Tsuruda, Jennifer M.; Cornel, Anthon J.; Debboun, Mustapha; Leal, Walter S.
2004-09-01
Four antennae-specific proteins (AaegOBP1, AaegOBP2, AaegOBP3, and AaegASP1) were isolated from the yellow fever mosquito, Aedes aegypti and their full-length cDNAs were cloned. RT-PCR indicated that they are expressed in female and, to a lesser extent, in male antennae, but not in control tissues (legs). AaegOBP1 and AaegOBP3 showed significant similarity to previously identified mosquito odorant-binding proteins (OBPs) in cysteine spacing pattern and sequence. Two of the isolated proteins have a total of eight cysteine residues. The similarity of the spacing pattern of the cysteine residues and amino acid sequence to those of previously identified olfactory proteins suggests that one of the cysteine-rich proteins (AaegOBP2) is an OBP. The other (AaegASP1) did not belong to any group of known OBPs. Structural analyses indicate that six of the cysteine residues in AaegOBP2 are linked in a similar pattern to the previously known cysteine pairing in OBPs, i.e., Cys-24 Cys-55, Cys-51 Cys-104, Cys-95 Cys-113. The additional disulfide bridge, Cys-38 Cys-125, knits the extended C-terminal segment of the protein to a predicted α2-helix. As indicated by circular dichroism (CD) spectra, the extra rigidity seems to prevent the predicted formation of a C-terminal α-helix at low pH.
On the conservative nature of intragenic recombination
Drummond, D. Allan; Silberg, Jonathan J.; Meyer, Michelle M.; Wilke, Claus O.; Arnold, Frances H.
2005-01-01
Intragenic recombination rapidly creates protein sequence diversity compared with random mutation, but little is known about the relative effects of recombination and mutation on protein function. Here, we compare recombination of the distantly related β-lactamases PSE-4 and TEM-1 to mutation of PSE-4. We show that, among β-lactamase variants containing the same number of amino acid substitutions, variants created by recombination retain function with a significantly higher probability than those generated by random mutagenesis. We present a simple model that accurately captures the differing effects of mutation and recombination in real and simulated proteins with only four parameters: (i) the amino acid sequence distance between parents, (ii) the number of substitutions, (iii) the average probability that random substitutions will preserve function, and (iv) the average probability that substitutions generated by recombination will preserve function. Our results expose a fundamental functional enrichment in regions of protein sequence space accessible by recombination and provide a framework for evaluating whether the relative rates of mutation and recombination observed in nature reflect the underlying imbalance in their effects on protein function. PMID:15809422
On the conservative nature of intragenic recombination.
Drummond, D Allan; Silberg, Jonathan J; Meyer, Michelle M; Wilke, Claus O; Arnold, Frances H
2005-04-12
Intragenic recombination rapidly creates protein sequence diversity compared with random mutation, but little is known about the relative effects of recombination and mutation on protein function. Here, we compare recombination of the distantly related beta-lactamases PSE-4 and TEM-1 to mutation of PSE-4. We show that, among beta-lactamase variants containing the same number of amino acid substitutions, variants created by recombination retain function with a significantly higher probability than those generated by random mutagenesis. We present a simple model that accurately captures the differing effects of mutation and recombination in real and simulated proteins with only four parameters: (i) the amino acid sequence distance between parents, (ii) the number of substitutions, (iii) the average probability that random substitutions will preserve function, and (iv) the average probability that substitutions generated by recombination will preserve function. Our results expose a fundamental functional enrichment in regions of protein sequence space accessible by recombination and provide a framework for evaluating whether the relative rates of mutation and recombination observed in nature reflect the underlying imbalance in their effects on protein function.
Focareta, T; Manning, P A
1987-01-01
The gene encoding the extracellular DNase of Vibrio cholerae was cloned into Escherichia coli K-12. A maximal coding region of 1.2 kb and a minimal region of 0.6 kb were determined by transposon mutagenesis and deletion analysis. The nucleotide sequence of this region contained a single open reading frame of 690 bp corresponding to a protein of Mr 26,389 with a typical N-terminal signal sequence of 18 aa which, when removed, would give a mature protein of Mr 24,163. This is in good agreement with the size of 24 kDa, calculated directly by Coomassie blue staining following sodium dodecyl sulphate-polyacrylamide gel electrophoresis and indirectly via a DNA-hydrolysis assay. The protein is located in the periplasmic space of E. coli K-12 unlike in V. cholerae where it is excreted into the extracellular medium. The introduction of the DNase gene into a periplasmic (tolA) leaky mutant of E. coli K-12 facilitates the release of the protein, further confirming the periplasmic location.
Biophysical and structural considerations for protein sequence evolution
2011-01-01
Background Protein sequence evolution is constrained by the biophysics of folding and function, causing interdependence between interacting sites in the sequence. However, current site-independent models of sequence evolutions do not take this into account. Recent attempts to integrate the influence of structure and biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected improvements in model performance. This suggests that further innovations are needed for progress in this field. Results Here we develop a coarse-grained physics-based model of protein folding and binding function, and compare it to a popular informational model. We find that both models violate the assumption of the native sequence being close to a thermodynamic optimum, causing directional selection away from the native state. Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that vary less among residue type. The informational model diffuses further in sequence space with fewer barriers and tends to provide less support for an invariant sites model, although amino acid substitutions are generally conservative. Both approaches produce sequences with natural features like dN/dS < 1 and gamma-distributed rates across sites. Conclusions Simple coarse-grained models of protein folding can describe some natural features of evolving proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding energy, and binding function, as regards both native and non-native conformations, and describe a large number of possible applications for such a model. PMID:22171550
Error catastrophe and phase transition in the empirical fitness landscape of HIV
NASA Astrophysics Data System (ADS)
Hart, Gregory R.; Ferguson, Andrew L.
2015-03-01
We have translated clinical sequence databases of the p6 HIV protein into an empirical fitness landscape quantifying viral replicative capacity as a function of the amino acid sequence. We show that the viral population resides close to a phase transition in sequence space corresponding to an "error catastrophe" beyond which there is lethal accumulation of mutations. Our model predicts that the phase transition may be induced by drug therapies that elevate the mutation rate, or by forcing mutations at particular amino acids. Applying immune pressure to any combination of killer T-cell targets cannot induce the transition, providing a rationale for why the viral protein can exist close to the error catastrophe without sustaining fatal fitness penalties due to adaptive immunity.
Evolutionary profiles from the QR factorization of multiple sequence alignments
Sethi, Anurag; O'Donoghue, Patrick; Luthey-Schulten, Zaida
2005-01-01
We present an algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of the homologous group. The method, based on the multidimensional QR factorization of numerically encoded multiple sequence alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. We observe a general trend that these smaller, more evolutionarily balanced profiles have comparable and, in many cases, better performance in database searches than conventional profiles containing hundreds of sequences, constructed in an iterative and computationally intensive procedure. For more diverse families or superfamilies, with sequence identity <30%, structural alignments, based purely on the geometry of the protein structures, provide better alignments than pure sequence-based methods. Merging the structure and sequence information allows the construction of accurate profiles for distantly related groups. These structure-based profiles outperformed other sequence-based methods for finding distant homologs and were used to identify a putative class II cysteinyl-tRNA synthetase (CysRS) in several archaea that eluded previous annotation studies. Phylogenetic analysis showed the putative class II CysRSs to be a monophyletic group and homology modeling revealed a constellation of active site residues similar to that in the known class I CysRS. PMID:15741270
Kitahara, Kei; Kajiura, Akimasa; Sato, Neuza Satomi; Suzuki, Tsutomu
2007-01-01
Ribosomal protein L2 is a highly conserved primary 23S rRNA-binding protein. L2 specifically recognizes the internal bulge sequence in Helix 66 (H66) of 23S rRNA and is localized to the intersubunit space through formation of bridge B7b with 16S rRNA. The L2-binding site in H66 is highly conserved in prokaryotic ribosomes, whereas the corresponding site in eukaryotic ribosomes has evolved into distinct classes of sequences. We performed a systematic genetic selection of randomized rRNA sequences in Escherichia coli, and isolated 20 functional variants of the L2-binding site. The isolated variants consisted of eukaryotic sequences, in addition to prokaryotic sequences. These results suggest that L2/L8e does not recognize a specific base sequence of H66, but rather a characteristic architecture of H66. The growth phenotype of the isolated variants correlated well with their ability of subunit association. Upon continuous cultivation of a deleterious variant, we isolated two spontaneous mutations within domain IV of 23S rRNA that compensated for its weak subunit association, and alleviated its growth defect, implying that functional interactions between intersubunit bridges compensate ribosomal function. PMID:17553838
Hao, Xiao-Hu; Zhang, Gui-Jun; Zhou, Xiao-Gen; Yu, Xu-Feng
2016-01-01
To address the searching problem of protein conformational space in ab-initio protein structure prediction, a novel method using abstract convex underestimation (ACUE) based on the framework of evolutionary algorithm was proposed. Computing such conformations, essential to associate structural and functional information with gene sequences, is challenging due to the high-dimensionality and rugged energy surface of the protein conformational space. As a consequence, the dimension of protein conformational space should be reduced to a proper level. In this paper, the high-dimensionality original conformational space was converted into feature space whose dimension is considerably reduced by feature extraction technique. And, the underestimate space could be constructed according to abstract convex theory. Thus, the entropy effect caused by searching in the high-dimensionality conformational space could be avoided through such conversion. The tight lower bound estimate information was obtained to guide the searching direction, and the invalid searching area in which the global optimal solution is not located could be eliminated in advance. Moreover, instead of expensively calculating the energy of conformations in the original conformational space, the estimate value is employed to judge if the conformation is worth exploring to reduce the evaluation time, thereby making computational cost lower and the searching process more efficient. Additionally, fragment assembly and the Monte Carlo method are combined to generate a series of metastable conformations by sampling in the conformational space. The proposed method provides a novel technique to solve the searching problem of protein conformational space. Twenty small-to-medium structurally diverse proteins were tested, and the proposed ACUE method was compared with It Fix, HEA, Rosetta and the developed method LEDE without underestimate information. Test results show that the ACUE method can more rapidly and more efficiently obtain the near-native protein structure.
Comparative Genomics Reveals Chd1 as a Determinant of Nucleosome Spacing in Vivo.
Hughes, Amanda L; Rando, Oliver J
2015-07-14
Packaging of genomic DNA into nucleosomes is nearly universally conserved in eukaryotes, and many features of the nucleosome landscape are quite conserved. Nonetheless, quantitative aspects of nucleosome packaging differ between species because, for example, the average length of linker DNA between nucleosomes can differ significantly even between closely related species. We recently showed that the difference in nucleosome spacing between two Hemiascomycete species-Saccharomyces cerevisiae and Kluyveromyces lactis-is established by trans-acting factors rather than being encoded in cis in the DNA sequence. Here, we generated several S. cerevisiae strains in which endogenous copies of candidate nucleosome spacing factors are deleted and replaced with the orthologous factors from K. lactis. We find no change in nucleosome spacing in such strains in which H1 or Isw1 complexes are swapped. In contrast, the K. lactis gene encoding the ATP-dependent remodeler Chd1 was found to direct longer internucleosomal spacing in S. cerevisiae, establishing that this remodeler is partially responsible for the relatively long internucleosomal spacing observed in K. lactis. By analyzing several chimeric proteins, we find that sequence differences that contribute to the spacing activity of this remodeler are dispersed throughout the coding sequence, but that the strongest spacing effect is linked to the understudied N-terminal end of Chd1. Taken together, our data find a role for sequence evolution of a chromatin remodeler in establishing quantitative aspects of the chromatin landscape in a species-specific manner. Copyright © 2015 Hughes and Rando.
2014-01-01
Background The advent of human genome sequencing project has led to a spurt in the number of protein sequences in the databanks. Success of structure based drug discovery severely hinges on the availability of structures. Despite significant progresses in the area of experimental protein structure determination, the sequence-structure gap is continually widening. Data driven homology based computational methods have proved successful in predicting tertiary structures for sequences sharing medium to high sequence similarities. With dwindling similarities of query sequences, advanced homology/ ab initio hybrid approaches are being explored to solve structure prediction problem. Here we describe Bhageerath-H, a homology/ ab initio hybrid software/server for predicting protein tertiary structures with advancing drug design attempts as one of the goals. Results Bhageerath-H web-server was validated on 75 CASP10 targets which showed TM-scores ≥0.5 in 91% of the cases and Cα RMSDs ≤5Å from the native in 58% of the targets, which is well above the CASP10 water mark. Comparison with some leading servers demonstrated the uniqueness of the hybrid methodology in effectively sampling conformational space, scoring best decoys and refining low resolution models to high and medium resolution. Conclusion Bhageerath-H methodology is web enabled for the scientific community as a freely accessible web server. The methodology is fielded in the on-going CASP11 experiment. PMID:25521245
Ollikainen, Noah; de Jong, René M; Kortemme, Tanja
2015-01-01
Interactions between small molecules and proteins play critical roles in regulating and facilitating diverse biological functions, yet our ability to accurately re-engineer the specificity of these interactions using computational approaches has been limited. One main difficulty, in addition to inaccuracies in energy functions, is the exquisite sensitivity of protein-ligand interactions to subtle conformational changes, coupled with the computational problem of sampling the large conformational search space of degrees of freedom of ligands, amino acid side chains, and the protein backbone. Here, we describe two benchmarks for evaluating the accuracy of computational approaches for re-engineering protein-ligand interactions: (i) prediction of enzyme specificity altering mutations and (ii) prediction of sequence tolerance in ligand binding sites. After finding that current state-of-the-art "fixed backbone" design methods perform poorly on these tests, we develop a new "coupled moves" design method in the program Rosetta that couples changes to protein sequence with alterations in both protein side-chain and protein backbone conformations, and allows for changes in ligand rigid-body and torsion degrees of freedom. We show significantly increased accuracy in both predicting ligand specificity altering mutations and binding site sequences. These methodological improvements should be useful for many applications of protein-ligand design. The approach also provides insights into the role of subtle conformational adjustments that enable functional changes not only in engineering applications but also in natural protein evolution.
NASA Astrophysics Data System (ADS)
Sugimoto, Asuka; Sumi, Takuya; Kang, Jiyoung; Tateno, Masaru
2017-07-01
Recognition in biological macromolecular systems, such as DNA-protein recognition, is one of the most crucial problems to solve toward understanding the fundamental mechanisms of various biological processes. Since specific base sequences of genome DNA are discriminated by proteins, such as transcription factors (TFs), finding TF binding motifs (TFBMs) in whole genome DNA sequences is currently a central issue in interdisciplinary biophysical and information sciences. In the present study, a novel strategy to create a discriminant function for discrimination of TFBMs by constituting mathematical neural networks (NNs) is proposed, together with a method to determine the boundary of signals (TFBMs) and noise in the NN-score (output) space. This analysis also leads to the mathematical limitation of discrimination in the recognition of features representing TFBMs, in an information geometrical manifold. Thus, the present strategy enables the identification of the whole space of TFBMs, right up to the noise boundary.
NASA Astrophysics Data System (ADS)
Gopinath, T.; Nelson, Sarah E. D.; Veglia, Gianluigi
2017-12-01
Magic angle spinning (MAS) solid-state NMR (ssNMR) spectroscopy is emerging as a unique method for the atomic resolution structure determination of native membrane proteins in lipid bilayers. Although 13C-detected ssNMR experiments continue to play a major role, recent technological developments have made it possible to carry out 1H-detected experiments, boosting both sensitivity and resolution. Here, we describe a new set of 1H-detected hybrid pulse sequences that combine through-bond and through-space correlation elements into single experiments, enabling the simultaneous detection of rigid and dynamic domains of membrane proteins. As proof-of-principle, we applied these new pulse sequences to the membrane protein phospholamban (PLN) reconstituted in lipid bilayers under moderate MAS conditions. The cross-polarization (CP) based elements enabled the detection of the relatively immobile residues of PLN in the transmembrane domain using through-space correlations; whereas the most dynamic region, which is in equilibrium between folded and unfolded states, was mapped by through-bond INEPT-based elements. These new 1H-detected experiments will enable one to detect not only the most populated (ground) states of biomacromolecules, but also sparsely populated high-energy (excited) states for a complete characterization of protein free energy landscapes.
A probabilistic and continuous model of protein conformational space for template-free modeling.
Zhao, Feng; Peng, Jian; Debartolo, Joe; Freed, Karl F; Sosnick, Tobin R; Xu, Jinbo
2010-06-01
One of the major challenges with protein template-free modeling is an efficient sampling algorithm that can explore a huge conformation space quickly. The popular fragment assembly method constructs a conformation by stringing together short fragments extracted from the Protein Data Base (PDB). The discrete nature of this method may limit generated conformations to a subspace in which the native fold does not belong. Another worry is that a protein with really new fold may contain some fragments not in the PDB. This article presents a probabilistic model of protein conformational space to overcome the above two limitations. This probabilistic model employs directional statistics to model the distribution of backbone angles and 2(nd)-order Conditional Random Fields (CRFs) to describe sequence-angle relationship. Using this probabilistic model, we can sample protein conformations in a continuous space, as opposed to the widely used fragment assembly and lattice model methods that work in a discrete space. We show that when coupled with a simple energy function, this probabilistic method compares favorably with the fragment assembly method in the blind CASP8 evaluation, especially on alpha or small beta proteins. To our knowledge, this is the first probabilistic method that can search conformations in a continuous space and achieves favorable performance. Our method also generated three-dimensional (3D) models better than template-based methods for a couple of CASP8 hard targets. The method described in this article can also be applied to protein loop modeling, model refinement, and even RNA tertiary structure prediction.
2012-01-01
Background To discover a compound inhibiting multiple proteins (i.e. polypharmacological targets) is a new paradigm for the complex diseases (e.g. cancers and diabetes). In general, the polypharmacological proteins often share similar local binding environments and motifs. As the exponential growth of the number of protein structures, to find the similar structural binding motifs (pharma-motifs) is an emergency task for drug discovery (e.g. side effects and new uses for old drugs) and protein functions. Results We have developed a Space-Related Pharmamotifs (called SRPmotif) method to recognize the binding motifs by searching against protein structure database. SRPmotif is able to recognize conserved binding environments containing spatially discontinuous pharma-motifs which are often short conserved peptides with specific physico-chemical properties for protein functions. Among 356 pharma-motifs, 56.5% interacting residues are highly conserved. Experimental results indicate that 81.1% and 92.7% polypharmacological targets of each protein-ligand complex are annotated with same biological process (BP) and molecular function (MF) terms, respectively, based on Gene Ontology (GO). Our experimental results show that the identified pharma-motifs often consist of key residues in functional (active) sites and play the key roles for protein functions. The SRPmotif is available at http://gemdock.life.nctu.edu.tw/SRP/. Conclusions SRPmotif is able to identify similar pharma-interfaces and pharma-motifs sharing similar binding environments for polypharmacological targets by rapidly searching against the protein structure database. Pharma-motifs describe the conservations of binding environments for drug discovery and protein functions. Additionally, these pharma-motifs provide the clues for discovering new sequence-based motifs to predict protein functions from protein sequence databases. We believe that SRPmotif is useful for elucidating protein functions and drug discovery. PMID:23281852
Chiu, Yi-Yuan; Lin, Chun-Yu; Lin, Chih-Ta; Hsu, Kai-Cheng; Chang, Li-Zen; Yang, Jinn-Moon
2012-01-01
To discover a compound inhibiting multiple proteins (i.e. polypharmacological targets) is a new paradigm for the complex diseases (e.g. cancers and diabetes). In general, the polypharmacological proteins often share similar local binding environments and motifs. As the exponential growth of the number of protein structures, to find the similar structural binding motifs (pharma-motifs) is an emergency task for drug discovery (e.g. side effects and new uses for old drugs) and protein functions. We have developed a Space-Related Pharmamotifs (called SRPmotif) method to recognize the binding motifs by searching against protein structure database. SRPmotif is able to recognize conserved binding environments containing spatially discontinuous pharma-motifs which are often short conserved peptides with specific physico-chemical properties for protein functions. Among 356 pharma-motifs, 56.5% interacting residues are highly conserved. Experimental results indicate that 81.1% and 92.7% polypharmacological targets of each protein-ligand complex are annotated with same biological process (BP) and molecular function (MF) terms, respectively, based on Gene Ontology (GO). Our experimental results show that the identified pharma-motifs often consist of key residues in functional (active) sites and play the key roles for protein functions. The SRPmotif is available at http://gemdock.life.nctu.edu.tw/SRP/. SRPmotif is able to identify similar pharma-interfaces and pharma-motifs sharing similar binding environments for polypharmacological targets by rapidly searching against the protein structure database. Pharma-motifs describe the conservations of binding environments for drug discovery and protein functions. Additionally, these pharma-motifs provide the clues for discovering new sequence-based motifs to predict protein functions from protein sequence databases. We believe that SRPmotif is useful for elucidating protein functions and drug discovery.
Andrews, T Daniel; Gojobori, Takashi
2004-01-01
The PilE protein is the major component of the Neisseria meningitidis pilus, which is encoded by the pilE/pilS locus that includes an expressed gene and eight homologous silent fragments. The silent gene fragments have been shown to recombine through gene conversion with the expressed gene and thereby provide a means by which novel antigenic variants of the PilE protein can be generated. We have analyzed the evolutionary rate of the pilE gene using the nucleotide sequence of two complete pilE/pilS loci. The very high rate of evolution displayed by the PilE protein appears driven by both recombination and positive selection. Within the semivariable region of the pilE and pilS genes, recombination appears to occur within multiple small sequence blocks that lie between conserved sequence elements. Within the hypervariable region, positive selection was identified from comparison of the silent and expressed genes. The unusual gene conversion mechanism that operates at the pilE/pilS locus is a strategy employed by N. meningitidis to enhance mutation of certain regions of the PilE protein. The silent copies of the gene effectively allow "parallelized" evolution of pilE, thus enabling the encoded protein to rapidly explore a large area of sequence space in an effort to find novel antigenic variants.
Du, Yushen; Wu, Nicholas C; Jiang, Lin; Zhang, Tianhao; Gong, Danyang; Shu, Sara; Wu, Ting-Ting; Sun, Ren
2016-11-01
Identification and annotation of functional residues are fundamental questions in protein sequence analysis. Sequence and structure conservation provides valuable information to tackle these questions. It is, however, limited by the incomplete sampling of sequence space in natural evolution. Moreover, proteins often have multiple functions, with overlapping sequences that present challenges to accurate annotation of the exact functions of individual residues by conservation-based methods. Using the influenza A virus PB1 protein as an example, we developed a method to systematically identify and annotate functional residues. We used saturation mutagenesis and high-throughput sequencing to measure the replication capacity of single nucleotide mutations across the entire PB1 protein. After predicting protein stability upon mutations, we identified functional PB1 residues that are essential for viral replication. To further annotate the functional residues important to the canonical or noncanonical functions of viral RNA-dependent RNA polymerase (vRdRp), we performed a homologous-structure analysis with 16 different vRdRp structures. We achieved high sensitivity in annotating the known canonical polymerase functional residues. Moreover, we identified a cluster of noncanonical functional residues located in the loop region of the PB1 β-ribbon. We further demonstrated that these residues were important for PB1 protein nuclear import through the interaction with Ran-binding protein 5. In summary, we developed a systematic and sensitive method to identify and annotate functional residues that are not restrained by sequence conservation. Importantly, this method is generally applicable to other proteins about which homologous-structure information is available. To fully comprehend the diverse functions of a protein, it is essential to understand the functionality of individual residues. Current methods are highly dependent on evolutionary sequence conservation, which is usually limited by sampling size. Sequence conservation-based methods are further confounded by structural constraints and multifunctionality of proteins. Here we present a method that can systematically identify and annotate functional residues of a given protein. We used a high-throughput functional profiling platform to identify essential residues. Coupling it with homologous-structure comparison, we were able to annotate multiple functions of proteins. We demonstrated the method with the PB1 protein of influenza A virus and identified novel functional residues in addition to its canonical function as an RNA-dependent RNA polymerase. Not limited to virology, this method is generally applicable to other proteins that can be functionally selected and about which homologous-structure information is available. Copyright © 2016 Du et al.
Jo, Sunhwan; Lee, Hui Sun; Skolnick, Jeffrey; Im, Wonpil
2013-01-01
Understanding glycan structure and dynamics is central to understanding protein-carbohydrate recognition and its role in protein-protein interactions. Given the difficulties in obtaining the glycan's crystal structure in glycoconjugates due to its flexibility and heterogeneity, computational modeling could play an important role in providing glycosylated protein structure models. To address if glycan structures available in the PDB can be used as templates or fragments for glycan modeling, we present a survey of the N-glycan structures of 35 different sequences in the PDB. Our statistical analysis shows that the N-glycan structures found on homologous glycoproteins are significantly conserved compared to the random background, suggesting that N-glycan chains can be confidently modeled with template glycan structures whose parent glycoproteins share sequence similarity. On the other hand, N-glycan structures found on non-homologous glycoproteins do not show significant global structural similarity. Nonetheless, the internal substructures of these N-glycans, particularly, the substructures that are closer to the protein, show significantly similar structures, suggesting that such substructures can be used as fragments in glycan modeling. Increased interactions with protein might be responsible for the restricted conformational space of N-glycan chains. Our results suggest that structure prediction/modeling of N-glycans of glycoconjugates using structure database could be effective and different modeling approaches would be needed depending on the availability of template structures.
Restricted N-glycan Conformational Space in the PDB and Its Implication in Glycan Structure Modeling
Jo, Sunhwan; Lee, Hui Sun; Skolnick, Jeffrey; Im, Wonpil
2013-01-01
Understanding glycan structure and dynamics is central to understanding protein-carbohydrate recognition and its role in protein-protein interactions. Given the difficulties in obtaining the glycan's crystal structure in glycoconjugates due to its flexibility and heterogeneity, computational modeling could play an important role in providing glycosylated protein structure models. To address if glycan structures available in the PDB can be used as templates or fragments for glycan modeling, we present a survey of the N-glycan structures of 35 different sequences in the PDB. Our statistical analysis shows that the N-glycan structures found on homologous glycoproteins are significantly conserved compared to the random background, suggesting that N-glycan chains can be confidently modeled with template glycan structures whose parent glycoproteins share sequence similarity. On the other hand, N-glycan structures found on non-homologous glycoproteins do not show significant global structural similarity. Nonetheless, the internal substructures of these N-glycans, particularly, the substructures that are closer to the protein, show significantly similar structures, suggesting that such substructures can be used as fragments in glycan modeling. Increased interactions with protein might be responsible for the restricted conformational space of N-glycan chains. Our results suggest that structure prediction/modeling of N-glycans of glycoconjugates using structure database could be effective and different modeling approaches would be needed depending on the availability of template structures. PMID:23516343
Hensen, Ulf; Meyer, Tim; Haas, Jürgen; Rex, René; Vriend, Gert; Grubmüller, Helmut
2012-01-01
Proteins are usually described and classified according to amino acid sequence, structure or function. Here, we develop a minimally biased scheme to compare and classify proteins according to their internal mobility patterns. This approach is based on the notion that proteins not only fold into recurring structural motifs but might also be carrying out only a limited set of recurring mobility motifs. The complete set of these patterns, which we tentatively call the dynasome, spans a multi-dimensional space with axes, the dynasome descriptors, characterizing different aspects of protein dynamics. The unique dynamic fingerprint of each protein is represented as a vector in the dynasome space. The difference between any two vectors, consequently, gives a reliable measure of the difference between the corresponding protein dynamics. We characterize the properties of the dynasome by comparing the dynamics fingerprints obtained from molecular dynamics simulations of 112 proteins but our approach is, in principle, not restricted to any specific source of data of protein dynamics. We conclude that: 1. the dynasome consists of a continuum of proteins, rather than well separated classes. 2. For the majority of proteins we observe strong correlations between structure and dynamics. 3. Proteins with similar function carry out similar dynamics, which suggests a new method to improve protein function annotation based on protein dynamics. PMID:22606222
Brylinski, Michal; Konieczny, Leszek; Kononowicz, Andrzej; Roterman, Irena
2008-03-21
The well-known procedure implemented in ClustalW oriented on the sequence comparison was applied to structure comparison. The consensus sequence as well as consensus structure has been defined for proteins belonging to serpine family. The structure of early stage intermediate was the object for similarity search. The high values of W(sequence) appeared to be accordant with high values of W(structure) making possible structure comparison using common criteria for sequence and structure comparison. Since the early stage structural form has been created according to limited conformational sub-space which does not include the beta-structure (this structure is mediated by C7eq structural form), is particularly important to see, that the C7eq structural form may be treated as the seed for beta-structure present in the final native structure of protein. The applicability of ClustalW procedure to structure comparison makes these two comparisons unified.
Dictionary-driven protein annotation.
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-09-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
GenProBiS: web server for mapping of sequence variants to protein binding sites.
Konc, Janez; Skrlj, Blaz; Erzen, Nika; Kunej, Tanja; Janezic, Dusanka
2017-07-03
Discovery of potentially deleterious sequence variants is important and has wide implications for research and generation of new hypotheses in human and veterinary medicine, and drug discovery. The GenProBiS web server maps sequence variants to protein structures from the Protein Data Bank (PDB), and further to protein-protein, protein-nucleic acid, protein-compound, and protein-metal ion binding sites. The concept of a protein-compound binding site is understood in the broadest sense, which includes glycosylation and other post-translational modification sites. Binding sites were defined by local structural comparisons of whole protein structures using the Protein Binding Sites (ProBiS) algorithm and transposition of ligands from the similar binding sites found to the query protein using the ProBiS-ligands approach with new improvements introduced in GenProBiS. Binding site surfaces were generated as three-dimensional grids encompassing the space occupied by predicted ligands. The server allows intuitive visual exploration of comprehensively mapped variants, such as human somatic mis-sense mutations related to cancer and non-synonymous single nucleotide polymorphisms from 21 species, within the predicted binding sites regions for about 80 000 PDB protein structures using fast WebGL graphics. The GenProBiS web server is open and free to all users at http://genprobis.insilab.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
In-silico studies of neutral drift for functional protein interaction networks
NASA Astrophysics Data System (ADS)
Ali, Md Zulfikar; Wingreen, Ned S.; Mukhopadhyay, Ranjan
We have developed a minimal physically-motivated model of protein-protein interaction networks. Our system consists of two classes of enzymes, activators (e.g. kinases) and deactivators (e.g. phosphatases), and the enzyme-mediated activation/deactivation rates are determined by sequence-dependent binding strengths between enzymes and their targets. The network is evolved by introducing random point mutations in the binding sequences where we assume that each new mutation is either fixed or entirely lost. We apply this model to studies of neutral drift in networks that yield oscillatory dynamics, where we start, for example, with a relatively simple network and allow it to evolve by adding nodes and connections while requiring that dynamics be conserved. Our studies demonstrate both the importance of employing a sequence-based evolutionary scheme and the relative rapidity (in evolutionary time) for the redistribution of function over new nodes via neutral drift. Surprisingly, in addition to this redistribution time we discovered another much slower timescale for network evolution, reflecting hidden order in sequence space that we interpret in terms of sparsely connected domains.
Application of an E. coli signal sequence as a versatile inclusion body tag.
Jong, Wouter S P; Vikström, David; Houben, Diane; van den Berg van Saparoea, H Bart; de Gier, Jan-Willem; Luirink, Joen
2017-03-21
Heterologous protein production in Escherichia coli often suffers from bottlenecks such as proteolytic degradation, complex purification procedures and toxicity towards the expression host. Production of proteins in an insoluble form in inclusion bodies (IBs) can alleviate these problems. Unfortunately, the propensity of heterologous proteins to form IBs is variable and difficult to predict. Hence, fusing the target protein to an aggregation prone polypeptide or IB-tag is a useful strategy to produce difficult-to-express proteins in an insoluble form. When screening for signal sequences that mediate optimal targeting of heterologous proteins to the periplasmic space of E. coli, we observed that fusion to the 39 amino acid signal sequence of E. coli TorA (ssTorA) did not promote targeting but rather directed high-level expression of the human proteins hEGF, Pla2 and IL-3 in IBs. Further analysis revealed that ssTorA even mediated IB formation of the highly soluble endogenous E. coli proteins TrxA and MBP. The ssTorA also induced aggregation when fused to the C-terminus of target proteins and appeared functional as IB-tag in E. coli K-12 as well as B strains. An additive effect on IB-formation was observed upon fusion of multiple ssTorA sequences in tandem, provoking almost complete aggregation of TrxA and MBP. The ssTorA-moiety was successfully used to produce the intrinsically unstable hEGF and the toxic fusion partner SymE, demonstrating its applicability as an IB-tag for difficult-to-express and toxic proteins. We present proof-of-concept for the use of ssTorA as a small, versatile tag for robust E. coli-based expression of heterologous proteins in IBs.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gupta, Pankaj; Gaur, Vineet; Salunke, Dinakar M., E-mail: dinakar@nii.res.in
2008-08-01
A 2S albumin from L. culinaris was purified and crystallized and preliminary crystallographic studies were carried out. Lens culinaris (lentil) is a widely consumed high-protein-content leguminous crop. A 2S albumin protein (26.5 kDa) has been identified using NH{sub 2}-terminal sequencing from a 90% ammonium sulfate saturation fraction of total L. culinaris seed protein extract. The NH{sub 2}-terminal sequence shows very high homology to PA2, an allergy-related protein from Pisum sativum. The 2S albumin protein was purified using a combination of size-exclusion and ion-exchange chromatography. Crystals of the 2S seed albumin obtained using the hanging-drop vapour-diffusion method diffracted to 2.5 Åmore » resolution and were indexed in space group P4{sub 1} (or P4{sub 3}), with unit-cell parameters a = b = 78.6, c = 135.2 Å.« less
Alva, Vikram; Remmert, Michael; Biegert, Andreas; Lupas, Andrei N; Söding, Johannes
2010-01-01
Many protein classification systems capture homologous relationships by grouping domains into families and superfamilies on the basis of sequence similarity. Superfamilies with similar 3D structures are further grouped into folds. In the absence of discernable sequence similarity, these structural similarities were long thought to have originated independently, by convergent evolution. However, the growth of databases and advances in sequence comparison methods have led to the discovery of many distant evolutionary relationships that transcend the boundaries of superfamilies and folds. To investigate the contributions of convergent versus divergent evolution in the origin of protein folds, we clustered representative domains of known structure by their sequence similarity, treating them as point masses in a virtual 2D space which attract or repel each other depending on their pairwise sequence similarities. As expected, families in the same superfamily form tight clusters. But often, superfamilies of the same fold are linked with each other, suggesting that the entire fold evolved from an ancient prototype. Strikingly, some links connect superfamilies with different folds. They arise from modular peptide fragments of between 20 and 40 residues that co-occur in the connected folds in disparate structural contexts. These may be descendants of an ancestral pool of peptide modules that evolved as cofactors in the RNA world and from which the first folded proteins arose by amplification and recombination. Our galaxy of folds summarizes, in a single image, most known and many yet undescribed homologous relationships between protein superfamilies, providing new insights into the evolution of protein domains.
Export of FepA::PhoA fusion proteins to the outer membrane of Escherichia coli K-12.
Murphy, C K; Klebba, P E
1989-11-01
A library of fepA::phoA gene fusions was generated in order to study the structure and secretion of the Escherichia coli K-12 ferric enterobactin receptor, FepA. All of the fusion proteins contained various lengths of the amino-terminal portion of FepA fused in frame to the catalytic portion of bacterial alkaline phosphatase. Localization of FepA::PhoA fusion proteins in the cell envelope was dependent on the number of residues of mature FepA present at the amino terminus. Hybrids containing up to one-third of the amino-terminal portion of FepA fractionated with their periplasm, while those containing longer sequences of mature FepA were exported to the outer membrane. Outer membrane-localized fusion proteins expressed FepA sequences on the external face of the outer membrane and alkaline phosphatase moieties in the periplasmic space. From sequence determinations of the fepA::phoA fusion joints, residues within FepA which may be exposed on the periplasmic side of the outer membrane were identified.
Morgan, Alexander A.; Rubenstein, Edward
2013-01-01
Proline is an anomalous amino acid. Its nitrogen atom is covalently locked within a ring, thus it is the only proteinogenic amino acid with a constrained phi angle. Sequences of three consecutive prolines can fold into polyproline helices, structures that join alpha helices and beta pleats as architectural motifs in protein configuration. Triproline helices are participants in protein-protein signaling interactions. Longer spans of repeat prolines also occur, containing as many as 27 consecutive proline residues. Little is known about the frequency, positioning, and functional significance of these proline sequences. Therefore we have undertaken a systematic bioinformatics study of proline residues in proteins. We analyzed the distribution and frequency of 687,434 proline residues among 18,666 human proteins, identifying single residues, dimers, trimers, and longer repeats. Proline accounts for 6.3% of the 10,882,808 protein amino acids. Of all proline residues, 4.4% are in trimers or longer spans. We detected patterns that influence function based on proline location, spacing, and concentration. We propose a classification based on proline-rich, polyproline-rich, and proline-poor status. Whereas singlet proline residues are often found in proteins that display recurring architectural patterns, trimers or longer proline sequences tend be associated with the absence of repetitive structural motifs. Spans of 6 or more are associated with DNA/RNA processing, actin, and developmental processes. We also suggest a role for proline in Kruppel-type zinc finger protein control of DNA expression, and in the nucleation and translocation of actin by the formin complex. PMID:23372670
Using structure to explore the sequence alignment space of remote homologs.
Kuziemko, Andrew; Honig, Barry; Petrey, Donald
2011-10-01
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is "optimal" in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are "suboptimal" in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for "modelability", we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.
Sequence, Structure, and Context Preferences of Human RNA Binding Proteins.
Dominguez, Daniel; Freese, Peter; Alexis, Maria S; Su, Amanda; Hochman, Myles; Palden, Tsultrim; Bazile, Cassandra; Lambert, Nicole J; Van Nostrand, Eric L; Pratt, Gabriel A; Yeo, Gene W; Graveley, Brenton R; Burge, Christopher B
2018-06-07
RNA binding proteins (RBPs) orchestrate the production, processing, and function of mRNAs. Here, we present the affinity landscapes of 78 human RBPs using an unbiased assay that determines the sequence, structure, and context preferences of these proteins in vitro by deep sequencing of bound RNAs. These data enable construction of "RNA maps" of RBP activity without requiring crosslinking-based assays. We found an unexpectedly low diversity of RNA motifs, implying frequent convergence of binding specificity toward a relatively small set of RNA motifs, many with low compositional complexity. Offsetting this trend, however, we observed extensive preferences for contextual features distinct from short linear RNA motifs, including spaced "bipartite" motifs, biased flanking nucleotide composition, and bias away from or toward RNA structure. Our results emphasize the importance of contextual features in RNA recognition, which likely enable targeting of distinct subsets of transcripts by different RBPs that recognize the same linear motif. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.
Degenerative minimalism in the genome of a psyllid endosymbiont.
Clark, M A; Baumann, L; Thao, M L; Moran, N A; Baumann, P
2001-03-01
Psyllids, like aphids, feed on plant phloem sap and are obligately associated with prokaryotic endosymbionts acquired through vertical transmission from an ancestral infection. We have sequenced 37 kb of DNA of the genome of Carsonella ruddii, the endosymbiont of psyllids, and found that it has a number of unusual properties revealing a more extreme case of degeneration than was previously reported from studies of eubacterial genomes, including that of the aphid endosymbiont Buchnera aphidicola. Among the unusual properties are an exceptionally low guanine-plus-cytosine content (19.9%), almost complete absence of intergenic spaces, operon fusion, and lack of the usual promoter sequences upstream of 16S rDNA. These features suggest the synthesis of long mRNAs and translational coupling. The most extreme instances of base compositional bias occur in the genes encoding proteins that have less highly conserved amino acid sequences; the guanine-plus-cytosine content of some protein-coding sequences is as low as 10%. The shift in base composition has a large effect on proteins: in polypeptides of C. ruddii, half of the residues consist of five amino acids with codons low in guanine plus cytosine. Furthermore, the proteins of C. ruddii are reduced in size, with an average of about 9% fewer amino acids than in homologous proteins of related bacteria. These observations suggest that the C. ruddii genome is not subject to constraints that limit the evolution of other known eubacteria.
Yamamoto, T; Okawa, N; Endo, T; Kaji, A
1991-08-01
The ras gene was fused with the DNA sequence of OmpF signal peptide or with the DNA sequence of OmpF signal peptide plus the amino terminal portion of the OmpF gene. They were placed in plasmids together with the bacteriophage lambda PL promoter. These plasmids were introduced into Escherichia coli strain K-12 and the OmpF signal peptide fusion proteins were expressed. These fusion proteins were identified as 29.0 and 30.0 kDa proteins. However, processed products of these proteins were not found in the extract. The fusion proteins were localized mostly in the cytoplasm and the inner membrane, but none of them was secreted into the periplasmic space. On the other hand, the ras protein alone was found in the cytoplasm and not in the inner membrane. Viable counts of E. coli harbouring these plasmids decreased when these fused proteins were induced. Induction of the ras protein alone did not harm cells. These observations suggest that insertion of the heterologous proteins into the inner membrane may cause the bactericidal effect.
Currin, Andrew; Swainston, Neil; Day, Philip J.
2015-01-01
The amino acid sequence of a protein affects both its structure and its function. Thus, the ability to modify the sequence, and hence the structure and activity, of individual proteins in a systematic way, opens up many opportunities, both scientifically and (as we focus on here) for exploitation in biocatalysis. Modern methods of synthetic biology, whereby increasingly large sequences of DNA can be synthesised de novo, allow an unprecedented ability to engineer proteins with novel functions. However, the number of possible proteins is far too large to test individually, so we need means for navigating the ‘search space’ of possible protein sequences efficiently and reliably in order to find desirable activities and other properties. Enzymologists distinguish binding (K d) and catalytic (k cat) steps. In a similar way, judicious strategies have blended design (for binding, specificity and active site modelling) with the more empirical methods of classical directed evolution (DE) for improving k cat (where natural evolution rarely seeks the highest values), especially with regard to residues distant from the active site and where the functional linkages underpinning enzyme dynamics are both unknown and hard to predict. Epistasis (where the ‘best’ amino acid at one site depends on that or those at others) is a notable feature of directed evolution. The aim of this review is to highlight some of the approaches that are being developed to allow us to use directed evolution to improve enzyme properties, often dramatically. We note that directed evolution differs in a number of ways from natural evolution, including in particular the available mechanisms and the likely selection pressures. Thus, we stress the opportunities afforded by techniques that enable one to map sequence to (structure and) activity in silico, as an effective means of modelling and exploring protein landscapes. Because known landscapes may be assessed and reasoned about as a whole, simultaneously, this offers opportunities for protein improvement not readily available to natural evolution on rapid timescales. Intelligent landscape navigation, informed by sequence-activity relationships and coupled to the emerging methods of synthetic biology, offers scope for the development of novel biocatalysts that are both highly active and robust. PMID:25503938
Heuristic reusable dynamic programming: efficient updates of local sequence alignment.
Hong, Changjin; Tewfik, Ahmed H
2009-01-01
Recomputation of the previously evaluated similarity results between biological sequences becomes inevitable when researchers realize errors in their sequenced data or when the researchers have to compare nearly similar sequences, e.g., in a family of proteins. We present an efficient scheme for updating local sequence alignments with an affine gap model. In principle, using the previous matching result between two amino acid sequences, we perform a forward-backward alignment to generate heuristic searching bands which are bounded by a set of suboptimal paths. Given a correctly updated sequence, we initially predict a new score of the alignment path for each contour to select the best candidates among them. Then, we run the Smith-Waterman algorithm in this confined space. Furthermore, our heuristic alignment for an updated sequence shows that it can be further accelerated by using reusable dynamic programming (rDP), our prior work. In this study, we successfully validate "relative node tolerance bound" (RNTB) in the pruned searching space. Furthermore, we improve the computational performance by quantifying the successful RNTB tolerance probability and switch to rDP on perturbation-resilient columns only. In our searching space derived by a threshold value of 90 percent of the optimal alignment score, we find that 98.3 percent of contours contain correctly updated paths. We also find that our method consumes only 25.36 percent of the runtime cost of sparse dynamic programming (sDP) method, and to only 2.55 percent of that of a normal dynamic programming with the Smith-Waterman algorithm.
Predicting protein contact map using evolutionary and physical constraints by integer programming.
Wang, Zhiyong; Xu, Jinbo
2013-07-01
Protein contact map describes the pairwise spatial and functional relationship of residues in a protein and contains key information for protein 3D structure prediction. Although studied extensively, it remains challenging to predict contact map using only sequence information. Most existing methods predict the contact map matrix element-by-element, ignoring correlation among contacts and physical feasibility of the whole-contact map. A couple of recent methods predict contact map by using mutual information, taking into consideration contact correlation and enforcing a sparsity restraint, but these methods demand for a very large number of sequence homologs for the protein under consideration and the resultant contact map may be still physically infeasible. This article presents a novel method PhyCMAP for contact map prediction, integrating both evolutionary and physical restraints by machine learning and integer linear programming. The evolutionary restraints are much more informative than mutual information, and the physical restraints specify more concrete relationship among contacts than the sparsity restraint. As such, our method greatly reduces the solution space of the contact map matrix and, thus, significantly improves prediction accuracy. Experimental results confirm that PhyCMAP outperforms currently popular methods no matter how many sequence homologs are available for the protein under consideration. http://raptorx.uchicago.edu.
Martinez, Carlos A.; Barr, Kenneth; Kim, Ah-Ram; Reinitz, John
2013-01-01
Synthetic biology offers novel opportunities for elucidating transcriptional regulatory mechanisms and enhancer logic. Complex cis-regulatory sequences—like the ones driving expression of the Drosophila even-skipped gene—have proven difficult to design from existing knowledge, presumably due to the large number of protein-protein interactions needed to drive the correct expression patterns of genes in multicellular organisms. This work discusses two novel computational methods for the custom design of enhancers that employ a sophisticated, empirically validated transcriptional model, optimization algorithms, and synthetic biology. These synthetic elements have both utilitarian and academic value, including improving existing regulatory models as well as evolutionary questions. The first method involves the use of simulated annealing to explore the sequence space for synthetic enhancers whose expression output fit a given search criterion. The second method uses a novel optimization algorithm to find functionally accessible pathways between two enhancer sequences. These paths describe a set of mutations wherein the predicted expression pattern does not significantly vary at any point along the path. Both methods rely on a predictive mathematical framework that maps the enhancer sequence space to functional output. PMID:23732772
Predictive Bcl-2 Family Binding Models Rooted in Experiment or Structure
DeBartolo, Joe; Dutta, Sanjib; Reich, Lothar; Keating, Amy E.
2013-01-01
Proteins of the Bcl-2 family either enhance or suppress programmed cell death and are centrally involved in cancer development and resistance to chemotherapy. BH3 (Bcl-2 homology 3)-only Bcl-2 proteins promote cell death by docking an α-helix into a hydrophobic groove on the surface of one or more of five pro-survival Bcl-2 receptor proteins. There is high structural homology within the pro-death and pro-survival families, yet a high degree of interaction specificity is nevertheless encoded, posing an interesting and important molecular recognition problem. Understanding protein features that dictate Bcl-2 interaction specificity is critical for designing peptide-based cancer therapeutics and diagnostics. In this study, we present peptide SPOT arrays and deep sequencing data from yeast display screening experiments that significantly expand the BH3 sequence space that has been experimentally tested for interaction with five human anti-apoptotic receptors. These data provide rich information about the determinants of Bcl-2 family specificity. To interpret and use the information, we constructed two simple data-based models that can predict affinity and specificity when evaluated on independent data sets within a limited sequence space. We also constructed a novel structure-based statistical potential, called STATIUM, which is remarkably good at predicting Bcl-2 affinity and specificity, especially considering it is not trained on experimental data. We compare the performance of our three models to each other and to alternative structure-based methods and discuss how such tools can guide prediction and design of new Bcl-2 family complexes. PMID:22617328
Muth, Thilo; García-Martín, Juan A; Rausell, Antonio; Juan, David; Valencia, Alfonso; Pazos, Florencio
2012-02-15
We have implemented in a single package all the features required for extracting, visualizing and manipulating fully conserved positions as well as those with a family-dependent conservation pattern in multiple sequence alignments. The program allows, among other things, to run different methods for extracting these positions, combine the results and visualize them in protein 3D structures and sequence spaces. JDet is a multiplatform application written in Java. It is freely available, including the source code, at http://csbg.cnb.csic.es/JDet. The package includes two of our recently developed programs for detecting functional positions in protein alignments (Xdet and S3Det), and support for other methods can be added as plug-ins. A help file and a guided tutorial for JDet are also available.
NASA Astrophysics Data System (ADS)
Tremberger, G.; Dehipawala, Sunil; Cheung, E.; Holden, T.; Sullivan, R.; Nguyen, A.; Lieberman, D.; Cheung, T.
2015-09-01
All metallo-proteins need post-translation metal incorporation. In fact, the isotope ratio of Fe, Cu, and Zn in physiology and oncology have emerged as an important tool. The nickel containing F430 is the prosthetic group of the enzyme methyl coenzyme M reductase which catalyzes the release of methane in the final step of methano-genesis, a prime energy metabolism candidate for life exploration space mission in the solar system. The 3.5 Gyr early life sulfite reductase as a life switch energy metabolism had Fe-Mo clusters. The nitrogenase for nitrogen fixation 3 billion years ago had Mo. The early life arsenite oxidase needed for anoxygenic photosynthesis energy metabolism 2.8 billion years ago had Mo and Fe. The selection pressure in metal incorporation inside a protein would be quantifiable in terms of the related nucleotide sequence complexity with fractal dimension and entropy values. Simulation model showed that the studied metal-required energy metabolism sequences had at least ten times more selection pressure relatively in comparison to the horizontal transferred sequences in Mealybug, guided by the outcome histogram of the correlation R-sq values. The metal energy metabolism sequence group was compared to the circadian clock KaiC sequence group using magnesium atomic level bond shifting mechanism in the protein, and the simulation model would suggest a much higher selection pressure for the energy life switch sequence group. The possibility of using Kepler 444 as an example of ancient life in Galaxy with the associated exoplanets has been proposed and is further discussed in this report. Examples of arsenic metal bonding shift probed by Synchrotron-based X-ray spectroscopy data and Zn controlled FOXP2 regulated pathways in human and chimp brain studied tissue samples are studied in relationship to the sequence bioinformatics. The analysis results suggest that relatively large metal bonding shift amount is associated with low probability correlation R-sq outcome in the bioinformatics simulation.
Ashburner, M; Misra, S; Roote, J; Lewis, S E; Blazej, R; Davis, T; Doyle, C; Galle, R; George, R; Harris, N; Hartzell, G; Harvey, D; Hong, L; Houston, K; Hoskins, R; Johnson, G; Martin, C; Moshrefi, A; Palazzolo, M; Reese, M G; Spradling, A; Tsang, G; Wan, K; Whitelaw, K; Celniker, S
1999-01-01
A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 chromosome polytene bands on chromosome arm 2L, including the genetically well-characterized "Adh region." A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of from 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species.Before beginning a Hunt, it is wise to ask someone what you are looking for before you begin looking for it. Milne 1926 PMID:10471707
Attractors in Sequence Space: Agent-Based Exploration of MHC I Binding Peptides.
Jäger, Natalie; Wisniewska, Joanna M; Hiss, Jan A; Freier, Anja; Losch, Florian O; Walden, Peter; Wrede, Paul; Schneider, Gisbert
2010-01-12
Ant Colony Optimization (ACO) is a meta-heuristic that utilizes a computational analogue of ant trail pheromones to solve combinatorial optimization problems. The size of the ant colony and the representation of the ants' pheromone trails is unique referring to the given optimization problem. In the present study, we employed ACO to generate novel peptides that stabilize MHC I protein on the plasma membrane of a murine lymphoma cell line. A jury of feedforward neural network classifiers served as fitness function for peptide design by ACO. Bioactive murine MHC I H-2K(b) stabilizing as well as nonstabilizing octapeptides were designed, synthesized and tested. These peptides reveal residue motifs that are relevant for MHC I receptor binding. We demonstrate how the performance of the implemented ACO algorithm depends on the colony size and the size of the search space. The actual peptide design process by ACO constitutes a search path in sequence space that can be visualized as trajectories on a self-organizing map (SOM). By projecting the sequence space on a SOM we visualize the convergence of the different solutions that emerge during the optimization process in sequence space. The SOM representation reveals attractors in sequence space for MHC I binding peptides. The combination of ACO and SOM enables systematic peptide optimization. This technique allows for the rational design of various types of bioactive peptides with minimal experimental effort. Here, we demonstrate its successful application to the design of MHC-I binding and nonbinding peptides which exhibit substantial bioactivity in a cell-based assay. Copyright © 2010 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
An Exploration into Fern Genome Space.
Wolf, Paul G; Sessa, Emily B; Marchant, Daniel Blaine; Li, Fay-Wei; Rothfels, Carl J; Sigel, Erin M; Gitzendanner, Matthew A; Visger, Clayton J; Banks, Jo Ann; Soltis, Douglas E; Soltis, Pamela S; Pryer, Kathleen M; Der, Joshua P
2015-08-26
Ferns are one of the few remaining major clades of land plants for which a complete genome sequence is lacking. Knowledge of genome space in ferns will enable broad-scale comparative analyses of land plant genes and genomes, provide insights into genome evolution across green plants, and shed light on genetic and genomic features that characterize ferns, such as their high chromosome numbers and large genome sizes. As part of an initial exploration into fern genome space, we used a whole genome shotgun sequencing approach to obtain low-density coverage (∼0.4X to 2X) for six fern species from the Polypodiales (Ceratopteris, Pteridium, Polypodium, Cystopteris), Cyatheales (Plagiogyria), and Gleicheniales (Dipteris). We explore these data to characterize the proportion of the nuclear genome represented by repetitive sequences (including DNA transposons, retrotransposons, ribosomal DNA, and simple repeats) and protein-coding genes, and to extract chloroplast and mitochondrial genome sequences. Such initial sweeps of fern genomes can provide information useful for selecting a promising candidate fern species for whole genome sequencing. We also describe variation of genomic traits across our sample and highlight some differences and similarities in repeat structure between ferns and seed plants. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Dictionary-driven protein annotation
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-01-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/. PMID:12202776
Freudl, R; Schwarz, H; Klose, M; Movva, N R; Henning, U
1985-12-16
Information, in addition to that provided by signal sequences, for translocation across the plasma membrane is thought to be present in exported proteins of Escherichia coli. Such information must also exist for the localization of such proteins. To determine the nature of this information, overlapping inframe deletions have been constructed in the ompA gene which codes for a 325-residue major outer membrane protein. In addition, one deletion, encoding only the NH2-terminal part of the protein up to residue 160, was prepared. The location of each product was determined by immunoelectron microscopy. Proteins missing residues 4-45, 43-84, 46-227, 86-227 or 160-325 of the mature protein were all efficiently translocated across the plasma membrane. The first two proteins were found in the outer membrane, the others in the periplasmic space. It has been proposed that export and sorting signals consist of relatively small amino acid sequences near the NH2 terminus of an outer membrane protein. On the basis of sequence homologies it has also been suggested that such proteins possess a common sorting signal. The locations of the partially deleted proteins described here show that a unique export signal does not exist in the OmpA protein. The proposed common sorting signal spans residues 1-14 of OmpA. Since this region is not essential for routing the protein, the existence of a common sorting signal is doubtful. It is suggested that information both for export (if existent) and localization lies within protein conformation which for the former process should be present repeatedly in the polypeptide.
Experimental rugged fitness landscape in protein sequence space.
Hayashi, Yuuki; Aita, Takuyo; Toyota, Hitoshi; Husimi, Yuzuru; Urabe, Itaru; Yomo, Tetsuya
2006-12-20
The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12-130 of the initial random polypeptide and selection for infectivity, the selected phage showed a 1.7x10(4)-fold increase in infectivity, defined as the number of infected cells per ml of phage suspension. Fitness was defined as the logarithm of infectivity, and we analyzed (1) the dependence of stationary fitness on library size, which increased gradually, and (2) the time course of changes in fitness in transitional phases, based on an original theory regarding the evolutionary dynamics in Kauffman's n-k fitness landscape model. In the landscape model, single mutations at single sites among n sites affect the contribution of k other sites to fitness. Based on the results of these analyses, k was estimated to be 18-24. According to the estimated parameters, the landscape was plotted as a smooth surface up to a relative fitness of 0.4 of the global peak, whereas the landscape had a highly rugged surface with many local peaks above this relative fitness value. Based on the landscapes of these two different surfaces, it appears possible for adaptive walks with only random substitutions to climb with relative ease up to the middle region of the fitness landscape from any primordial or random sequence, whereas an enormous range of sequence diversity is required to climb further up the rugged surface above the middle region.
Experimental Rugged Fitness Landscape in Protein Sequence Space
Hayashi, Yuuki; Aita, Takuyo; Toyota, Hitoshi; Husimi, Yuzuru; Urabe, Itaru; Yomo, Tetsuya
2006-01-01
The fitness landscape in sequence space determines the process of biomolecular evolution. To plot the fitness landscape of protein function, we carried out in vitro molecular evolution beginning with a defective fd phage carrying a random polypeptide of 139 amino acids in place of the g3p minor coat protein D2 domain, which is essential for phage infection. After 20 cycles of random substitution at sites 12–130 of the initial random polypeptide and selection for infectivity, the selected phage showed a 1.7×104-fold increase in infectivity, defined as the number of infected cells per ml of phage suspension. Fitness was defined as the logarithm of infectivity, and we analyzed (1) the dependence of stationary fitness on library size, which increased gradually, and (2) the time course of changes in fitness in transitional phases, based on an original theory regarding the evolutionary dynamics in Kauffman's n-k fitness landscape model. In the landscape model, single mutations at single sites among n sites affect the contribution of k other sites to fitness. Based on the results of these analyses, k was estimated to be 18–24. According to the estimated parameters, the landscape was plotted as a smooth surface up to a relative fitness of 0.4 of the global peak, whereas the landscape had a highly rugged surface with many local peaks above this relative fitness value. Based on the landscapes of these two different surfaces, it appears possible for adaptive walks with only random substitutions to climb with relative ease up to the middle region of the fitness landscape from any primordial or random sequence, whereas an enormous range of sequence diversity is required to climb further up the rugged surface above the middle region. PMID:17183728
Fu, Xiaoran; Apgar, James R.; Keating, Amy E.
2007-01-01
Computational protein design can be used to select sequences that are compatible with a fixed-backbone template. This strategy has been used in numerous instances to engineer novel proteins. However, the fixed-backbone assumption severely restricts the sequence space that is accessible via design. For challenging problems, such as the design of functional proteins, this may not be acceptable. In this paper, we present a method for introducing backbone flexibility into protein design calculations and apply it to the design of diverse helical BH3 ligands that bind to the anti-apoptotic protein Bcl-xL, a member of the Bcl-2 protein family. We demonstrate how normal mode analysis can be used to sample different BH3 backbones, and show that this leads to a larger and more diverse set of low-energy solutions than can be achieved using a native high-resolution Bcl-xL complex crystal structure as a template. We tested several of the designed solutions experimentally and found that this approach worked well when normal mode calculations were used to deform a native BH3 helix structure, but less well when they were used to deform an idealized helix. A subsequent round of design and testing identified a likely source of the problem as inadequate sampling of the helix pitch. In all, we tested seventeen designed BH3 peptide sequences, including several point mutants. Of these, eight bound well to Bcl-xL and four others showed weak but detectable binding. The successful designs showed a diversity of sequences that would have been difficult or impossible to achieve using only a fixed backbone. Thus, introducing backbone flexibility via normal mode analysis effectively broadened the set of sequences identified by computational design, and provided insight into positions important for binding Bcl-xL. PMID:17597151
Degenerative Minimalism in the Genome of a Psyllid Endosymbiont
Clark, Marta A.; Baumann, Linda; Thao, MyLo Ly; Moran, Nancy A.; Baumann, Paul
2001-01-01
Psyllids, like aphids, feed on plant phloem sap and are obligately associated with prokaryotic endosymbionts acquired through vertical transmission from an ancestral infection. We have sequenced 37 kb of DNA of the genome of Carsonella ruddii, the endosymbiont of psyllids, and found that it has a number of unusual properties revealing a more extreme case of degeneration than was previously reported from studies of eubacterial genomes, including that of the aphid endosymbiont Buchnera aphidicola. Among the unusual properties are an exceptionally low guanine-plus-cytosine content (19.9%), almost complete absence of intergenic spaces, operon fusion, and lack of the usual promoter sequences upstream of 16S rDNA. These features suggest the synthesis of long mRNAs and translational coupling. The most extreme instances of base compositional bias occur in the genes encoding proteins that have less highly conserved amino acid sequences; the guanine-plus-cytosine content of some protein-coding sequences is as low as 10%. The shift in base composition has a large effect on proteins: in polypeptides of C. ruddii, half of the residues consist of five amino acids with codons low in guanine plus cytosine. Furthermore, the proteins of C. ruddii are reduced in size, with an average of about 9% fewer amino acids than in homologous proteins of related bacteria. These observations suggest that the C. ruddii genome is not subject to constraints that limit the evolution of other known eubacteria. PMID:11222582
Plucienniczak, A; Schroeder, E; Zettlmeissl, G; Streeck, R E
1985-01-01
The nucleotide sequence of a 7.6 kb vaccinia DNA segment from a genomic region conserved among different orthopox virus has been determined. This segment contains a tight cluster of 12 partly overlapping open reading frames most of which can be correlated with previously identified early and late proteins and mRNAs. Regulatory signals used by vaccinia virus have been studied. Presumptive promoter regions are rich in A, T and carry the consensus sequences TATA and AATAA spaced at 20-24 base pairs. Tandem repeats of a CTATTC consensus sequence are proposed to be involved in the termination of early transcription. PMID:2987815
Pea chloroplast DnaJ-J8 and Toc12 are encoded by the same gene and localized in the stroma.
Chiu, Chi-Chou; Chen, Lih-Jen; Li, Hsou-min
2010-11-01
Toc12 is a novel J domain-containing protein identified in pea (Pisum sativum) chloroplasts. It was shown to be an integral outer membrane protein localizing in the intermembrane space of the chloroplast envelope. Furthermore, Toc12 was shown to associate with an intermembrane space Hsp70, suggesting that Toc12 is important for protein translocation across the chloroplast envelope. Toc12 shares a high degree of sequence similarity with Arabidopsis (Arabidopsis thaliana) DnaJ-J8, which has been suggested to be a soluble protein of the chloroplast stroma. Here, we isolated genes encoding DnaJ-J8 from pea and found that Toc12 is a truncated clone of one of the pea DnaJ-J8s. Protein import analyses indicate that Toc12 and DnaJ-J8s possess a cleavable transit peptide and are localized in the stroma. Arabidopsis mutants with T-DNA insertions in the DnaJ-J8 gene show no defect in chloroplast protein import. Implications of these results in the energetics and mechanisms of chloroplast protein import are discussed.
Fischer, Manuel; Horn, Sebastian; Belkacemi, Anouar; Kojer, Kerstin; Petrungaro, Carmelina; Habich, Markus; Ali, Muna; Küttner, Victoria; Bien, Melanie; Kauff, Frank; Dengjel, Jörn; Herrmann, Johannes M.; Riemer, Jan
2013-01-01
Oxidation of cysteine residues to disulfides drives import of many proteins into the intermembrane space of mitochondria. Recent studies in yeast unraveled the basic principles of mitochondrial protein oxidation, but the kinetics under physiological conditions is unknown. We developed assays to follow protein oxidation in living mammalian cells, which reveal that import and oxidative folding of proteins are kinetically and functionally coupled and depend on the oxidoreductase Mia40, the sulfhydryl oxidase augmenter of liver regeneration (ALR), and the intracellular glutathione pool. Kinetics of substrate oxidation depends on the amount of Mia40 and requires tightly balanced amounts of ALR. Mia40-dependent import of Cox19 in human cells depends on the inner membrane potential. Our observations reveal considerable differences in the velocities of mitochondrial import pathways: whereas preproteins with bipartite targeting sequences are imported within seconds, substrates of Mia40 remain in the cytosol for several minutes and apparently escape premature degradation and oxidation. PMID:23676665
Guaranteed Discrete Energy Optimization on Large Protein Design Problems.
Simoncini, David; Allouche, David; de Givry, Simon; Delmas, Céline; Barbe, Sophie; Schiex, Thomas
2015-12-08
In Computational Protein Design (CPD), assuming a rigid backbone and amino-acid rotamer library, the problem of finding a sequence with an optimal conformation is NP-hard. In this paper, using Dunbrack's rotamer library and Talaris2014 decomposable energy function, we use an exact deterministic method combining branch and bound, arc consistency, and tree-decomposition to provenly identify the global minimum energy sequence-conformation on full-redesign problems, defining search spaces of size up to 10(234). This is achieved on a single core of a standard computing server, requiring a maximum of 66GB RAM. A variant of the algorithm is able to exhaustively enumerate all sequence-conformations within an energy threshold of the optimum. These proven optimal solutions are then used to evaluate the frequencies and amplitudes, in energy and sequence, at which an existing CPD-dedicated simulated annealing implementation may miss the optimum on these full redesign problems. The probability of finding an optimum drops close to 0 very quickly. In the worst case, despite 1,000 repeats, the annealing algorithm remained more than 1 Rosetta unit away from the optimum, leading to design sequences that could differ from the optimal sequence by more than 30% of their amino acids.
Integrating linear optimization with structural modeling to increase HIV neutralization breadth.
Sevy, Alexander M; Panda, Swetasudha; Crowe, James E; Meiler, Jens; Vorobeychik, Yevgeniy
2018-02-01
Computational protein design has been successful in modeling fixed backbone proteins in a single conformation. However, when modeling large ensembles of flexible proteins, current methods in protein design have been insufficient. Large barriers in the energy landscape are difficult to traverse while redesigning a protein sequence, and as a result current design methods only sample a fraction of available sequence space. We propose a new computational approach that combines traditional structure-based modeling using the Rosetta software suite with machine learning and integer linear programming to overcome limitations in the Rosetta sampling methods. We demonstrate the effectiveness of this method, which we call BROAD, by benchmarking the performance on increasing predicted breadth of anti-HIV antibodies. We use this novel method to increase predicted breadth of naturally-occurring antibody VRC23 against a panel of 180 divergent HIV viral strains and achieve 100% predicted binding against the panel. In addition, we compare the performance of this method to state-of-the-art multistate design in Rosetta and show that we can outperform the existing method significantly. We further demonstrate that sequences recovered by this method recover known binding motifs of broadly neutralizing anti-HIV antibodies. Finally, our approach is general and can be extended easily to other protein systems. Although our modeled antibodies were not tested in vitro, we predict that these variants would have greatly increased breadth compared to the wild-type antibody.
Jonscher, Ernst; Erdbeer, Alexander; Günther, Marie; Kurth, Michael
2015-07-25
The family of cysteine rich proteins of the oocyst wall (COWPs) originally described in Cryptosporidium can also be found in Toxoplasma gondii (TgOWPs) localised to the oocyst wall as well. Genome sequence analysis of Eimeria suggests that these proteins may also exist in this genus and led us to the assumption that these proteins may also play a role in oocyst wall formation. In this study, COWP-like encoding sequences had been identified in Eimeria nieschulzi. The predicted gene sequences were subsequently utilized in reporter gene assays to observe time of expression and localisation of the reporter protein in vivo. Both investigated proteins, EnOWP2 and EnOWP6, were expressed during sporulation. The EnOWP2-promoter driven mCherry was found in the cytoplasm and the EnOWP2, respectively EnOWP6, fused to mCherry was initially observed in the extracytoplasmatic space between sporoblast and oocyst wall. This, so far unnamed compartment was designated as circumplasm. Later, the mCherry reporter co-localised with the sporocyst wall of the sporulated oocysts. This observation had been confirmed by confocal microscopy, excystation experiments and IFA. Transcript analysis revealed the intron-exon structure of these genes and confirmed the expression of EnOWP2 and EnOWP6 during sporogony. Our results allow us to assume a role, of both investigated EnOWP proteins, in the sporocyst wall formation of E. nieschulzi. Data mining and sequence comparisons to T. gondii and other Eimeria species allow us to hypothesise a conserved process within the coccidia. A role in oocyst wall formation had not been observed in E. nieschulzi.
ACLAME: a CLAssification of Mobile genetic Elements, update 2010.
Leplae, Raphaël; Lima-Mendez, Gipsi; Toussaint, Ariane
2010-01-01
The ACLAME database is dedicated to the collection, analysis and classification of sequenced mobile genetic elements (MGEs, in particular phages and plasmids). In addition to providing information on the MGEs content, classifications are available at various levels of organization. At the gene/protein level, families group similar sequences that are expected to share the same function. Families of four or more proteins are manually assigned with a functional annotation using the GeneOntology and the locally developed ontology MeGO dedicated to MGEs. At the genome level, evolutionary cohesive modules group sets of protein families shared among MGEs. At the population level, networks display the reticulate evolutionary relationships among MGEs. To increase the coverage of the phage sequence space, ACLAME version 0.4 incorporates 760 high-quality predicted prophages selected from the Prophinder database. Most of the data can be downloaded from the freely accessible ACLAME web site (http://aclame.ulb.ac.be). The BLAST interface for querying the database has been extended and numerous tools for in-depth analysis of the results have been added.
Rampello, Anthony J; Glynn, Steven E
2017-03-24
The i-AAA protease is a component of the mitochondrial quality control machinery that regulates respiration, mitochondrial dynamics, and protein import. The protease is required to select specific substrates for degradation from among the diverse complement of proteins present in mitochondria, yet the rules that govern this selection are unclear. Here, we reconstruct the yeast i-AAA protease, Yme1p, to examine the in vitro degradation of two intermembrane space chaperone subunits, Tim9 and Tim10. Yme1p degrades Tim10 more rapidly than Tim9 despite high sequence and structural similarity, and loss of Tim10 is accelerated by the disruption of conserved disulfide bonds within the substrate. An unstructured N-terminal region of Tim10 is necessary and sufficient to target the substrate to the protease through recognition of a short phenylalanine-rich motif, and the presence of similar motifs in other small Tim proteins predicts robust degradation by the protease. Together, these results identify the first specific degron sequence within a native i-AAA protease substrate. Copyright © 2017 Elsevier Ltd. All rights reserved.
Swellix: a computational tool to explore RNA conformational space.
Sloat, Nathan; Liu, Jui-Wen; Schroeder, Susan J
2017-11-21
The sequence of nucleotides in an RNA determines the possible base pairs for an RNA fold and thus also determines the overall shape and function of an RNA. The Swellix program presented here combines a helix abstraction with a combinatorial approach to the RNA folding problem in order to compute all possible non-pseudoknotted RNA structures for RNA sequences. The Swellix program builds on the Crumple program and can include experimental constraints on global RNA structures such as the minimum number and lengths of helices from crystallography, cryoelectron microscopy, or in vivo crosslinking and chemical probing methods. The conceptual advance in Swellix is to count helices and generate all possible combinations of helices rather than counting and combining base pairs. Swellix bundles similar helices and includes improvements in memory use and efficient parallelization. Biological applications of Swellix are demonstrated by computing the reduction in conformational space and entropy due to naturally modified nucleotides in tRNA sequences and by motif searches in Human Endogenous Retroviral (HERV) RNA sequences. The Swellix motif search reveals occurrences of protein and drug binding motifs in the HERV RNA ensemble that do not occur in minimum free energy or centroid predicted structures. Swellix presents significant improvements over Crumple in terms of efficiency and memory use. The efficient parallelization of Swellix enables the computation of sequences as long as 418 nucleotides with sufficient experimental constraints. Thus, Swellix provides a practical alternative to free energy minimization tools when multiple structures, kinetically determined structures, or complex RNA-RNA and RNA-protein interactions are present in an RNA folding problem.
Global analysis of protein folding using massively parallel design, synthesis and testing
Rocklin, Gabriel J.; Chidyausiku, Tamuka M.; Goreshnik, Inna; Ford, Alex; Houliston, Scott; Lemak, Alexander; Carter, Lauren; Ravichandran, Rashmi; Mulligan, Vikram K.; Chevalier, Aaron; Arrowsmith, Cheryl H.; Baker, David
2017-01-01
Proteins fold into unique native structures stabilized by thousands of weak interactions that collectively overcome the entropic cost of folding. Though these forces are “encoded” in the thousands of known protein structures, “decoding” them is challenging due to the complexity of natural proteins that have evolved for function, not stability. Here we combine computational protein design, next-generation gene synthesis, and a high-throughput protease susceptibility assay to measure folding and stability for over 15,000 de novo designed miniproteins, 1,000 natural proteins, 10,000 point-mutants, and 30,000 negative control sequences, identifying over 2,500 new stable designed proteins in four basic folds. This scale—three orders of magnitude greater than that of previous studies of design or folding—enabled us to systematically examine how sequence determines folding and stability in uncharted protein space. Iteration between design and experiment increased the design success rate from 6% to 47%, produced stable proteins unlike those found in nature for topologies where design was initially unsuccessful, and revealed subtle contributions to stability as designs became increasingly optimized. Our approach achieves the long-standing goal of a tight feedback cycle between computation and experiment, and promises to transform computational protein design into a data-driven science. PMID:28706065
New technologies accelerate the exploration of non-coding RNAs in horticultural plants
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Degao; Mewalal, Ritesh; Hu, Rongbin
Non-coding RNAs (ncRNAs), that is, RNAs not translated into proteins, are crucial regulators of a variety of biological processes in plants. While protein-encoding genes have been relatively well-annotated in sequenced genomes, accounting for a small portion of the genome space in plants, the universe of plant ncRNAs is rapidly expanding. Recent advances in experimental and computational technologies have generated a great momentum for discovery and functional characterization of ncRNAs. Here we summarize the classification and known biological functions of plant ncRNAs, review the application of next-generation sequencing (NGS) technology and ribosome profiling technology to ncRNA discovery in horticultural plants andmore » discuss the application of new technologies, especially the new genome-editing tool clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) systems, to functional characterization of plant ncRNAs.« less
Unassigned MS/MS Spectra: Who Am I?
Pathan, Mohashin; Samuel, Monisha; Keerthikumar, Shivakumar; Mathivanan, Suresh
2017-01-01
Recent advances in high resolution tandem mass spectrometry (MS) has resulted in the accumulation of high quality data. Paralleled with these advances in instrumentation, bioinformatics software have been developed to analyze such quality datasets. In spite of these advances, data analysis in mass spectrometry still remains critical for protein identification. In addition, the complexity of the generated MS/MS spectra, unpredictable nature of peptide fragmentation, sequence annotation errors, and posttranslational modifications has impeded the protein identification process. In a typical MS data analysis, about 60 % of the MS/MS spectra remains unassigned. While some of these could attribute to the low quality of the MS/MS spectra, a proportion can be classified as high quality. Further analysis may reveal how much of the unassigned MS spectra attribute to search space, sequence annotation errors, mutations, and/or posttranslational modifications. In this chapter, the tools used to identify proteins and ways to assign unassigned tandem MS spectra are discussed.
New technologies accelerate the exploration of non-coding RNAs in horticultural plants
Liu, Degao; Mewalal, Ritesh; Hu, Rongbin; Tuskan, Gerald A; Yang, Xiaohan
2017-01-01
Non-coding RNAs (ncRNAs), that is, RNAs not translated into proteins, are crucial regulators of a variety of biological processes in plants. While protein-encoding genes have been relatively well-annotated in sequenced genomes, accounting for a small portion of the genome space in plants, the universe of plant ncRNAs is rapidly expanding. Recent advances in experimental and computational technologies have generated a great momentum for discovery and functional characterization of ncRNAs. Here we summarize the classification and known biological functions of plant ncRNAs, review the application of next-generation sequencing (NGS) technology and ribosome profiling technology to ncRNA discovery in horticultural plants and discuss the application of new technologies, especially the new genome-editing tool clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) systems, to functional characterization of plant ncRNAs. PMID:28698797
Zeil, Catharina; Widmann, Michael; Fademrecht, Silvia; Vogel, Constantin; Pleiss, Jürgen
2016-05-01
The Lactamase Engineering Database (www.LacED.uni-stuttgart.de) was developed to facilitate the classification and analysis of TEM β-lactamases. The current version contains 474 TEM variants. Two hundred fifty-nine variants form a large scale-free network of highly connected point mutants. The network was divided into three subnetworks which were enriched by single phenotypes: one network with predominantly 2be and two networks with 2br phenotypes. Fifteen positions were found to be highly variable, contributing to the majority of the observed variants. Since it is expected that a considerable fraction of the theoretical sequence space is functional, the currently sequenced 474 variants represent only the tip of the iceberg of functional TEM β-lactamase variants which form a huge natural reservoir of highly interconnected variants. Almost 50% of the variants are part of a quartet. Thus, two single mutations that result in functional enzymes can be combined into a functional protein. Most of these quartets consist of the same phenotype, or the mutations are additive with respect to the phenotype. By predicting quartets from triplets, 3,916 unknown variants were constructed. Eighty-seven variants complement multiple quartets and therefore have a high probability of being functional. The construction of a TEM β-lactamase network and subsequent analyses by clustering and quartet prediction are valuable tools to gain new insights into the viable sequence space of TEM β-lactamases and to predict their phenotype. The highly connected sequence space of TEM β-lactamases is ideally suited to network analysis and demonstrates the strengths of network analysis over tree reconstruction methods. Copyright © 2016, American Society for Microbiology. All Rights Reserved.
Piao, Hailan; Froula, Jeff; Du, Changbin; Kim, Tae-Wan; Hawley, Erik R; Bauer, Stefan; Wang, Zhong; Ivanova, Nathalia; Clark, Douglas S; Klenk, Hans-Peter; Hess, Matthias
2014-08-01
Although recent nucleotide sequencing technologies have significantly enhanced our understanding of microbial genomes, the function of ∼35% of genes identified in a genome currently remains unknown. To improve the understanding of microbial genomes and consequently of microbial processes it will be crucial to assign a function to this "genomic dark matter." Due to the urgent need for additional carbohydrate-active enzymes for improved production of transportation fuels from lignocellulosic biomass, we screened the genomes of more than 5,500 microorganisms for hypothetical proteins that are located in the proximity of already known cellulases. We identified, synthesized and expressed a total of 17 putative cellulase genes with insufficient sequence similarity to currently known cellulases to be identified as such using traditional sequence annotation techniques that rely on significant sequence similarity. The recombinant proteins of the newly identified putative cellulases were subjected to enzymatic activity assays to verify their hydrolytic activity towards cellulose and lignocellulosic biomass. Eleven (65%) of the tested enzymes had significant activity towards at least one of the substrates. This high success rate highlights that a gene context-based approach can be used to assign function to genes that are otherwise categorized as "genomic dark matter" and to identify biomass-degrading enzymes that have little sequence similarity to already known cellulases. The ability to assign function to genes that have no related sequence representatives with functional annotation will be important to enhance our understanding of microbial processes and to identify microbial proteins for a wide range of applications. © 2014 Wiley Periodicals, Inc.
Using distances between Top-n-gram and residue pairs for protein remote homology detection.
Liu, Bin; Xu, Jinghao; Zou, Quan; Xu, Ruifeng; Wang, Xiaolong; Chen, Qingcai
2014-01-01
Protein remote homology detection is one of the central problems in bioinformatics, which is important for both basic research and practical application. Currently, discriminative methods based on Support Vector Machines (SVMs) achieve the state-of-the-art performance. Exploring feature vectors incorporating the position information of amino acids or other protein building blocks is a key step to improve the performance of the SVM-based methods. Two new methods for protein remote homology detection were proposed, called SVM-DR and SVM-DT. SVM-DR is a sequence-based method, in which the feature vector representation for protein is based on the distances between residue pairs. SVM-DT is a profile-based method, which considers the distances between Top-n-gram pairs. Top-n-gram can be viewed as a profile-based building block of proteins, which is calculated from the frequency profiles. These two methods are position dependent approaches incorporating the sequence-order information of protein sequences. Various experiments were conducted on a benchmark dataset containing 54 families and 23 superfamilies. Experimental results showed that these two new methods are very promising. Compared with the position independent methods, the performance improvement is obvious. Furthermore, the proposed methods can also provide useful insights for studying the features of protein families. The better performance of the proposed methods demonstrates that the position dependant approaches are efficient for protein remote homology detection. Another advantage of our methods arises from the explicit feature space representation, which can be used to analyze the characteristic features of protein families. The source code of SVM-DT and SVM-DR is available at http://bioinformatics.hitsz.edu.cn/DistanceSVM/index.jsp.
Rapid and Programmable Protein Mutagenesis Using Plasmid Recombineering.
Higgins, Sean A; Ouonkap, Sorel V Y; Savage, David F
2017-10-20
Comprehensive and programmable protein mutagenesis is critical for understanding structure-function relationships and improving protein function. There is thus a need for robust and unbiased molecular biological approaches for the construction of the requisite comprehensive protein libraries. Here we demonstrate that plasmid recombineering is a simple and robust in vivo method for the generation of protein mutants for both comprehensive library generation as well as programmable targeting of sequence space. Using the fluorescent protein iLOV as a model target, we build a complete mutagenesis library and find it to be specific and comprehensive, detecting 99.8% of our intended mutations. We then develop a thermostability screen and utilize our comprehensive mutation data to rapidly construct a targeted and multiplexed library that identifies significantly improved variants, thus demonstrating rapid protein engineering in a simple protocol.
Distribution of genotype network sizes in sequence-to-structure genotype-phenotype maps.
Manrubia, Susanna; Cuesta, José A
2017-04-01
An essential quantity to ensure evolvability of populations is the navigability of the genotype space. Navigability, understood as the ease with which alternative phenotypes are reached, relies on the existence of sufficiently large and mutually attainable genotype networks. The size of genotype networks (e.g. the number of RNA sequences folding into a particular secondary structure or the number of DNA sequences coding for the same protein structure) is astronomically large in all functional molecules investigated: an exhaustive experimental or computational study of all RNA folds or all protein structures becomes impossible even for moderately long sequences. Here, we analytically derive the distribution of genotype network sizes for a hierarchy of models which successively incorporate features of increasingly realistic sequence-to-structure genotype-phenotype maps. The main feature of these models relies on the characterization of each phenotype through a prototypical sequence whose sites admit a variable fraction of letters of the alphabet. Our models interpolate between two limit distributions: a power-law distribution, when the ordering of sites in the prototypical sequence is strongly constrained, and a lognormal distribution, as suggested for RNA, when different orderings of the same set of sites yield different phenotypes. Our main result is the qualitative and quantitative identification of those features of sequence-to-structure maps that lead to different distributions of genotype network sizes. © 2017 The Author(s).
A novel and efficient technique for identification and classification of GPCRs.
Gupta, Ravi; Mittal, Ankush; Singh, Kuldip
2008-07-01
G-protein coupled receptors (GPCRs) play a vital role in different biological processes, such as regulation of growth, death, and metabolism of cells. GPCRs are the focus of significant amount of current pharmaceutical research since they interact with more than 50% of prescription drugs. The dipeptide-based support vector machine (SVM) approach is the most accurate technique to identify and classify the GPCRs. However, this approach has two major disadvantages. First, the dimension of dipeptide-based feature vector is equal to 400. The large dimension makes the classification task computationally and memory wise inefficient. Second, it does not consider the biological properties of protein sequence for identification and classification of GPCRs. In this paper, we present a novel-feature-based SVM classification technique. The novel features are derived by applying wavelet-based time series analysis approach on protein sequences. The proposed feature space summarizes the variance information of seven important biological properties of amino acids in a protein sequence. In addition, the dimension of the feature vector for proposed technique is equal to 35. Experiments were performed on GPCRs protein sequences available at GPCRs Database. Our approach achieves an accuracy of 99.9%, 98.06%, 97.78%, and 94.08% for GPCR superfamily, families, subfamilies, and subsubfamilies (amine group), respectively, when evaluated using fivefold cross-validation. Further, an accuracy of 99.8%, 97.26%, and 97.84% was obtained when evaluated on unseen or recall datasets of GPCR superfamily, families, and subfamilies, respectively. Comparison with dipeptide-based SVM technique shows the effectiveness of our approach.
A novel, extremely alkaliphilic and cold-active esterase from Antarctic desert soil.
Hu, Xiao Ping; Heath, Caroline; Taylor, Mark Paul; Tuffin, Marla; Cowan, Don
2012-01-01
A novel, cold-active and highly alkaliphilic esterase was isolated from an Antarctic desert soil metagenomic library by functional screening. The 1,044 bp gene sequence contained several conserved regions common to lipases/esterases, but lacked clear classification based on sequence analysis alone. Moderate (<40%) amino acid sequence similarity to known esterases was apparent (the closest neighbour being a hypothetical protein from Chitinophaga pinensis), despite phylogenetic distance to many of the lipolytic "families". The enzyme functionally demonstrated activity towards shorter chain p-nitrophenyl esters with the optimal activity recorded towards p-nitrophenyl propionate (C3). The enzyme possessed an apparent T(opt) at 20°C and a pH optimum at pH 11. Esterases possessing such extreme alkaliphily are rare and so this enzyme represents an intriguing novel locus in protein sequence space. A metagenomic approach has been shown, in this case, to yield an enzyme with quite different sequential/structural properties to known lipases. It serves as an excellent candidate for analysis of the molecular mechanisms responsible for both cold and alkaline activity and novel structure-function relationships of esterase activity.
Free Energy Landscape - Settlements of Key Residues.
NASA Astrophysics Data System (ADS)
Aroutiounian, Svetlana
2007-03-01
FEL perspective in studies of protein folding transitions reflects notion that since there are ˜10^N conformations to scan in search of lowest free energy state, random search is beyond biological timescale. Protein folding must follow certain fel pathways and folding kinetics of evolutionary selected proteins dominates kinetic traps. Good model for functional robustness of natural proteins - coarse-grained model protein is not very accurate but affords bringing simulations closer to biological realm; Go-like potential secures the fel funnel shape; biochemical contacts signify the funnel bottleneck. Boltzmann-weighted ensemble of protein conformations and histogram method are used to obtain from MC sampling of protein conformational space the approximate probability distribution. The fel is F(rmsd) = -1/βLn[Hist(rmsd)], β=kBT and rmsd is root-mean-square-deviation from native conformation. The sperm whale myoglobin has rich dynamic behavior, is small and large - on computational scale, has a symmetry in architecture and unusual sextet of residue pairs. Main idea: there is a mathematical relation between protein fel and a key residues set providing stability to folding transition. Is the set evolutionary conserved also for functional reasons? Hypothesis: primary sequence determines the key residues positions conserved as stabilizers and the fel is the battlefield for the folding stability. Preliminary results: primary sequence - not the architecture, is the rule settler, indeed.
Beintema, J J; Peumans, W J
1992-03-09
The primary structure of stinging nettle (Urtica dioica) agglutinin has been determined by sequence analysis of peptides obtained from three overlapping proteolytic digests. The sequence of 80 residues consists of two hevein-like domains with the same spacing of half-cystine residues and several other conserved residues as observed earlier in other proteins with hevein-like domains. The hinge region between the two domains is four residues longer than those between the four domains in cereal lectins like wheat germ agglutinin.
Structural alphabets derived from attractors in conformational space
2010-01-01
Background The hierarchical and partially redundant nature of protein structures justifies the definition of frequently occurring conformations of short fragments as 'states'. Collections of selected representatives for these states define Structural Alphabets, describing the most typical local conformations within protein structures. These alphabets form a bridge between the string-oriented methods of sequence analysis and the coordinate-oriented methods of protein structure analysis. Results A Structural Alphabet has been derived by clustering all four-residue fragments of a high-resolution subset of the protein data bank and extracting the high-density states as representative conformational states. Each fragment is uniquely defined by a set of three independent angles corresponding to its degrees of freedom, capturing in simple and intuitive terms the properties of the conformational space. The fragments of the Structural Alphabet are equivalent to the conformational attractors and therefore yield a most informative encoding of proteins. Proteins can be reconstructed within the experimental uncertainty in structure determination and ensembles of structures can be encoded with accuracy and robustness. Conclusions The density-based Structural Alphabet provides a novel tool to describe local conformations and it is specifically suitable for application in studies of protein dynamics. PMID:20170534
Characterization of protein-folding pathways by reduced-space modeling.
Kmiecik, Sebastian; Kolinski, Andrzej
2007-07-24
Ab initio simulations of the folding pathways are currently limited to very small proteins. For larger proteins, some approximations or simplifications in protein models need to be introduced. Protein folding and unfolding are among the basic processes in the cell and are very difficult to characterize in detail by experiment or simulation. Chymotrypsin inhibitor 2 (CI2) and barnase are probably the best characterized experimentally in this respect. For these model systems, initial folding stages were simulated by using CA-CB-side chain (CABS), a reduced-space protein-modeling tool. CABS employs knowledge-based potentials that proved to be very successful in protein structure prediction. With the use of isothermal Monte Carlo (MC) dynamics, initiation sites with a residual structure and weak tertiary interactions were identified. Such structures are essential for the initiation of the folding process through a sequential reduction of the protein conformational space, overcoming the Levinthal paradox in this manner. Furthermore, nucleation sites that initiate a tertiary interactions network were located. The MC simulations correspond perfectly to the results of experimental and theoretical research and bring insights into CI2 folding mechanism: unambiguous sequence of folding events was reported as well as cooperative substructures compatible with those obtained in recent molecular dynamics unfolding studies. The correspondence between the simulation and experiment shows that knowledge-based potentials are not only useful in protein structure predictions but are also capable of reproducing the folding pathways. Thus, the results of this work significantly extend the applicability range of reduced models in the theoretical study of proteins.
Zhang, Lina; Zhang, Chengjin; Gao, Rui; Yang, Runtao
2015-09-09
Bacteriophage virion proteins and non-virion proteins have distinct functions in biological processes, such as specificity determination for host bacteria, bacteriophage replication and transcription. Accurate identification of bacteriophage virion proteins from bacteriophage protein sequences is significant to understand the complex virulence mechanism in host bacteria and the influence of bacteriophages on the development of antibacterial drugs. In this study, an ensemble method for bacteriophage virion protein prediction from bacteriophage protein sequences is put forward with hybrid feature spaces incorporating CTD (composition, transition and distribution), bi-profile Bayes, PseAAC (pseudo-amino acid composition) and PSSM (position-specific scoring matrix). When performing on the training dataset 10-fold cross-validation, the presented method achieves a satisfactory prediction result with a sensitivity of 0.870, a specificity of 0.830, an accuracy of 0.850 and Matthew's correlation coefficient (MCC) of 0.701, respectively. To evaluate the prediction performance objectively, an independent testing dataset is used to evaluate the proposed method. Encouragingly, our proposed method performs better than previous studies with a sensitivity of 0.853, a specificity of 0.815, an accuracy of 0.831 and MCC of 0.662 on the independent testing dataset. These results suggest that the proposed method can be a potential candidate for bacteriophage virion protein prediction, which may provide a useful tool to find novel antibacterial drugs and to understand the relationship between bacteriophage and host bacteria. For the convenience of the vast majority of experimental Int. J. Mol. Sci. 2015, 16,21735 scientists, a user-friendly and publicly-accessible web-server for the proposed ensemble method is established.
An alternative view of protein fold space.
Shindyalov, I N; Bourne, P E
2000-02-15
Comparing and subsequently classifying protein structures information has received significant attention concurrent with the increase in the number of experimentally derived 3-dimensional structures. Classification schemes have focused on biological function found within protein domains and on structure classification based on topology. Here an alternative view is presented that groups substructures. Substructures are long (50-150 residue) highly repetitive near-contiguous pieces of polypeptide chain that occur frequently in a set of proteins from the PDB defined as structurally non-redundant over the complete polypeptide chain. The substructure classification is based on a previously reported Combinatorial Extension (CE) algorithm that provides a significantly different set of structure alignments than those previously described, having, for example, only a 40% overlap with FSSP. Qualitatively the algorithm provides longer contiguous aligned segments at the price of a slightly higher root-mean-square deviation (rmsd). Clustering these alignments gives a discreet and highly repetitive set of substructures not detectable by sequence similarity alone. In some cases different substructures represent all or different parts of well known folds indicative of the Russian doll effect--the continuity of protein fold space. In other cases they fall into different structure and functional classifications. It is too early to determine whether these newly classified substructures represent new insights into the evolution of a structural framework important to many proteins. What is apparent from on-going work is that these substructures have the potential to be useful probes in finding remote sequence homology and in structure prediction studies. The characteristics of the complete all-by-all comparison of the polypeptide chains present in the PDB and details of the filtering procedure by pair-wise structure alignment that led to the emergent substructure gallery are discussed. Substructure classification, alignments, and tools to analyze them are available at http://cl.sdsc.edu/ce.html.
Towse, Clare-Louise; Akke, Mikael; Daggett, Valerie
2017-04-27
Molecular dynamics (MD) simulations contain considerable information with regard to the motions and fluctuations of a protein, the magnitude of which can be used to estimate conformational entropy. Here we survey conformational entropy across protein fold space using the Dynameomics database, which represents the largest existing data set of protein MD simulations for representatives of essentially all known protein folds. We provide an overview of MD-derived entropies accounting for all possible degrees of dihedral freedom on an unprecedented scale. Although different side chains might be expected to impose varying restrictions on the conformational space that the backbone can sample, we found that the backbone entropy and side chain size are not strictly coupled. An outcome of these analyses is the Dynameomics Entropy Dictionary, the contents of which have been compared with entropies derived by other theoretical approaches and experiment. As might be expected, the conformational entropies scale linearly with the number of residues, demonstrating that conformational entropy is an extensive property of proteins. The calculated conformational entropies of folding agree well with previous estimates. Detailed analysis of specific cases identifies deviations in conformational entropy from the average values that highlight how conformational entropy varies with sequence, secondary structure, and tertiary fold. Notably, α-helices have lower entropy on average than do β-sheets, and both are lower than coil regions.
Cladaras, C; Hadzopoulou-Cladaras, M; Nolte, R T; Atkinson, D; Zannis, V I
1986-01-01
We have isolated and sequenced overlapping cDNA clones covering the entire sequence of human apolipoprotein B-100 (apoB-100). DNA sequence analysis and determination of the mRNA transcription initiation site by S1 nuclease mapping showed that the apoB mRNA consists of 14,112 nucleotides including the 5' and 3' untranslated regions which are 128 and 301 nucleotides respectively. The DNA-derived protein sequence shows that apoB-100 is 513,000 daltons and contains 4560 amino acids including a 24-amino-acid-long signal peptide. The mol. wt of apoB-100 implies that there is one apoB molecule per LDL particle. Computer analysis of the predicted secondary structure of the protein showed that some of the potential alpha helical and beta sheet structures are amphipathic, whereas others have non-amphipathic neutral to apolar character. These latter regions may contribute to the formation of the lipid-binding domains of apoB-100. The protein contains 25 cysteines and 20 potential N-glycosylation sites. The majority of cysteines are distributed in the amino terminal portion of the protein. Four of the potential glycosylation sites are in predicted beta turn structures and may represent true glycosylation positions. ApoB lacks the tandem repeats which are characteristic of other apolipoproteins. The mean hydrophobicity the mean value of H1 and helical hydrophobic moment the mean value of microH profiles of apoB showed the presence of several potential helical regions with strong polar character and high hydrophobic moment. The region with the highest hydrophobic moment, between amino acid residues 3352 and 3369, contains five closely spaced, positively charged residues, and has sequence homology to the LDL receptor binding site of apoE. This region is flanked by three neighbouring regions with positively charged amino acids and high hydrophobic moment that are located between residues 3174 and 3681. One or more of these closely spaced apoB sequences may be involved in the formation of the LDL receptor-binding domain of apoB-100. Blotting analysis of intestinal RNA and hybridization of the blots with carboxy apoB cDNA probes produced a single 15-kb hybridization band whereas hybridization with amino terminal probes produced two hybridization bands of 15 and 8 kb. Our data indicate that both forms of apoB mRNA contain common sequences which extend from the amino terminal of apoB-100 to the vicinity of nucleotide residue 6300. These two messages may have resulted from differential splicing of the same primary apoB mRNA transcript. Images Fig. 4. Fig. 6. PMID:3030729
A Coarse-Grained Protein Model in a Water-like Solvent
NASA Astrophysics Data System (ADS)
Sharma, Sumit; Kumar, Sanat K.; Buldyrev, Sergey V.; Debenedetti, Pablo G.; Rossky, Peter J.; Stanley, H. Eugene
2013-05-01
Simulations employing an explicit atom description of proteins in solvent can be computationally expensive. On the other hand, coarse-grained protein models in implicit solvent miss essential features of the hydrophobic effect, especially its temperature dependence, and have limited ability to capture the kinetics of protein folding. We propose a free space two-letter protein (``H-P'') model in a simple, but qualitatively accurate description for water, the Jagla model, which coarse-grains water into an isotropically interacting sphere. Using Monte Carlo simulations, we design protein-like sequences that can undergo a collapse, exposing the ``Jagla-philic'' monomers to the solvent, while maintaining a ``hydrophobic'' core. This protein-like model manifests heat and cold denaturation in a manner that is reminiscent of proteins. While this protein-like model lacks the details that would introduce secondary structure formation, we believe that these ideas represent a first step in developing a useful, but computationally expedient, means of modeling proteins.
Shao, Qiang
2014-06-05
A comparative study on the folding of multiple three-α-helix bundle proteins including α3D, α3W, and the B domain of protein A (BdpA) is presented. The use of integrated-tempering-sampling molecular dynamics simulations achieves reversible folding and unfolding events in individual short trajectories, which thus provides an efficient approach to sufficiently sample the configuration space of protein and delineate the folding pathway of α-helix bundle. The detailed free energy landscape analyses indicate that the folding mechanism of α-helix bundle is not uniform but sequence dependent. A simple model is then proposed to predict folding mechanism of α-helix bundle on the basis of amino acid composition: α-helical proteins containing higher percentage of hydrophobic residues than charged ones fold via nucleation-condensation mechanism (e.g., α3D and BdpA) whereas proteins having opposite tendency in amino acid composition more likely fold via the framework mechanism (e.g., α3W). The model is tested on various α-helix bundle proteins, and the predicted mechanism is similar to the most approved one for each protein. In addition, the common features in the folding pathway of α-helix bundle protein are also deduced. In summary, the present study provides comprehensive, atomic-level picture of the folding of α-helix bundle proteins.
SLLE for predicting membrane protein types.
Wang, Meng; Yang, Jie; Xu, Zhi-Jie; Chou, Kuo-Chen
2005-01-07
Introduction of the concept of pseudo amino acid composition (PROTEINS: Structure, Function, and Genetics 43 (2001) 246; Erratum: ibid. 44 (2001) 60) has made it possible to incorporate a considerable amount of sequence-order effects by representing a protein sample in terms of a set of discrete numbers, and hence can significantly enhance the prediction quality of membrane protein type. As a continuous effort along such a line, the Supervised Locally Linear Embedding (SLLE) technique for nonlinear dimensionality reduction is introduced (Science 22 (2000) 2323). The advantage of using SLLE is that it can reduce the operational space by extracting the essential features from the high-dimensional pseudo amino acid composition space, and that the cluster-tolerant capacity can be increased accordingly. As a consequence by combining these two approaches, high success rates have been observed during the tests of self-consistency, jackknife and independent data set, respectively, by using the simplest nearest neighbour classifier. The current approach represents a new strategy to deal with the problems of protein attribute prediction, and hence may become a useful vehicle in the area of bioinformatics and proteomics.
Zbilut, Joseph P.; Colosimo, Alfredo; Conti, Filippo; Colafranceschi, Mauro; Manetti, Cesare; Valerio, MariaCristina; Webber, Charles L.; Giuliani, Alessandro
2003-01-01
The problem of protein folding vs. aggregation was investigated in acylphosphatase and the amyloid protein Aβ(1–40) by means of nonlinear signal analysis of their chain hydrophobicity. Numerical descriptors of recurrence patterns provided the basis for statistical evaluation of folding/aggregation distinctive features. Static and dynamic approaches were used to elucidate conditions coincident with folding vs. aggregation using comparisons with known protein secondary structure classifications, site-directed mutagenesis studies of acylphosphatase, and molecular dynamics simulations of amyloid protein, Aβ(1–40). The results suggest that a feature derived from principal component space characterized by the smoothness of singular, deterministic hydrophobicity patches plays a significant role in the conditions governing protein aggregation. PMID:14645049
A generalized global alignment algorithm.
Huang, Xiaoqiu; Chao, Kun-Mao
2003-01-22
Homologous sequences are sometimes similar over some regions but different over other regions. Homologous sequences have a much lower global similarity if the different regions are much longer than the similar regions. We present a generalized global alignment algorithm for comparing sequences with intermittent similarities, an ordered list of similar regions separated by different regions. A generalized global alignment model is defined to handle sequences with intermittent similarities. A dynamic programming algorithm is designed to compute an optimal general alignment in time proportional to the product of sequence lengths and in space proportional to the sum of sequence lengths. The algorithm is implemented as a computer program named GAP3 (Global Alignment Program Version 3). The generalized global alignment model is validated by experimental results produced with GAP3 on both DNA and protein sequences. The GAP3 program extends the ability of standard global alignment programs to recognize homologous sequences of lower similarity. The GAP3 program is freely available for academic use at http://bioinformatics.iastate.edu/aat/align/align.html.
Yu, Dongjun; Wu, Xiaowei; Shen, Hongbin; Yang, Jian; Tang, Zhenmin; Qi, Yong; Yang, Jingyu
2012-12-01
Membrane proteins are encoded by ~ 30% in the genome and function importantly in the living organisms. Previous studies have revealed that membrane proteins' structures and functions show obvious cell organelle-specific properties. Hence, it is highly desired to predict membrane protein's subcellular location from the primary sequence considering the extreme difficulties of membrane protein wet-lab studies. Although many models have been developed for predicting protein subcellular locations, only a few are specific to membrane proteins. Existing prediction approaches were constructed based on statistical machine learning algorithms with serial combination of multi-view features, i.e., different feature vectors are simply serially combined to form a super feature vector. However, such simple combination of features will simultaneously increase the information redundancy that could, in turn, deteriorate the final prediction accuracy. That's why it was often found that prediction success rates in the serial super space were even lower than those in a single-view space. The purpose of this paper is investigation of a proper method for fusing multiple multi-view protein sequential features for subcellular location predictions. Instead of serial strategy, we propose a novel parallel framework for fusing multiple membrane protein multi-view attributes that will represent protein samples in complex spaces. We also proposed generalized principle component analysis (GPCA) for feature reduction purpose in the complex geometry. All the experimental results through different machine learning algorithms on benchmark membrane protein subcellular localization datasets demonstrate that the newly proposed parallel strategy outperforms the traditional serial approach. We also demonstrate the efficacy of the parallel strategy on a soluble protein subcellular localization dataset indicating the parallel technique is flexible to suite for other computational biology problems. The software and datasets are available at: http://www.csbio.sjtu.edu.cn/bioinf/mpsp.
MultiSeq: unifying sequence and structure data for evolutionary analysis
Roberts, Elijah; Eargle, John; Wright, Dan; Luthey-Schulten, Zaida
2006-01-01
Background Since the publication of the first draft of the human genome in 2000, bioinformatic data have been accumulating at an overwhelming pace. Currently, more than 3 million sequences and 35 thousand structures of proteins and nucleic acids are available in public databases. Finding correlations in and between these data to answer critical research questions is extremely challenging. This problem needs to be approached from several directions: information science to organize and search the data; information visualization to assist in recognizing correlations; mathematics to formulate statistical inferences; and biology to analyze chemical and physical properties in terms of sequence and structure changes. Results Here we present MultiSeq, a unified bioinformatics analysis environment that allows one to organize, display, align and analyze both sequence and structure data for proteins and nucleic acids. While special emphasis is placed on analyzing the data within the framework of evolutionary biology, the environment is also flexible enough to accommodate other usage patterns. The evolutionary approach is supported by the use of predefined metadata, adherence to standard ontological mappings, and the ability for the user to adjust these classifications using an electronic notebook. MultiSeq contains a new algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of a homologous group of distantly related proteins. The method, based on the multidimensional QR factorization of multiple sequence and structure alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. Conclusion MultiSeq is a major extension of the Multiple Alignment tool that is provided as part of VMD, a structural visualization program for analyzing molecular dynamics simulations. Both are freely distributed by the NIH Resource for Macromolecular Modeling and Bioinformatics and MultiSeq is included with VMD starting with version 1.8.5. The MultiSeq website has details on how to download and use the software: PMID:16914055
Cloning of the Escherichia coli endo-1,4-D-glucanase gene and identification of its product.
Park, Y W; Yun, H D
1999-03-01
A plasmid (pYP17) containing a genomic DNA insert from Escherichia coli K-12 that confers the ability to hydrolyze carboxymethylcellulose (CMC) was isolated from a genomic library constructed in the cosmid vector pLAFR3 in E. coli DH5alpha. A small 1.65-kb fragment, designated bcsC (pYP300), was sequenced and found to contain an ORF of 1,104 bp encoding a protein of 368 amino acid residues, with a calculated molecular weight of 41,700 Da. BcsC carries a typical prokaryotic signal peptide of 21 amino acid residues. The predicted amino acid sequence of the BcsC protein is similar to that of CelY of Erwinia chrysanthemi, CMCase of Cellulomonas uda, EngX of Acetobacter xylinum, and CelC of Agrobacterium tumefaciens. Based on these sequence similarities, we propose that the bcsC gene is a member of glycosyl hydrolase family 8. The apparent molecular mass of the protein, when expressed in E. coli, is approximately 40 kDa, and the CMCase activity is found mainly in the extracellular space. The enzyme is optimally active at pH 7 and a temperature of 40 degrees C.
Vasquez, Kevin A; Hatridge, Taylor A; Curtis, Nicholas C; Contreras, Lydia M
2016-02-19
Recent studies have demonstrated that effective protein production requires coordination of multiple cotranslational cellular processes, which are heavily affected by translation timing. Until recently, protein engineering has focused on codon optimization to maximize protein production rates, mostly considering the effect of tRNA abundance. However, as it relates to complex multidomain proteins, it has been hypothesized that strategic translational pauses between domains and between distinct individual structural motifs can prevent interactions between nascent chain fragments that generate kinetically trapped misfolded peptides and thereby enhance protein yields. In this study, we introduce synthetic transient pauses between structural domains in a heterologous model protein based on designed patterns of affinity between the mRNA and the anti-Shine-Dalgarno (aSD) sequence on the ribosome. We demonstrate that optimizing translation attenuation at domain boundaries can predictably affect solubility patterns in bacteria. Exploration of the affinity space showed that modifying less than 1% of the nucleotides (on a small 12 amino acid linker) can vary soluble protein yields up to ∼7-fold without altering the primary sequence of the protein. In the context of longer linkers, where a larger number of distinct structural motifs can fold outside the ribosome, optimal synonymous codon variations resulted in an additional 2.1-fold increase in solubility, relative to that of nonoptimized linkers of the same length. While rational construction of 54 linkers of various affinities showed a significant correlation between protein solubility and predicted affinity, only weaker correlations were observed between tRNA abundance and protein solubility. We also demonstrate that naturally occurring high-affinity clusters are present between structural domains of β-galactosidase, one of Escherichia coli's largest native proteins. Interdomain ribosomal affinity is an important factor that has not previously been explored in the context of protein engineering.
Suckau, Detlev; Resemann, Anja
2009-12-01
The ability to match Top-Down protein sequencing (TDS) results by MALDI-TOF to protein sequences by classical protein database searching was evaluated in this work. Resulting from these analyses were the protein identity, the simultaneous assignment of the N- and C-termini and protein sequences of up to 70 residues from either terminus. In combination with de novo sequencing using the MALDI-TDS data, even fusion proteins were assigned and the detailed sequence around the fusion site was elucidated. MALDI-TDS allowed to efficiently match protein sequences quickly and to validate recombinant protein structures-in particular, protein termini-on the level of undigested proteins.
Majid, Abdul; Ali, Safdar
2015-01-01
We developed genetic programming (GP)-based evolutionary ensemble system for the early diagnosis, prognosis and prediction of human breast cancer. This system has effectively exploited the diversity in feature and decision spaces. First, individual learners are trained in different feature spaces using physicochemical properties of protein amino acids. Their predictions are then stacked to develop the best solution during GP evolution process. Finally, results for HBC-Evo system are obtained with optimal threshold, which is computed using particle swarm optimization. Our novel approach has demonstrated promising results compared to state of the art approaches.
A generative, probabilistic model of local protein structure.
Boomsma, Wouter; Mardia, Kanti V; Taylor, Charles C; Ferkinghoff-Borg, Jesper; Krogh, Anders; Hamelryck, Thomas
2008-07-01
Despite significant progress in recent years, protein structure prediction maintains its status as one of the prime unsolved problems in computational biology. One of the key remaining challenges is an efficient probabilistic exploration of the structural space that correctly reflects the relative conformational stabilities. Here, we present a fully probabilistic, continuous model of local protein structure in atomic detail. The generative model makes efficient conformational sampling possible and provides a framework for the rigorous analysis of local sequence-structure correlations in the native state. Our method represents a significant theoretical and practical improvement over the widely used fragment assembly technique by avoiding the drawbacks associated with a discrete and nonprobabilistic approach.
Generating intrinsically disordered protein conformational ensembles from a Markov chain
NASA Astrophysics Data System (ADS)
Cukier, Robert I.
2018-03-01
Intrinsically disordered proteins (IDPs) sample a diverse conformational space. They are important to signaling and regulatory pathways in cells. An entropy penalty must be payed when an IDP becomes ordered upon interaction with another protein or a ligand. Thus, the degree of conformational disorder of an IDP is of interest. We create a dichotomic Markov model that can explore entropic features of an IDP. The Markov condition introduces local (neighbor residues in a protein sequence) rotamer dependences that arise from van der Waals and other chemical constraints. A protein sequence of length N is characterized by its (information) entropy and mutual information, MIMC, the latter providing a measure of the dependence among the random variables describing the rotamer probabilities of the residues that comprise the sequence. For a Markov chain, the MIMC is proportional to the pair mutual information MI which depends on the singlet and pair probabilities of neighbor residue rotamer sampling. All 2N sequence states are generated, along with their probabilities, and contrasted with the probabilities under the assumption of independent residues. An efficient method to generate realizations of the chain is also provided. The chain entropy, MIMC, and state probabilities provide the ingredients to distinguish different scenarios using the terminologies: MoRF (molecular recognition feature), not-MoRF, and not-IDP. A MoRF corresponds to large entropy and large MIMC (strong dependence among the residues' rotamer sampling), a not-MoRF corresponds to large entropy but small MIMC, and not-IDP corresponds to low entropy irrespective of the MIMC. We show that MorFs are most appropriate as descriptors of IDPs. They provide a reasonable number of high-population states that reflect the dependences between neighbor residues, thus classifying them as IDPs, yet without very large entropy that might lead to a too high entropy penalty.
Thermodynamic prediction of protein neutrality.
Bloom, Jesse D; Silberg, Jonathan J; Wilke, Claus O; Drummond, D Allan; Adami, Christoph; Arnold, Frances H
2005-01-18
We present a simple theory that uses thermodynamic parameters to predict the probability that a protein retains the wild-type structure after one or more random amino acid substitutions. Our theory predicts that for large numbers of substitutions the probability that a protein retains its structure will decline exponentially with the number of substitutions, with the severity of this decline determined by properties of the structure. Our theory also predicts that a protein can gain extra robustness to the first few substitutions by increasing its thermodynamic stability. We validate our theory with simulations on lattice protein models and by showing that it quantitatively predicts previously published experimental measurements on subtilisin and our own measurements on variants of TEM1 beta-lactamase. Our work unifies observations about the clustering of functional proteins in sequence space, and provides a basis for interpreting the response of proteins to substitutions in protein engineering applications.
Thermodynamic prediction of protein neutrality
Bloom, Jesse D.; Silberg, Jonathan J.; Wilke, Claus O.; Drummond, D. Allan; Adami, Christoph; Arnold, Frances H.
2005-01-01
We present a simple theory that uses thermodynamic parameters to predict the probability that a protein retains the wild-type structure after one or more random amino acid substitutions. Our theory predicts that for large numbers of substitutions the probability that a protein retains its structure will decline exponentially with the number of substitutions, with the severity of this decline determined by properties of the structure. Our theory also predicts that a protein can gain extra robustness to the first few substitutions by increasing its thermodynamic stability. We validate our theory with simulations on lattice protein models and by showing that it quantitatively predicts previously published experimental measurements on subtilisin and our own measurements on variants of TEM1 β-lactamase. Our work unifies observations about the clustering of functional proteins in sequence space, and provides a basis for interpreting the response of proteins to substitutions in protein engineering applications. PMID:15644440
Nesprin 4 is an outer nuclear membrane protein that can induce kinesin-mediated cell polarization
Roux, Kyle J.; Crisp, Melissa L.; Liu, Qian; Kim, Daein; Kozlov, Serguei; Stewart, Colin L.; Burke, Brian
2009-01-01
Nucleocytoplasmic coupling is mediated by outer nuclear membrane (ONM) nesprin proteins and inner nuclear membrane Sun proteins. Interactions spanning the perinuclear space create nesprin–Sun complexes connecting the cytoskeleton to nuclear components. A search for proteins displaying a conserved C-terminal sequence present in nesprins 1–3 identified nesprin 4 (Nesp4), a new member of this family. Nesp4 is a kinesin-1-binding protein that displays Sun-dependent localization to the ONM. Expression of Nesp4 is associated with dramatic changes in cellular organization involving relocation of the centrosome and Golgi apparatus relative to the nucleus. These effects can be accounted for entirely by Nesp4's kinesin-binding function. The implication is that Nesp4 may contribute to microtubule-dependent nuclear positioning. PMID:19164528
Nesprin 4 is an outer nuclear membrane protein that can induce kinesin-mediated cell polarization.
Roux, Kyle J; Crisp, Melissa L; Liu, Qian; Kim, Daein; Kozlov, Serguei; Stewart, Colin L; Burke, Brian
2009-02-17
Nucleocytoplasmic coupling is mediated by outer nuclear membrane (ONM) nesprin proteins and inner nuclear membrane Sun proteins. Interactions spanning the perinuclear space create nesprin-Sun complexes connecting the cytoskeleton to nuclear components. A search for proteins displaying a conserved C-terminal sequence present in nesprins 1-3 identified nesprin 4 (Nesp4), a new member of this family. Nesp4 is a kinesin-1-binding protein that displays Sun-dependent localization to the ONM. Expression of Nesp4 is associated with dramatic changes in cellular organization involving relocation of the centrosome and Golgi apparatus relative to the nucleus. These effects can be accounted for entirely by Nesp4's kinesin-binding function. The implication is that Nesp4 may contribute to microtubule-dependent nuclear positioning.
Engineering Signal Peptides for Enhanced Protein Secretion from Lactococcus lactis
Ng, Daphne T. W.
2013-01-01
Lactococcus lactis is an attractive vehicle for biotechnological production of proteins and clinical delivery of therapeutics. In many such applications using this host, it is desirable to maximize secretion of recombinant proteins into the extracellular space, which is typically achieved by using the native signal peptide from a major secreted lactococcal protein, Usp45. In order to further increase protein secretion from L. lactis, inherent limitations of the Usp45 signal peptide (Usp45sp) must be elucidated. Here, we performed extensive mutagenesis on Usp45sp to probe the effects of both the mRNA sequence (silent mutations) and the peptide sequence (amino acid substitutions) on secretion. We screened signal peptides based on their resulting secretion levels of Staphylococcus aureus nuclease and further evaluated them for secretion of Bacillus subtilis α-amylase. Silent mutations alone gave an increase of up to 16% in the secretion of α-amylase through a mechanism consistent with relaxed mRNA folding around the ribosome binding site and enhanced translation. Targeted amino acid mutagenesis in Usp45sp, combined with additional silent mutations from the best clone in the initial screen, yielded an increase of up to 51% in maximum secretion of α-amylase while maintaining secretion at lower induction levels. The best sequence from our screen preserves the tripartite structure of the native signal peptide but increases the positive charge of the n-region. Our study presents the first example of an engineered L. lactis signal peptide with a higher secretion yield than Usp45sp and, more generally, provides strategies for further enhancing protein secretion in bacterial hosts. PMID:23124224
Engineering signal peptides for enhanced protein secretion from Lactococcus lactis.
Ng, Daphne T W; Sarkar, Casim A
2013-01-01
Lactococcus lactis is an attractive vehicle for biotechnological production of proteins and clinical delivery of therapeutics. In many such applications using this host, it is desirable to maximize secretion of recombinant proteins into the extracellular space, which is typically achieved by using the native signal peptide from a major secreted lactococcal protein, Usp45. In order to further increase protein secretion from L. lactis, inherent limitations of the Usp45 signal peptide (Usp45sp) must be elucidated. Here, we performed extensive mutagenesis on Usp45sp to probe the effects of both the mRNA sequence (silent mutations) and the peptide sequence (amino acid substitutions) on secretion. We screened signal peptides based on their resulting secretion levels of Staphylococcus aureus nuclease and further evaluated them for secretion of Bacillus subtilis α-amylase. Silent mutations alone gave an increase of up to 16% in the secretion of α-amylase through a mechanism consistent with relaxed mRNA folding around the ribosome binding site and enhanced translation. Targeted amino acid mutagenesis in Usp45sp, combined with additional silent mutations from the best clone in the initial screen, yielded an increase of up to 51% in maximum secretion of α-amylase while maintaining secretion at lower induction levels. The best sequence from our screen preserves the tripartite structure of the native signal peptide but increases the positive charge of the n-region. Our study presents the first example of an engineered L. lactis signal peptide with a higher secretion yield than Usp45sp and, more generally, provides strategies for further enhancing protein secretion in bacterial hosts.
In Silico Pattern-Based Analysis of the Human Cytomegalovirus Genome
Rigoutsos, Isidore; Novotny, Jiri; Huynh, Tien; Chin-Bow, Stephen T.; Parida, Laxmi; Platt, Daniel; Coleman, David; Shenk, Thomas
2003-01-01
More than 200 open reading frames (ORFs) from the human cytomegalovirus genome have been reported as potentially coding for proteins. We have used two pattern-based in silico approaches to analyze this set of putative viral genes. With the help of an objective annotation method that is based on the Bio-Dictionary, a comprehensive collection of amino acid patterns that describes the currently known natural sequence space of proteins, we have reannotated all of the previously reported putative genes of the human cytomegalovirus. Also, with the help of MUSCA, a pattern-based multiple sequence alignment algorithm, we have reexamined the original human cytomegalovirus gene family definitions. Our analysis of the genome shows that many of the coded proteins comprise amino acid combinations that are unique to either the human cytomegalovirus or the larger group of herpesviruses. We have confirmed that a surprisingly large portion of the analyzed ORFs encode membrane proteins, and we have discovered a significant number of previously uncharacterized proteins that are predicted to be G-protein-coupled receptor homologues. The analysis also indicates that many of the encoded proteins undergo posttranslational modifications such as hydroxylation, phosphorylation, and glycosylation. ORFs encoding proteins with similar functional behavior appear in neighboring regions of the human cytomegalovirus genome. All of the results of the present study can be found and interactively explored online (http://cbcsrv.watson.ibm.com/virus/). PMID:12634390
Zeldovich, Konstantin B; Chen, Peiqiu; Shakhnovich, Boris E; Shakhnovich, Eugene I
2007-07-01
In this work we develop a microscopic physical model of early evolution where phenotype--organism life expectancy--is directly related to genotype--the stability of its proteins in their native conformations-which can be determined exactly in the model. Simulating the model on a computer, we consistently observe the "Big Bang" scenario whereby exponential population growth ensues as soon as favorable sequence-structure combinations (precursors of stable proteins) are discovered. Upon that, random diversity of the structural space abruptly collapses into a small set of preferred proteins. We observe that protein folds remain stable and abundant in the population at timescales much greater than mutation or organism lifetime, and the distribution of the lifetimes of dominant folds in a population approximately follows a power law. The separation of evolutionary timescales between discovery of new folds and generation of new sequences gives rise to emergence of protein families and superfamilies whose sizes are power-law distributed, closely matching the same distributions for real proteins. On the population level we observe emergence of species--subpopulations that carry similar genomes. Further, we present a simple theory that relates stability of evolving proteins to the sizes of emerging genomes. Together, these results provide a microscopic first-principles picture of how first-gene families developed in the course of early evolution.
In silico pattern-based analysis of the human cytomegalovirus genome.
Rigoutsos, Isidore; Novotny, Jiri; Huynh, Tien; Chin-Bow, Stephen T; Parida, Laxmi; Platt, Daniel; Coleman, David; Shenk, Thomas
2003-04-01
More than 200 open reading frames (ORFs) from the human cytomegalovirus genome have been reported as potentially coding for proteins. We have used two pattern-based in silico approaches to analyze this set of putative viral genes. With the help of an objective annotation method that is based on the Bio-Dictionary, a comprehensive collection of amino acid patterns that describes the currently known natural sequence space of proteins, we have reannotated all of the previously reported putative genes of the human cytomegalovirus. Also, with the help of MUSCA, a pattern-based multiple sequence alignment algorithm, we have reexamined the original human cytomegalovirus gene family definitions. Our analysis of the genome shows that many of the coded proteins comprise amino acid combinations that are unique to either the human cytomegalovirus or the larger group of herpesviruses. We have confirmed that a surprisingly large portion of the analyzed ORFs encode membrane proteins, and we have discovered a significant number of previously uncharacterized proteins that are predicted to be G-protein-coupled receptor homologues. The analysis also indicates that many of the encoded proteins undergo posttranslational modifications such as hydroxylation, phosphorylation, and glycosylation. ORFs encoding proteins with similar functional behavior appear in neighboring regions of the human cytomegalovirus genome. All of the results of the present study can be found and interactively explored online (http://cbcsrv.watson.ibm.com/virus/).
Bastolla, Ugo
2014-01-01
The properties of biomolecules depend both on physics and on the evolutionary process that formed them. These two points of view produce a powerful synergism. Physics sets the stage and the constraints that molecular evolution has to obey, and evolutionary theory helps in rationalizing the physical properties of biomolecules, including protein folding thermodynamics. To complete the parallelism, protein thermodynamics is founded on the statistical mechanics in the space of protein structures, and molecular evolution can be viewed as statistical mechanics in the space of protein sequences. In this review, we will integrate both points of view, applying them to detecting selection on the stability of the folded state of proteins. We will start discussing positive design, which strengthens the stability of the folded against the unfolded state of proteins. Positive design justifies why statistical potentials for protein folding can be obtained from the frequencies of structural motifs. Stability against unfolding is easier to achieve for longer proteins. On the contrary, negative design, which consists in destabilizing frequently formed misfolded conformations, is more difficult to achieve for longer proteins. The folding rate can be enhanced by strengthening short-range native interactions, but this requirement contrasts with negative design, and evolution has to trade-off between them. Finally, selection can accelerate functional movements by favoring low frequency normal modes of the dynamics of the native state that strongly correlate with the functional conformation change. PMID:24970217
NASA Astrophysics Data System (ADS)
Mass, T.; Drake, J.; Haramaty, L.; Zelzion, U.; Bhattacharya, D.; Rosenthal, Y.; Falkowski, P. G.
2012-12-01
Atmospheric CO 2 levels are rising rapidly, resulting in a decrease in both oceanic pH, and the carbonate saturation state (Ω). It has been hypothesized that calcifying marine organisms, including reef-building corals, will be affected by the decline of the carbonate saturation state. However, it is still unclear how corals will respond to these changes, as their skeletal formation is biologically mediated and occurs in isolated space rather than directly from seawater. In corals new skeletal material is precipitated in the subcalicoblastic space between the skeleton and the calicoblastic epithelium which, does not exceed a few nanometers and contains the ''calcifying fluid''. The goal of our project is to understand how these fluids respond to changes in the surrounding seawater and in turn affects the biologically mediated calcification mechanisms at the molecular, cellular and tissue levels. While it is generally thought that an organic matrix, which contain a suite of proteins, lipids and poly-saccharides, take part in calcification process, the specific mechanism by which the mineral is precipitated is unknown. The organic matrix composed of two fractions: the soluble organic matrix (SOM) and the insoluble organic matrix (IOM). It is suggested that the IOM plays a role as structural proteins forming a framework for crystal growth whereas the SOM plays a role in nucleation and crystal growth. To address this question we have investigated both the structural framework proteins (Drake et al abstract submitted to the AGU fall meeting) the role of proteins in nucleation and crystal growth (this work). Here, we established cell cultures and sequenced the 458-megabase genome of the stony coral, Stylophora pistillata, using next-generation sequencing technology. This genome contains 21,678 predicted protein-coding genes. Many of the known protein components of invertebrate skeletal matrices are acidic and/or contain repeated sequences. We searched for genes encoding proteins with the following characteristics: (1) high content (>35%) of acidic amino acids (aspartate and glutamate), (2) at least 150 residues, and (3) a signal peptide. This approach revealed eight coral acidic proteins (CAPs). We confirmed the sequence of four candidates (CAP1-3 and 8). A search for similarity in the UniProt database and published coral's genomes and ESTs reveals high similarity of CAPs 2, 3, and 8 to both invertebrate and vertebrate acidic rich proteins. CAP1, however, has no significant matches in any of the databases. We expressed and purified these four proteins to examine their role in coral bio-mineralization. We show that the pure CAPs bind Ca+2, and furthermore, individual CAPs can precipitate calcium carbonate in vitro in artificial seawater. These results strongly suggest that aragonite precipitation in the calcifying region of corals is promoted by a highly conserved set of acidic proteins. Based purely on thermodynamic grounds, the predicted change in surface ocean pH should not affect the binding ability of these acidic proteins. To the extent these proteins are responsible for calcification in this and other corals, we suggest that corals will continue to calcify at pH changes predicted to occur in this century.
Ye, Wenwu; Wang, Yang; Shen, Danyu; Li, Delong; Pu, Tianhuizi; Jiang, Zide; Zhang, Zhengguang; Zheng, Xiaobo; Tyler, Brett M; Wang, Yuanchao
2016-07-01
On the basis of its downy mildew-like morphology, the litchi downy blight pathogen was previously named Peronophythora litchii. Recently, however, it was proposed to transfer this pathogen to Phytophthora clade 4. To better characterize this unusual oomycete species and important fruit pathogen, we obtained the genome sequence of Phytophthora litchii and compared it to those from other oomycete species. P. litchii has a small genome with tightly spaced genes. On the basis of a multilocus phylogenetic analysis, the placement of P. litchii in the genus Phytophthora is strongly supported. Effector proteins predicted included 245 RxLR, 30 necrosis-and-ethylene-inducing protein-like, and 14 crinkler proteins. The typical motifs, phylogenies, and activities of these effectors were typical for a Phytophthora species. However, like the genome features of the analyzed downy mildews, P. litchii exhibited a streamlined genome with a relatively small number of genes in both core and species-specific protein families. The low GC content and slight codon preferences of P. litchii sequences were similar to those of the analyzed downy mildews and a subset of Phytophthora species. Taken together, these observations suggest that P. litchii is a Phytophthora pathogen that is in the process of acquiring downy mildew-like genomic and morphological features. Thus P. litchii may provide a novel model for investigating morphological development and genomic adaptation in oomycete pathogens.
O'Donoghue, Patrick; Luthey-Schulten, Zaida
2005-02-25
We present a new algorithm, based on the multidimensional QR factorization, to remove redundancy from a multiple structural alignment by choosing representative protein structures that best preserve the phylogenetic tree topology of the homologous group. The classical QR factorization with pivoting, developed as a fast numerical solution to eigenvalue and linear least-squares problems of the form Ax=b, was designed to re-order the columns of A by increasing linear dependence. Removing the most linear dependent columns from A leads to the formation of a minimal basis set which well spans the phase space of the problem at hand. By recasting the problem of redundancy in multiple structural alignments into this framework, in which the matrix A now describes the multiple alignment, we adapted the QR factorization to produce a minimal basis set of protein structures which best spans the evolutionary (phase) space. The non-redundant and representative profiles obtained from this procedure, termed evolutionary profiles, are shown in initial results to outperform well-tested profiles in homology detection searches over a large sequence database. A measure of structural similarity between homologous proteins, Q(H), is presented. By properly accounting for the effect and presence of gaps, a phylogenetic tree computed using this metric is shown to be congruent with the maximum-likelihood sequence-based phylogeny. The results indicate that evolutionary information is indeed recoverable from the comparative analysis of protein structure alone. Applications of the QR ordering and this structural similarity metric to analyze the evolution of structure among key, universally distributed proteins involved in translation, and to the selection of representatives from an ensemble of NMR structures are also discussed.
PuLSE: Quality control and quantification of peptide sequences explored by phage display libraries.
Shave, Steven; Mann, Stefan; Koszela, Joanna; Kerr, Alastair; Auer, Manfred
2018-01-01
The design of highly diverse phage display libraries is based on assumption that DNA bases are incorporated at similar rates within the randomized sequence. As library complexity increases and expected copy numbers of unique sequences decrease, the exploration of library space becomes sparser and the presence of truly random sequences becomes critical. We present the program PuLSE (Phage Library Sequence Evaluation) as a tool for assessing randomness and therefore diversity of phage display libraries. PuLSE runs on a collection of sequence reads in the fastq file format and generates tables profiling the library in terms of unique DNA sequence counts and positions, translated peptide sequences, and normalized 'expected' occurrences from base to residue codon frequencies. The output allows at-a-glance quantitative quality control of a phage library in terms of sequence coverage both at the DNA base and translated protein residue level, which has been missing from toolsets and literature. The open source program PuLSE is available in two formats, a C++ source code package for compilation and integration into existing bioinformatics pipelines and precompiled binaries for ease of use.
Mapping membrane activity in undiscovered peptide sequence space using machine learning
Fulan, Benjamin M.; Wong, Gerard C. L.
2016-01-01
There are some ∼1,100 known antimicrobial peptides (AMPs), which permeabilize microbial membranes but have diverse sequences. Here, we develop a support vector machine (SVM)-based classifier to investigate ⍺-helical AMPs and the interrelated nature of their functional commonality and sequence homology. SVM is used to search the undiscovered peptide sequence space and identify Pareto-optimal candidates that simultaneously maximize the distance σ from the SVM hyperplane (thus maximize its “antimicrobialness”) and its ⍺-helicity, but minimize mutational distance to known AMPs. By calibrating SVM machine learning results with killing assays and small-angle X-ray scattering (SAXS), we find that the SVM metric σ correlates not with a peptide’s minimum inhibitory concentration (MIC), but rather its ability to generate negative Gaussian membrane curvature. This surprising result provides a topological basis for membrane activity common to AMPs. Moreover, we highlight an important distinction between the maximal recognizability of a sequence to a trained AMP classifier (its ability to generate membrane curvature) and its maximal antimicrobial efficacy. As mutational distances are increased from known AMPs, we find AMP-like sequences that are increasingly difficult for nature to discover via simple mutation. Using the sequence map as a discovery tool, we find a unexpectedly diverse taxonomy of sequences that are just as membrane-active as known AMPs, but with a broad range of primary functions distinct from AMP functions, including endogenous neuropeptides, viral fusion proteins, topogenic peptides, and amyloids. The SVM classifier is useful as a general detector of membrane activity in peptide sequences. PMID:27849600
A critical analysis of computational protein design with sparse residue interaction graphs
Georgiev, Ivelin S.
2017-01-01
Protein design algorithms enumerate a combinatorial number of candidate structures to compute the Global Minimum Energy Conformation (GMEC). To efficiently find the GMEC, protein design algorithms must methodically reduce the conformational search space. By applying distance and energy cutoffs, the protein system to be designed can thus be represented using a sparse residue interaction graph, where the number of interacting residue pairs is less than all pairs of mutable residues, and the corresponding GMEC is called the sparse GMEC. However, ignoring some pairwise residue interactions can lead to a change in the energy, conformation, or sequence of the sparse GMEC vs. the original or the full GMEC. Despite the widespread use of sparse residue interaction graphs in protein design, the above mentioned effects of their use have not been previously analyzed. To analyze the costs and benefits of designing with sparse residue interaction graphs, we computed the GMECs for 136 different protein design problems both with and without distance and energy cutoffs, and compared their energies, conformations, and sequences. Our analysis shows that the differences between the GMECs depend critically on whether or not the design includes core, boundary, or surface residues. Moreover, neglecting long-range interactions can alter local interactions and introduce large sequence differences, both of which can result in significant structural and functional changes. Designs on proteins with experimentally measured thermostability show it is beneficial to compute both the full and the sparse GMEC accurately and efficiently. To this end, we show that a provable, ensemble-based algorithm can efficiently compute both GMECs by enumerating a small number of conformations, usually fewer than 1000. This provides a novel way to combine sparse residue interaction graphs with provable, ensemble-based algorithms to reap the benefits of sparse residue interaction graphs while avoiding their potential inaccuracies. PMID:28358804
Revealing the global map of protein folding space by large-scale simulations
NASA Astrophysics Data System (ADS)
Sinner, Claude; Lutz, Benjamin; Verma, Abhinav; Schug, Alexander
2015-12-01
The full characterization of protein folding is a remarkable long-standing challenge both for experiment and simulation. Working towards a complete understanding of this process, one needs to cover the full diversity of existing folds and identify the general principles driving the process. Here, we want to understand and quantify the diversity in folding routes for a large and representative set of protein topologies covering the full range from all alpha helical topologies towards beta barrels guided by the key question: Does the majority of the observed routes contribute to the folding process or only a particular route? We identified a set of two-state folders among non-homologous proteins with a sequence length of 40-120 residues. For each of these proteins, we ran native-structure based simulations both with homogeneous and heterogeneous contact potentials. For each protein, we simulated dozens of folding transitions in continuous uninterrupted simulations and constructed a large database of kinetic parameters. We investigate folding routes by tracking the formation of tertiary structure interfaces and discuss whether a single specific route exists for a topology or if all routes are equiprobable. These results permit us to characterize the complete folding space for small proteins in terms of folding barrier ΔG‡, number of routes, and the route specificity RT.
Algorithm, applications and evaluation for protein comparison by Ramanujan Fourier transform.
Zhao, Jian; Wang, Jiasong; Hua, Wei; Ouyang, Pingkai
2015-12-01
The amino acid sequence of a protein determines its chemical properties, chain conformation and biological functions. Protein sequence comparison is of great importance to identify similarities of protein structures and infer their functions. Many properties of a protein correspond to the low-frequency signals within the sequence. Low frequency modes in protein sequences are linked to the secondary structures, membrane protein types, and sub-cellular localizations of the proteins. In this paper, we present Ramanujan Fourier transform (RFT) with a fast algorithm to analyze the low-frequency signals of protein sequences. The RFT method is applied to similarity analysis of protein sequences with the Resonant Recognition Model (RRM). The results show that the proposed fast RFT method on protein comparison is more efficient than commonly used discrete Fourier transform (DFT). RFT can detect common frequencies as significant feature for specific protein families, and the RFT spectrum heat-map of protein sequences demonstrates the information conservation in the sequence comparison. The proposed method offers a new tool for pattern recognition, feature extraction and structural analysis on protein sequences. Copyright © 2015 Elsevier Ltd. All rights reserved.
Zhang, Jiongmin; Jia, Ke; Jia, Jinmeng; Qian, Ying
2018-04-27
Comparing and classifying functions of gene products are important in today's biomedical research. The semantic similarity derived from the Gene Ontology (GO) annotation has been regarded as one of the most widely used indicators for protein interaction. Among the various approaches proposed, those based on the vector space model are relatively simple, but their effectiveness is far from satisfying. We propose a Hierarchical Vector Space Model (HVSM) for computing semantic similarity between different genes or their products, which enhances the basic vector space model by introducing the relation between GO terms. Besides the directly annotated terms, HVSM also takes their ancestors and descendants related by "is_a" and "part_of" relations into account. Moreover, HVSM introduces the concept of a Certainty Factor to calibrate the semantic similarity based on the number of terms annotated to genes. To assess the performance of our method, we applied HVSM to Homo sapiens and Saccharomyces cerevisiae protein-protein interaction datasets. Compared with TCSS, Resnik, and other classic similarity measures, HVSM achieved significant improvement for distinguishing positive from negative protein interactions. We also tested its correlation with sequence, EC, and Pfam similarity using online tool CESSM. HVSM showed an improvement of up to 4% compared to TCSS, 8% compared to IntelliGO, 12% compared to basic VSM, 6% compared to Resnik, 8% compared to Lin, 11% compared to Jiang, 8% compared to Schlicker, and 11% compared to SimGIC using AUC scores. CESSM test showed HVSM was comparable to SimGIC, and superior to all other similarity measures in CESSM as well as TCSS. Supplementary information and the software are available at https://github.com/kejia1215/HVSM .
Dean, Kimberly M; Grayhack, Elizabeth J
2012-12-01
We have developed a robust and sensitive method, called RNA-ID, to screen for cis-regulatory sequences in RNA using fluorescence-activated cell sorting (FACS) of yeast cells bearing a reporter in which expression of both superfolder green fluorescent protein (GFP) and yeast codon-optimized mCherry red fluorescent protein (RFP) is driven by the bidirectional GAL1,10 promoter. This method recapitulates previously reported progressive inhibition of translation mediated by increasing numbers of CGA codon pairs, and restoration of expression by introduction of a tRNA with an anticodon that base pairs exactly with the CGA codon. This method also reproduces effects of paromomycin and context on stop codon read-through. Five key features of this method contribute to its effectiveness as a selection for regulatory sequences: The system exhibits greater than a 250-fold dynamic range, a quantitative and dose-dependent response to known inhibitory sequences, exquisite resolution that allows nearly complete physical separation of distinct populations, and a reproducible signal between different cells transformed with the identical reporter, all of which are coupled with simple methods involving ligation-independent cloning, to create large libraries. Moreover, we provide evidence that there are sequences within a 9-nt library that cause reduced GFP fluorescence, suggesting that there are novel cis-regulatory sequences to be found even in this short sequence space. This method is widely applicable to the study of both RNA-mediated and codon-mediated effects on expression.
Folding and Stabilization of Native-Sequence-Reversed Proteins
Zhang, Yuanzhao; Weber, Jeffrey K; Zhou, Ruhong
2016-01-01
Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols. PMID:27113844
Folding and Stabilization of Native-Sequence-Reversed Proteins
NASA Astrophysics Data System (ADS)
Zhang, Yuanzhao; Weber, Jeffrey K.; Zhou, Ruhong
2016-04-01
Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols.
Jaeger, Alex M.; Makley, Leah N.; Gestwicki, Jason E.; Thiele, Dennis J.
2014-01-01
The heat shock transcription factor 1 (HSF1) activates expression of a variety of genes involved in cell survival, including protein chaperones, the protein degradation machinery, anti-apoptotic proteins, and transcription factors. Although HSF1 activation has been linked to amelioration of neurodegenerative disease, cancer cells exhibit a dependence on HSF1 for survival. Indeed, HSF1 drives a program of gene expression in cancer cells that is distinct from that activated in response to proteotoxic stress, and HSF1 DNA binding activity is elevated in cycling cells as compared with arrested cells. Active HSF1 homotrimerizes and binds to a DNA sequence consisting of inverted repeats of the pentameric sequence nGAAn, known as heat shock elements (HSEs). Recent comprehensive ChIP-seq experiments demonstrated that the architecture of HSEs is very diverse in the human genome, with deviations from the consensus sequence in the spacing, orientation, and extent of HSE repeats that could influence HSF1 DNA binding efficacy and the kinetics and magnitude of target gene expression. To understand the mechanisms that dictate binding specificity, HSF1 was purified as either a monomer or trimer and used to evaluate DNA-binding site preferences in vitro using fluorescence polarization and thermal denaturation profiling. These results were compared with quantitative chromatin immunoprecipitation assays in vivo. We demonstrate a role for specific orientations of extended HSE sequences in driving preferential HSF1 DNA binding to target loci in vivo. These studies provide a biochemical basis for understanding differential HSF1 target gene recognition and transcription in neurodegenerative disease and in cancer. PMID:25204655
Open-pNovo: De Novo Peptide Sequencing with Thousands of Protein Modifications.
Yang, Hao; Chi, Hao; Zhou, Wen-Jing; Zeng, Wen-Feng; He, Kun; Liu, Chao; Sun, Rui-Xiang; He, Si-Min
2017-02-03
De novo peptide sequencing has improved remarkably, but sequencing full-length peptides with unexpected modifications is still a challenging problem. Here we present an open de novo sequencing tool, Open-pNovo, for de novo sequencing of peptides with arbitrary types of modifications. Although the search space increases by ∼300 times, Open-pNovo is close to or even ∼10-times faster than the other three proposed algorithms. Furthermore, considering top-1 candidates on three MS/MS data sets, Open-pNovo can recall over 90% of the results obtained by any one traditional algorithm and report 5-87% more peptides, including 14-250% more modified peptides. On a high-quality simulated data set, ∼85% peptides with arbitrary modifications can be recalled by Open-pNovo, while hardly any results can be recalled by others. In summary, Open-pNovo is an excellent tool for open de novo sequencing and has great potential for discovering unexpected modifications in the real biological applications.
Savidor, Alon; Barzilay, Rotem; Elinger, Dalia; Yarden, Yosef; Lindzen, Moshit; Gabashvili, Alexandra; Adiv Tal, Ophir; Levin, Yishai
2017-06-01
Traditional "bottom-up" proteomic approaches use proteolytic digestion, LC-MS/MS, and database searching to elucidate peptide identities and their parent proteins. Protein sequences absent from the database cannot be identified, and even if present in the database, complete sequence coverage is rarely achieved even for the most abundant proteins in the sample. Thus, sequencing of unknown proteins such as antibodies or constituents of metaproteomes remains a challenging problem. To date, there is no available method for full-length protein sequencing, independent of a reference database, in high throughput. Here, we present Database-independent Protein Sequencing, a method for unambiguous, rapid, database-independent, full-length protein sequencing. The method is a novel combination of non-enzymatic, semi-random cleavage of the protein, LC-MS/MS analysis, peptide de novo sequencing, extraction of peptide tags, and their assembly into a consensus sequence using an algorithm named "Peptide Tag Assembler." As proof-of-concept, the method was applied to samples of three known proteins representing three size classes and to a previously un-sequenced, clinically relevant monoclonal antibody. Excluding leucine/isoleucine and glutamic acid/deamidated glutamine ambiguities, end-to-end full-length de novo sequencing was achieved with 99-100% accuracy for all benchmarking proteins and the antibody light chain. Accuracy of the sequenced antibody heavy chain, including the entire variable region, was also 100%, but there was a 23-residue gap in the constant region sequence. © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.
Deiana, Antonio; Giansanti, Andrea
2010-04-21
Natively unfolded proteins lack a well defined three dimensional structure but have important biological functions, suggesting a re-assignment of the structure-function paradigm. To assess that a given protein is natively unfolded requires laborious experimental investigations, then reliable sequence-only methods for predicting whether a sequence corresponds to a folded or to an unfolded protein are of interest in fundamental and applicative studies. Many proteins have amino acidic compositions compatible both with the folded and unfolded status, and belong to a twilight zone between order and disorder. This makes difficult a dichotomic classification of protein sequences into folded and natively unfolded ones. In this work we propose an operational method to identify proteins belonging to the twilight zone by combining into a consensus score good performing single predictors of folding. In this methodological paper dichotomic folding indexes are considered: hydrophobicity-charge, mean packing, mean pairwise energy, Poodle-W and a new global index, that is called here gVSL2, based on the local disorder predictor VSL2. The performance of these indexes is evaluated on different datasets, in particular on a new dataset composed by 2369 folded and 81 natively unfolded proteins. Poodle-W, gVSL2 and mean pairwise energy have good performance and stability in all the datasets considered and are combined into a strictly unanimous combination score SSU, that leaves proteins unclassified when the consensus of all combined indexes is not reached. The unclassified proteins: i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of folded and that of unfolded proteins. Our results show that proteins unclassified by SSU belong to a twilight zone. Proteins left unclassified by the consensus score SSU have physical properties intermediate between those of folded and those of natively unfolded proteins and their structural properties and evolutionary history are worth to be investigated.
2010-01-01
Background Natively unfolded proteins lack a well defined three dimensional structure but have important biological functions, suggesting a re-assignment of the structure-function paradigm. To assess that a given protein is natively unfolded requires laborious experimental investigations, then reliable sequence-only methods for predicting whether a sequence corresponds to a folded or to an unfolded protein are of interest in fundamental and applicative studies. Many proteins have amino acidic compositions compatible both with the folded and unfolded status, and belong to a twilight zone between order and disorder. This makes difficult a dichotomic classification of protein sequences into folded and natively unfolded ones. In this work we propose an operational method to identify proteins belonging to the twilight zone by combining into a consensus score good performing single predictors of folding. Results In this methodological paper dichotomic folding indexes are considered: hydrophobicity-charge, mean packing, mean pairwise energy, Poodle-W and a new global index, that is called here gVSL2, based on the local disorder predictor VSL2. The performance of these indexes is evaluated on different datasets, in particular on a new dataset composed by 2369 folded and 81 natively unfolded proteins. Poodle-W, gVSL2 and mean pairwise energy have good performance and stability in all the datasets considered and are combined into a strictly unanimous combination score SSU, that leaves proteins unclassified when the consensus of all combined indexes is not reached. The unclassified proteins: i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of folded and that of unfolded proteins. Conclusions Our results show that proteins unclassified by SSU belong to a twilight zone. Proteins left unclassified by the consensus score SSU have physical properties intermediate between those of folded and those of natively unfolded proteins and their structural properties and evolutionary history are worth to be investigated. PMID:20409339
Linke, Christian; Siemens, Nikolai; Middleditch, Martin J.; Kreikemeyer, Bernd; Baker, Edward N.
2012-01-01
The extracellular protein Epf from Streptococcus pyogenes is important for streptococcal adhesion to human epithelial cells. However, Epf has no sequence identity to any protein of known structure or function. Thus, several predicted domains of the 205 kDa protein Epf were cloned separately and expressed in Escherichia coli. The N-terminal domain of Epf was crystallized in space groups P21 and P212121 in the presence of the protease chymotrypsin. Mass spectrometry showed that the species crystallized corresponded to a fragment comprising residues 52–357 of Epf. Complete data sets were collected to 2.0 and 1.6 Å resolution, respectively, at the Australian Synchrotron. PMID:22750867
A Stochastic Evolutionary Model for Protein Structure Alignment and Phylogeny
Challis, Christopher J.; Schmidler, Scott C.
2012-01-01
We present a stochastic process model for the joint evolution of protein primary and tertiary structure, suitable for use in alignment and estimation of phylogeny. Indels arise from a classic Links model, and mutations follow a standard substitution matrix, whereas backbone atoms diffuse in three-dimensional space according to an Ornstein–Uhlenbeck process. The model allows for simultaneous estimation of evolutionary distances, indel rates, structural drift rates, and alignments, while fully accounting for uncertainty. The inclusion of structural information enables phylogenetic inference on time scales not previously attainable with sequence evolution models. The model also provides a tool for testing evolutionary hypotheses and improving our understanding of protein structural evolution. PMID:22723302
Crystallographic studies of bovine beta2-microglobulin.
Becker, J W; Ziffer, J A; Edelman, G M; Cunningham, B A
1977-01-01
Crystals of the bovine milk protein lactollin yield x-ray diffraction data extending to a resolution of 2.8 A. Lactollin is a bovine analogue of beta2-microglobulin, a protein that is homologous in amino acid sequence to the constant domains of immunoglobulins and is the light chain of the human and murine major histocompatability antigens. The protein crystallizes in the orthorhombic space group P2(1)2(1)2(1) with a = 77.4, b = 47.9, and c = 34.3 A. The unit cell parameters and physical chemical solution studies indicate that the molecule exists in the crystal and in solution as a single polypeptide chain of 12,000 daltons. Images PMID:71731
Integration of surface-active, periodically sequenced peptides into lipid-based microbubbles.
Badami, Joseph V; Desir, Pierre; Tu, Raymond S
2014-07-29
The development of microbubbles toward functional, "theranostic" particles requires the incorporation of constituents with high binding specificity and therapeutic efficacy. Integrating peptides or proteins into the shell of lipid-based microbubbles can provide a means to access both receptor-ligand interactions and therapeutic properties. Simultaneously, peptides or proteins can define the characteristic monolayer mechanics of lipid bubbles and eliminate the need for post-bubble generation modification. The ability to engineer peptide sequences de novo that effectively partition into the bubble monolayer remains parametrically daunting. This work contributes to this effort using two simple amphipathic helical peptides that examine the role of local electrostatics and secondary structure. The two periodically sequenced peptides both have three positive charges, but peptide "K-2.5" spaces those charges 2.5 amino acids apart, while peptide "K-6.0" spaces the charges six amino acids apart. Size populations were determined for bubbles containing each peptide species using light scattering, and a quantitative method was developed to clearly define the fraction of peptides binding onto the microbubble monolayer. The impact of both the initial peptide concentration and the zwitterionic:anionic lipid ratio on peptide binding was also evaluated. Our results indicate that the lipid ratio affected only K-6.0 binding, which appears to be an outcome of the greater ensemble average α-helical population of the K-6.0. These findings provide further insights into the role of charge separation on peptide secondary structure, establishing a simple design metric for peptide binding onto microbubble systems.
Sequence Complexity of Amyloidogenic Regions in Intrinsically Disordered Human Proteins
Das, Swagata; Pal, Uttam; Das, Supriya; Bagga, Khyati; Roy, Anupam; Mrigwani, Arpita; Maiti, Nakul C.
2014-01-01
An amyloidogenic region (AR) in a protein sequence plays a significant role in protein aggregation and amyloid formation. We have investigated the sequence complexity of AR that is present in intrinsically disordered human proteins. More than 80% human proteins in the disordered protein databases (DisProt+IDEAL) contained one or more ARs. With decrease of protein disorder, AR content in the protein sequence was decreased. A probability density distribution analysis and discrete analysis of AR sequences showed that ∼8% residue in a protein sequence was in AR and the region was in average 8 residues long. The residues in the AR were high in sequence complexity and it seldom overlapped with low complexity regions (LCR), which was largely abundant in disorder proteins. The sequences in the AR showed mixed conformational adaptability towards α-helix, β-sheet/strand and coil conformations. PMID:24594841
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vivian, J. P.; Porter, C.; Wilce, J. A.
2006-11-01
A preparation of replication terminator protein (RTP) of B. subtilis and a 37-base-pair TerI sequence (comprising two binding sites for RTP) has been purified and crystallized. The replication terminator protein (RTP) of Bacillus subtilis binds to specific DNA sequences that halt the progression of the replisome in a polar manner. These terminator complexes flank a defined region of the chromosome into which they allow replication forks to enter but not exit. Forcing the fusion of replication forks in a specific zone is thought to allow the coordination of post-replicative processes. The functional terminator complex comprises two homodimers each of 29more » kDa bound to overlapping binding sites. A preparation of RTP and a 37-base-pair TerI sequence (comprising two binding sites for RTP) has been purified and crystallized. A data set to 3.9 Å resolution with 97.0% completeness and an R{sub sym} of 12% was collected from a single flash-cooled crystal using synchrotron radiation. The diffraction data are consistent with space group P622, with unit-cell parameters a = b = 118.8, c = 142.6 Å.« less
How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis.
Tian, Pengfei; Best, Robert B
2017-10-17
Quantifying the relationship between protein sequence and structure is key to understanding the protein universe. A fundamental measure of this relationship is the total number of amino acid sequences that can fold to a target protein structure, known as the "sequence capacity," which has been suggested as a proxy for how designable a given protein fold is. Although sequence capacity has been extensively studied using lattice models and theory, numerical estimates for real protein structures are currently lacking. In this work, we have quantitatively estimated the sequence capacity of 10 proteins with a variety of different structures using a statistical model based on residue-residue co-evolution to capture the variation of sequences from the same protein family. Remarkably, we find that even for the smallest protein folds, such as the WW domain, the number of foldable sequences is extremely large, exceeding the Avogadro constant. In agreement with earlier theoretical work, the calculated sequence capacity is positively correlated with the size of the protein, or better, the density of contacts. This allows the absolute sequence capacity of a given protein to be approximately predicted from its structure. On the other hand, the relative sequence capacity, i.e., normalized by the total number of possible sequences, is an extremely tiny number and is strongly anti-correlated with the protein length. Thus, although there may be more foldable sequences for larger proteins, it will be much harder to find them. Lastly, we have correlated the evolutionary age of proteins in the CATH database with their sequence capacity as predicted by our model. The results suggest a trade-off between the opposing requirements of high designability and the likelihood of a novel fold emerging by chance. Published by Elsevier Inc.
Mallik, Saurav; Kundu, Sudip
2017-07-01
Is the order in which biomolecular subunits self-assemble into functional macromolecular complexes imprinted in their sequence-space? Here, we demonstrate that the temporal order of macromolecular complex self-assembly can be efficiently captured using the landscape of residue-level coevolutionary constraints. This predictive power of coevolutionary constraints is irrespective of the structural, functional, and phylogenetic classification of the complex and of the stoichiometry and quaternary arrangement of the constituent monomers. Combining this result with a number of structural attributes estimated from the crystal structure data, we find indications that stronger coevolutionary constraints at interfaces formed early in the assembly hierarchy probably promotes coordinated fixation of mutations that leads to high-affinity binding with higher surface area, increased surface complementarity and elevated number of molecular contacts, compared to those that form late in the assembly. Proteins 2017; 85:1183-1189. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
Nanoscale Bio-engineering Solutions for Space Exploration: The Nanopore Sequencer
NASA Technical Reports Server (NTRS)
Stolc, Viktor; Cozmuta, Ioana
2004-01-01
Characterization of biological systems at the molecular level and extraction of essential information for nano-engineering design to guide the nano-fabrication of solid-state sensors and molecular identification devices is a computational challenge. The alpha hemolysin protein ion channel is used as a model system for structural analysis of nucleic acids like DNA. Applied voltage draws a DNA strand and surrounding ionic solution through the biological nanopore. The subunits in the DNA strand block ion flow by differing amounts. Atomistic scale simulations are employed using NASA supercomputers to study DNA translocation, with the aim to enhance single DNA subunit identification. Compared to protein channels, solid-state nanopores offer a better temporal control of the translocation of DNA and the possibility to easily tune its chemistry to increase the signal resolution. Potential applications for NASA missions, besides real-time genome sequencing include astronaut health, life detection and decoding of various genomes.
Nanoscale Bioengineering Solutions for Space Exploration the Nanopore Sequencer
NASA Technical Reports Server (NTRS)
Ioana, Cozmuta; Viktor, Stoic
2005-01-01
Characterization of biological systems at the molecular level and extraction of essential information for nano-engineering design to guide the nano-fabrication of solid-state sensors and molecular identification devices is a computational challenge. The alpha hemolysin protein ion channel is used as a model system for structural analysis of nucleic acids like DNA. Applied voltage draws a DNA strand and surrounding ionic solution through the biological nanopore. The subunits in the DNA strand block ion flow by differing amounts. Atomistic scale simulations are employed using NASA supercomputers to study DNA translocation. with the aim to enhance single DNA subunit identification. Compared to protein channels, solid-state nanopores offer a better temporal control of the translocation of DNA and the possibility to easily tune its chemistry to increase the signal resolution. Potential applications for NASA missions, besides real-time genome sequencing include astronaut health, life detection and decoding of various genomes. http://phenomrph.arc.nasa.gov/index.php
Mature DIABLO/Smac Is Produced by the IMP Protease Complex on the Mitochondrial Inner Membrane
Burri, Lena; Strahm, Yvan; Hawkins, Christine J.; Gentle, Ian E.; Puryer, Michelle A.; Verhagen, Anne; Callus, Bernard; Vaux, David; Lithgow, Trevor
2005-01-01
DIABLO/Smac is a mitochondrial protein that can promote apoptosis by promoting the release and activation of caspases. To do so, DIABLO/Smac must first be processed by a mitochondrial protease and then released into the cytosol, and we show this in an intact cellular system. We propose that the precursor form of DIABLO/Smac enters the mitochondria through a stop-transfer pathway and is processed to its active form by the inner membrane peptidase (IMP) complex. Catalytic subunits of the mammalian IMP complex were identified based on sequence conservation and functional complementation, and the novel sequence motif RX5P in Imp1 and NX5S in Imp2 distinguish the two catalytic subunits. DIABLO/Smac is one of only a few specific proteins identified as substrates for the IMP complex in the mitochondrial intermembrane space. PMID:15814844
Jaffrey, S R; Haile, D J; Klausner, R D; Harford, J B
1993-09-25
To assess the influence of RNA sequence/structure on the interaction RNAs with the iron-responsive element binding protein (IRE-BP), twenty eight altered RNAs were tested as competitors for an RNA corresponding to the ferritin H chain IRE. All changes in the loop of the predicted IRE hairpin and in the unpaired cytosine residue characteristically found in IRE stems significantly decreased the apparent affinity of the RNA for the IRE-BP. Similarly, alteration in the spacing and/or orientation of the loop and the unpaired cytosine of the stem by either increasing or decreasing the number of base pairs separating them significantly reduced efficacy as a competitor. It is inferred that the IRE-BP forms multiple contacts with its cognate RNA, and that these contacts, acting in concert, provide the basis for the high affinity of this interaction.
Kim, K H; Hemenway, C
1997-05-26
The putative subgenomic RNA (sgRNA) promoter regions upstream of the potato virus X (PVX) triple block and coat protein (CP) genes contain sequences common to other potexviruses. The importance of these sequences to PVX sgRNA accumulation was determined by inoculation of Nicotiana tabacum NT1 cell suspension protoplasts with transcripts derived from wild-type and modified PVX cDNA clones. Analyses of RNA accumulation by S1 nuclease digestion and primer extension indicated that a conserved octanucleotide sequence element and the spacing between this element and the start-site for sgRNA synthesis are critical for accumulation of the two major sgRNA species. The impact of mutations on CP sgRNA levels was also reflected in the accumulation of CP. In contrast, genomic minus- and plus-strand RNA accumulation were not significantly affected by mutations in these regions. Studies involving inoculation of tobacco plants with the modified transcripts suggested that the conserved octanucleotide element functions in sgRNA accumulation and some other aspect of the infection process.
Helix-packing motifs in membrane proteins.
Walters, R F S; DeGrado, W F
2006-09-12
The fold of a helical membrane protein is largely determined by interactions between membrane-imbedded helices. To elucidate recurring helix-helix interaction motifs, we dissected the crystallographic structures of membrane proteins into a library of interacting helical pairs. The pairs were clustered according to their three-dimensional similarity (rmsd =1.5 A), allowing 90% of the library to be assigned to clusters consisting of at least five members. Surprisingly, three quarters of the helical pairs belong to one of five tightly clustered motifs whose structural features can be understood in terms of simple principles of helix-helix packing. Thus, the universe of common transmembrane helix-pairing motifs is relatively simple. The largest cluster, which comprises 29% of the library members, consists of an antiparallel motif with left-handed packing angles, and it is frequently stabilized by packing of small side chains occurring every seven residues in the sequence. Right-handed parallel and antiparallel structures show a similar tendency to segregate small residues to the helix-helix interface but spaced at four-residue intervals. Position-specific sequence propensities were derived for the most populated motifs. These structural and sequential motifs should be quite useful for the design and structural prediction of membrane proteins.
Li, Zhan-Chao; Zhou, Xi-Bin; Dai, Zong; Zou, Xiao-Yong
2009-07-01
A prior knowledge of protein structural classes can provide useful information about its overall structure, so it is very important for quick and accurate determination of protein structural class with computation method in protein science. One of the key for computation method is accurate protein sample representation. Here, based on the concept of Chou's pseudo-amino acid composition (AAC, Chou, Proteins: structure, function, and genetics, 43:246-255, 2001), a novel method of feature extraction that combined continuous wavelet transform (CWT) with principal component analysis (PCA) was introduced for the prediction of protein structural classes. Firstly, the digital signal was obtained by mapping each amino acid according to various physicochemical properties. Secondly, CWT was utilized to extract new feature vector based on wavelet power spectrum (WPS), which contains more abundant information of sequence order in frequency domain and time domain, and PCA was then used to reorganize the feature vector to decrease information redundancy and computational complexity. Finally, a pseudo-amino acid composition feature vector was further formed to represent primary sequence by coupling AAC vector with a set of new feature vector of WPS in an orthogonal space by PCA. As a showcase, the rigorous jackknife cross-validation test was performed on the working datasets. The results indicated that prediction quality has been improved, and the current approach of protein representation may serve as a useful complementary vehicle in classifying other attributes of proteins, such as enzyme family class, subcellular localization, membrane protein types and protein secondary structure, etc.
Gaur, Vineet; Sethi, Dhruv K; Salunke, Dinakar M
2008-01-01
Food allergies appear to be one of the foremost causes of hypersensitivity reactions. Nut allergies account for most food allergies and are often permanent. The 360 kDa hexameric protein Pru du amandin, a known allergen, was purified from almonds (Prunus dulcis) by ammonium sulfate fractionation and ion-exchange chromatography. The protein was identified by a BLAST homology search against the nonredundant sequence database. Pru du amandin belongs to the 11S legumin family of seed storage proteins characterized by the presence of a cupin motif. Crystals were obtained by the hanging-drop vapour-diffusion method. The crystals belong to space group P4(1) (or P4(3)), with unit-cell parameters a = b = 150.7, c = 164.9 A.
Gaur, Vineet; Sethi, Dhruv K.; Salunke, Dinakar M.
2008-01-01
Food allergies appear to be one of the foremost causes of hypersensitivity reactions. Nut allergies account for most food allergies and are often permanent. The 360 kDa hexameric protein Pru du amandin, a known allergen, was purified from almonds (Prunus dulcis) by ammonium sulfate fractionation and ion-exchange chromatography. The protein was identified by a BLAST homology search against the nonredundant sequence database. Pru du amandin belongs to the 11S legumin family of seed storage proteins characterized by the presence of a cupin motif. Crystals were obtained by the hanging-drop vapour-diffusion method. The crystals belong to space group P41 (or P43), with unit-cell parameters a = b = 150.7, c = 164.9 Å. PMID:18097098
Huang, Yu-An; You, Zhu-Hong; Chen, Xing
2018-01-01
Drug-Target Interactions (DTI) play a crucial role in discovering new drug candidates and finding new proteins to target for drug development. Although the number of detected DTI obtained by high-throughput techniques has been increasing, the number of known DTI is still limited. On the other hand, the experimental methods for detecting the interactions among drugs and proteins are costly and inefficient. Therefore, computational approaches for predicting DTI are drawing increasing attention in recent years. In this paper, we report a novel computational model for predicting the DTI using extremely randomized trees model and protein amino acids information. More specifically, the protein sequence is represented as a Pseudo Substitution Matrix Representation (Pseudo-SMR) descriptor in which the influence of biological evolutionary information is retained. For the representation of drug molecules, a novel fingerprint feature vector is utilized to describe its substructure information. Then the DTI pair is characterized by concatenating the two vector spaces of protein sequence and drug substructure. Finally, the proposed method is explored for predicting the DTI on four benchmark datasets: Enzyme, Ion Channel, GPCRs and Nuclear Receptor. The experimental results demonstrate that this method achieves promising prediction accuracies of 89.85%, 87.87%, 82.99% and 81.67%, respectively. For further evaluation, we compared the performance of Extremely Randomized Trees model with that of the state-of-the-art Support Vector Machine classifier. And we also compared the proposed model with existing computational models, and confirmed 15 potential drug-target interactions by looking for existing databases. The experiment results show that the proposed method is feasible and promising for predicting drug-target interactions for new drug candidate screening based on sizeable features. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Akter, Mahfuza; Inoue, Chika; Komori, Hirofumi; Matsuda, Nana; Sakurai, Takeshi; Kataoka, Kunishige; Higuchi, Yoshiki; Shibata, Naoki
2016-10-01
Multicopper oxidases oxidize various phenolic and nonphenolic compounds by using molecular oxygen as an electron acceptor to produce water. A multicopper oxidase protein, CueO, from Escherichia coli is involved in copper homeostasis in the bacterial cell. Although X-ray crystallographic studies have been conducted, the reduction mechanism of oxygen and the proton-transfer pathway remain unclear owing to the difficulty in identifying H atoms from X-ray diffraction data alone. To elucidate the reaction mechanism using neutron crystallography, a preparation system for obtaining large, high-quality single crystals of deuterated CueO was developed. Tiny crystals were obtained from the deuterated CueO initially prepared from the original construct. The X-ray crystal structure of the deuterated CueO showed that the protein contained an incompletely truncated signal sequence at the N-terminus, which resulted in the heterogeneity of the protein sample for crystallization. Here, a new CueO expression system that had an HRV3C cleavage site just after the signal sequence was constructed. Deuterated CueO from the new construct was expressed in cells cultured in deuterated algae-extract medium and the signal sequence was completely eliminated by HRV3C protease. The deuteration level of the purified protein was estimated by MALDI-TOF mass spectrometry to be at least 83.2% compared with nondeuterated protein. Nondeuterated CueO crystallized in space group P2 1 , with unit-cell parameters a = 49.51, b = 88.79, c = 53.95 Å, β = 94.24°, and deuterated CueO crystallized in space group P2 1 2 1 2 1 , with unit-cell parameters a = 49.91, b = 106.92, c = 262.89 Å. The crystallographic parameters for the crystals of the new construct were different from those previously reported for nondeuterated crystals. The nondeuterated and deuterated CueO from the new construct had similar UV-Vis spectra, enzymatic activities and overall structure and geometry of the ligands of the Cu atoms in the active site to those of previously reported CueO structures. These results indicate that the CueO protein prepared using the new construct is suitable for further neutron diffraction studies.
Analysis of sequence repeats of proteins in the PDB.
Mary Rajathei, David; Selvaraj, Samuel
2013-12-01
Internal repeats in protein sequences play a significant role in the evolution of protein structure and function. Applications of different bioinformatics tools help in the identification and characterization of these repeats. In the present study, we analyzed sequence repeats in a non-redundant set of proteins available in the Protein Data Bank (PDB). We used RADAR for detecting internal repeats in a protein, PDBeFOLD for assessing structural similarity, PDBsum for finding functional involvement and Pfam for domain assignment of the repeats in a protein. Through the analysis of sequence repeats, we found that identity of the sequence repeats falls in the range of 20-40% and, the superimposed structures of the most of the sequence repeats maintain similar overall folding. Analysis sequence repeats at the functional level reveals that most of the sequence repeats are involved in the function of the protein through functionally involved residues in the repeat regions. We also found that sequence repeats in single and two domain proteins often contained conserved sequence motifs for the function of the domain. Copyright © 2013 Elsevier Ltd. All rights reserved.
Virador, Victoria M; Reyes Grajeda, Juan P; Blanco-Labra, Alejandro; Mendiola-Olaya, Elizabeth; Smith, Gary M; Moreno, Abel; Whitaker, John R
2010-01-27
The full-length cDNA sequence (P93622_VITVI) of polyphenol oxidase (PPO) cDNA from grape Vitis vinifera L., cv Grenache, was found to encode a translated protein of 607 amino acids with an expected molecular weight of ca. 67 kDa and a predicted pI of 6.83. The translated amino acid sequence was 99%, identical to that of a white grape berry PPO (1) (5 out of 607 amino acid potential sequence differences). The protein was purified from Grenache grape berries by using traditional methods, and it was crystallized with ammonium acetate by the hanging-drop vapor diffusion method. The crystals were orthorhombic, space group C222(1). The structure was obtained at 2.2 A resolution using synchrotron radiation using the 39 kDa isozyme of sweet potato PPO (PDB code: 1BT1 ) as a phase donor. The basic symmetry of the cell parameters (a, b, and c and alpha, beta, and gamma) as well as in the number of asymmetric units in the unit cell of the crystals of PPO, differed between the two proteins. The structures of the two enzymes are quite similar in overall fold, the location of the helix bundles at the core, and the active site in which three histidines bind each of the two catalytic copper ions, and one of the histidines is engaged in a thioether linkage with a cysteine residue. The possibility that the formation of the Cys-His thioether linkage constitutes the activation step is proposed. No evidence of phosphorylation or glycoslyation was found in the electron density map. The mass of the crystallized protein appears to be only 38.4 kDa, and the processing that occurs in the grape berry that leads to this smaller size is discussed.
Modular protein domains: an engineering approach toward functional biomaterials.
Lin, Charng-Yu; Liu, Julie C
2016-08-01
Protein domains and peptide sequences are a powerful tool for conferring specific functions to engineered biomaterials. Protein sequences with a wide variety of functionalities, including structure, bioactivity, protein-protein interactions, and stimuli responsiveness, have been identified, and advances in molecular biology continue to pinpoint new sequences. Protein domains can be combined to make recombinant proteins with multiple functionalities. The high fidelity of the protein translation machinery results in exquisite control over the sequence of recombinant proteins and the resulting properties of protein-based materials. In this review, we discuss protein domains and peptide sequences in the context of functional protein-based materials, composite materials, and their biological applications. Copyright © 2016 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Noirel, Josselin; Simonson, Thomas
2008-11-01
Following Kimura's neutral theory of molecular evolution [M. Kimura, The Neutral Theory of Molecular Evolution (Cambridge University Press, Cambridge, 1983) (reprinted in 1986)], it has become common to assume that the vast majority of viable mutations of a gene confer little or no functional advantage. Yet, in silico models of protein evolution have shown that mutational robustness of sequences could be selected for, even in the context of neutral evolution. The evolution of a biological population can be seen as a diffusion on the network of viable sequences. This network is called a "neutral network." Depending on the mutation rate μ and the population size N, the biological population can evolve purely randomly (μN ≪1) or it can evolve in such a way as to select for sequences of higher mutational robustness (μN ≫1). The stringency of the selection depends not only on the product μN but also on the exact topology of the neutral network, the special arrangement of which was named "superfunnel." Even though the relation between mutation rate, population size, and selection was thoroughly investigated, a study of the salient topological features of the superfunnel that could affect the strength of the selection was wanting. This question is addressed in this study. We use two different models of proteins: on lattice and off lattice. We compare neutral networks computed using these models to random networks. From this, we identify two important factors of the topology that determine the stringency of the selection for mutationally robust sequences. First, the presence of highly connected nodes ("hubs") in the network increases the selection for mutationally robust sequences. Second, the stringency of the selection increases when the correlation between a sequence's mutational robustness and its neighbors' increases. The latter finding relates a global characteristic of the neutral network to a local one, which is attainable through experiments or molecular modeling.
Noirel, Josselin; Simonson, Thomas
2008-11-14
Following Kimura's neutral theory of molecular evolution [M. Kimura, The Neutral Theory of Molecular Evolution (Cambridge University Press, Cambridge, 1983) (reprinted in 1986)], it has become common to assume that the vast majority of viable mutations of a gene confer little or no functional advantage. Yet, in silico models of protein evolution have shown that mutational robustness of sequences could be selected for, even in the context of neutral evolution. The evolution of a biological population can be seen as a diffusion on the network of viable sequences. This network is called a "neutral network." Depending on the mutation rate mu and the population size N, the biological population can evolve purely randomly (muN<1) or it can evolve in such a way as to select for sequences of higher mutational robustness (muN>1). The stringency of the selection depends not only on the product muN but also on the exact topology of the neutral network, the special arrangement of which was named "superfunnel." Even though the relation between mutation rate, population size, and selection was thoroughly investigated, a study of the salient topological features of the superfunnel that could affect the strength of the selection was wanting. This question is addressed in this study. We use two different models of proteins: on lattice and off lattice. We compare neutral networks computed using these models to random networks. From this, we identify two important factors of the topology that determine the stringency of the selection for mutationally robust sequences. First, the presence of highly connected nodes ("hubs") in the network increases the selection for mutationally robust sequences. Second, the stringency of the selection increases when the correlation between a sequence's mutational robustness and its neighbors' increases. The latter finding relates a global characteristic of the neutral network to a local one, which is attainable through experiments or molecular modeling.
Structurally complex and highly active RNA ligases derived from random RNA sequences
NASA Technical Reports Server (NTRS)
Ekland, E. H.; Szostak, J. W.; Bartel, D. P.
1995-01-01
Seven families of RNA ligases, previously isolated from random RNA sequences, fall into three classes on the basis of secondary structure and regiospecificity of ligation. Two of the three classes of ribozymes have been engineered to act as true enzymes, catalyzing the multiple-turnover transformation of substrates into products. The most complex of these ribozymes has a minimal catalytic domain of 93 nucleotides. An optimized version of this ribozyme has a kcat exceeding one per second, a value far greater than that of most natural RNA catalysts and approaching that of comparable protein enzymes. The fact that such a large and complex ligase emerged from a very limited sampling of sequence space implies the existence of a large number of distinct RNA structures of equivalent complexity and activity.
Structure based re-design of the binding specificity of anti-apoptotic Bcl-xL
Chen, T. Scott; Palacios, Hector; Keating, Amy E.
2012-01-01
Many native proteins are multi-specific and interact with numerous partners, which can confound analysis of their functions. Protein design provides a potential route to generating synthetic variants of native proteins with more selective binding profiles. Re-designed proteins could be used as research tools, diagnostics or therapeutics. In this work, we used a library screening approach to re-engineer the multi-specific anti-apoptotic protein Bcl-xL to remove its interactions with many of its binding partners, making it a high affinity and selective binder of the BH3 region of pro-apoptotic protein Bad. To overcome the enormity of the potential Bcl-xL sequence space, we developed and applied a computational/experimental framework that used protein structure information to generate focused combinatorial libraries. Sequence features were identified using structure-based modeling, and an optimization algorithm based on integer programming was used to select degenerate codons that maximally covered these features. A constraint on library size was used to ensure thorough sampling. Using yeast surface display to screen a designed library of Bcl-xL variants, we successfully identified a protein with ~1,000-fold improvement in binding specificity for the BH3 region of Bad over the BH3 region of Bim. Although negative design was targeted only against the BH3 region of Bim, the best re-designed protein was globally specific against binding to 10 other peptides corresponding to native BH3 motifs. Our design framework demonstrates an efficient route to highly specific protein binders and may readily be adapted for application to other design problems. PMID:23154169
Bandeira, Nuno; Clauser, Karl R; Pevzner, Pavel A
2007-07-01
Despite significant advances in the identification of known proteins, the analysis of unknown proteins by MS/MS still remains a challenging open problem. Although Klaus Biemann recognized the potential of MS/MS for sequencing of unknown proteins in the 1980s, low throughput Edman degradation followed by cloning still remains the main method to sequence unknown proteins. The automated interpretation of MS/MS spectra has been limited by a focus on individual spectra and has not capitalized on the information contained in spectra of overlapping peptides. Indeed the powerful shotgun DNA sequencing strategies have not been extended to automated protein sequencing. We demonstrate, for the first time, the feasibility of automated shotgun protein sequencing of protein mixtures by utilizing MS/MS spectra of overlapping and possibly modified peptides generated via multiple proteases of different specificities. We validate this approach by generating highly accurate de novo reconstructions of multiple regions of various proteins in western diamondback rattlesnake venom. We further argue that shotgun protein sequencing has the potential to overcome the limitations of current protein sequencing approaches and thus catalyze the otherwise impractical applications of proteomics methodologies in studies of unknown proteins.
Wan, Shixiang; Zou, Quan
2017-01-01
Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. Extreme increase in next-generation sequencing results in shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses. Based on HAlign and Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. The experiments in the DNA and protein large scale data sets, which are more than 1GB files, showed that HAlign II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resource. THAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign.
Gene Unprediction with Spurio: A tool to identify spurious protein sequences.
Höps, Wolfram; Jeffryes, Matt; Bateman, Alex
2018-01-01
We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation. Despite the increasing accuracy of gene prediction tools there likely exists a large number of spurious protein predictions in the sequence databases. We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes. Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code is available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.
A proteomic analysis of leaf sheaths from rice.
Shen, Shihua; Matsubae, Masami; Takao, Toshifumi; Tanaka, Naoki; Komatsu, Setsuko
2002-10-01
The proteins extracted from the leaf sheaths of rice seedlings were separated by 2-D PAGE, and analyzed by Edman sequencing and mass spectrometry, followed by database searching. Image analysis revealed 352 protein spots on 2-D PAGE after staining with Coomassie Brilliant Blue. The amino acid sequences of 44 of 84 proteins were determined; for 31 of these proteins, a clear function could be assigned, whereas for 12 proteins, no function could be assigned. Forty proteins did not yield amino acid sequence information, because they were N-terminally blocked, or the obtained sequences were too short and/or did not give unambiguous results. Fifty-nine proteins were analyzed by mass spectrometry; all of these proteins were identified by matching to the protein database. The amino acid sequences of 19 of 27 proteins analyzed by mass spectrometry were similar to the results of Edman sequencing. These results suggest that 2-D PAGE combined with Edman sequencing and mass spectrometry analysis can be effectively used to identify plant proteins.
Self-Organizing Hidden Markov Model Map (SOHMMM).
Ferles, Christos; Stafylopatis, Andreas
2013-12-01
A hybrid approach combining the Self-Organizing Map (SOM) and the Hidden Markov Model (HMM) is presented. The Self-Organizing Hidden Markov Model Map (SOHMMM) establishes a cross-section between the theoretic foundations and algorithmic realizations of its constituents. The respective architectures and learning methodologies are fused in an attempt to meet the increasing requirements imposed by the properties of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein chain molecules. The fusion and synergy of the SOM unsupervised training and the HMM dynamic programming algorithms bring forth a novel on-line gradient descent unsupervised learning algorithm, which is fully integrated into the SOHMMM. Since the SOHMMM carries out probabilistic sequence analysis with little or no prior knowledge, it can have a variety of applications in clustering, dimensionality reduction and visualization of large-scale sequence spaces, and also, in sequence discrimination, search and classification. Two series of experiments based on artificial sequence data and splice junction gene sequences demonstrate the SOHMMM's characteristics and capabilities. Copyright © 2013 Elsevier Ltd. All rights reserved.
Improved annotation through genome-scale metabolic modeling of Aspergillus oryzae
Vongsangnak, Wanwipa; Olsen, Peter; Hansen, Kim; Krogsgaard, Steen; Nielsen, Jens
2008-01-01
Background Since ancient times the filamentous fungus Aspergillus oryzae has been used in the fermentation industry for the production of fermented sauces and the production of industrial enzymes. Recently, the genome sequence of A. oryzae with 12,074 annotated genes was released but the number of hypothetical proteins accounted for more than 50% of the annotated genes. Considering the industrial importance of this fungus, it is therefore valuable to improve the annotation and further integrate genomic information with biochemical and physiological information available for this microorganism and other related fungi. Here we proposed the gene prediction by construction of an A. oryzae Expressed Sequence Tag (EST) library, sequencing and assembly. We enhanced the function assignment by our developed annotation strategy. The resulting better annotation was used to reconstruct the metabolic network leading to a genome scale metabolic model of A. oryzae. Results Our assembled EST sequences we identified 1,046 newly predicted genes in the A. oryzae genome. Furthermore, it was possible to assign putative protein functions to 398 of the newly predicted genes. Noteworthy, our annotation strategy resulted in assignment of new putative functions to 1,469 hypothetical proteins already present in the A. oryzae genome database. Using the substantially improved annotated genome we reconstructed the metabolic network of A. oryzae. This network contains 729 enzymes, 1,314 enzyme-encoding genes, 1,073 metabolites and 1,846 (1,053 unique) biochemical reactions. The metabolic reactions are compartmentalized into the cytosol, the mitochondria, the peroxisome and the extracellular space. Transport steps between the compartments and the extracellular space represent 281 reactions, of which 161 are unique. The metabolic model was validated and shown to correctly describe the phenotypic behavior of A. oryzae grown on different carbon sources. Conclusion A much enhanced annotation of the A. oryzae genome was performed and a genome-scale metabolic model of A. oryzae was reconstructed. The model accurately predicted the growth and biomass yield on different carbon sources. The model serves as an important resource for gaining further insight into our understanding of A. oryzae physiology. PMID:18500999
Deciphering functional glycosaminoglycan motifs in development.
Townley, Robert A; Bülow, Hannes E
2018-03-23
Glycosaminoglycans (GAGs) such as heparan sulfate, chondroitin/dermatan sulfate, and keratan sulfate are linear glycans, which when attached to protein backbones form proteoglycans. GAGs are essential components of the extracellular space in metazoans. Extensive modifications of the glycans such as sulfation, deacetylation and epimerization create structural GAG motifs. These motifs regulate protein-protein interactions and are thereby repsonsible for many of the essential functions of GAGs. This review focusses on recent genetic approaches to characterize GAG motifs and their function in defined signaling pathways during development. We discuss a coding approach for GAGs that would enable computational analyses of GAG sequences such as alignments and the computation of position weight matrices to describe GAG motifs. Copyright © 2018 Elsevier Ltd. All rights reserved.
Linke, Christian; Siemens, Nikolai; Middleditch, Martin J; Kreikemeyer, Bernd; Baker, Edward N
2012-07-01
The extracellular protein Epf from Streptococcus pyogenes is important for streptococcal adhesion to human epithelial cells. However, Epf has no sequence identity to any protein of known structure or function. Thus, several predicted domains of the 205 kDa protein Epf were cloned separately and expressed in Escherichia coli. The N-terminal domain of Epf was crystallized in space groups P2(1) and P2(1)2(1)2(1) in the presence of the protease chymotrypsin. Mass spectrometry showed that the species crystallized corresponded to a fragment comprising residues 52-357 of Epf. Complete data sets were collected to 2.0 and 1.6 Å resolution, respectively, at the Australian Synchrotron.
Arnold, Roland; Goldenberg, Florian; Mewes, Hans-Werner; Rattei, Thomas
2014-01-01
The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith–Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating the interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow connecting SIMAP with any other bioinformatic tool and resource. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to cover the rapidly growing sequenced protein sequence universe. New Web-Service interfaces enhance the connectivity of SIMAP. Novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal; the portal also contains instructions and links for software access and flat file downloads. PMID:24165881
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gaur, Vineet; Sethi, Dhruv K.; Salunke, Dinakar M., E-mail: dinakar@nii.res.in
The purification, identification, crystallization and preliminary crystallographic studies of an allergy-related protein, Pru du amandin, from P. dulcis nuts are reported. Food allergies appear to be one of the foremost causes of hypersensitivity reactions. Nut allergies account for most food allergies and are often permanent. The 360 kDa hexameric protein Pru du amandin, a known allergen, was purified from almonds (Prunus dulcis) by ammonium sulfate fractionation and ion-exchange chromatography. The protein was identified by a BLAST homology search against the nonredundant sequence database. Pru du amandin belongs to the 11S legumin family of seed storage proteins characterized by the presencemore » of a cupin motif. Crystals were obtained by the hanging-drop vapour-diffusion method. The crystals belong to space group P4{sub 1} (or P4{sub 3}), with unit-cell parameters a = b = 150.7, c = 164.9 Å.« less
Signal-3L: A 3-layer approach for predicting signal peptides.
Shen, Hong-Bin; Chou, Kuo-Chen
2007-11-16
Functioning as an "address tag" that directs nascent proteins to their proper cellular and extracellular locations, signal peptides have become a crucial tool in finding new drugs or reprogramming cells for gene therapy. To effectively and timely use such a tool, however, the first important thing is to develop an automated method for rapidly and accurately identifying the signal peptide for a given nascent protein. With the avalanche of new protein sequences generated in the post-genomic era, the challenge has become even more urgent and critical. In this paper, we have developed a novel method for predicting signal peptide sequences and their cleavage sites in human, plant, animal, eukaryotic, Gram-positive, and Gram-negative protein sequences, respectively. The new predictor is called Signal-3L that consists of three prediction engines working, respectively, for the following three progressively deepening layers: (1) identifying a query protein as secretory or non-secretory by an ensemble classifier formed by fusing many individual OET-KNN (optimized evidence-theoretic K nearest neighbor) classifiers operated in various dimensions of PseAA (pseudo amino acid) composition spaces; (2) selecting a set of candidates for the possible signal peptide cleavage sites of a query secretory protein by a subsite-coupled discrimination algorithm; (3) determining the final cleavage site by fusing the global sequence alignment outcome for each of the aforementioned candidates through a voting system. Signal-3L is featured by high success prediction rates with short computational time, and hence is particularly useful for the analysis of large-scale datasets. Signal-3L is freely available as a web-server at http://chou.med.harvard.edu/bioinf/Signal-3L/ or http://202.120.37.186/bioinf/Signal-3L, where, to further support the demand of the related areas, the signal peptides identified by Signal-3L for all the protein entries in Swiss-Prot databank that do not have signal peptide annotations or are annotated with uncertain terms but are classified by Signal-3L as secretory proteins are provided in a downloadable file. The large-scale file is prepared with Microsoft Excel and named "Tab-Signal-3L.xls", and will be updated once a year to include new protein entries and reflect the continuous development of Signal-3L.
Single proteins that serve linked functions in intracellular and extracellular microenvironments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Radisky, Derek C.; Stallings-Mann, Melody; Hirai, Yohei
2009-06-03
Maintenance of organ homeostasis and control of appropriate response to environmental alterations requires intimate coordination of cellular function and tissue organization. An important component of this coordination may be provided by proteins that can serve distinct, but linked, functions on both sides of the plasma membrane. Here we present a novel hypothesis in which non-classical secretion can provide a mechanism through which single proteins can integrate complex tissue functions. Single genes can exert a complex, dynamic influence through a number of different processes that act to multiply the function of the gene product(s). Alternative splicing can create many different transcriptsmore » that encode proteins of diverse, even antagonistic, function from a single gene. Posttranslational modifications can alter the stability, activity, localization, and even basic function of proteins. A protein can exist in different subcellular localizations. More recently, it has become clear that single proteins can function both inside and outside the cell. These proteins often lack defined secretory signal sequences, and transit the plasma membrane by mechanisms separate from the classical ER/Golgi secretory process. When examples of such proteins are examined individually, the multifunctionality and lack of a signal sequence are puzzling - why should a protein with a well known function in one context function in such a distinct fashion in another? We propose that one reason for a single protein to perform intracellular and extracellular roles is to coordinate organization and maintenance of a global tissue function. Here, we describe in detail three specific examples of proteins that act in this fashion, outlining their specific functions in the extracellular space and in the intracellular space, and we discuss how these functions may be linked. We present epimorphin/syntaxin-2, which may coordinate morphogenesis of secretory organs (as epimorphin) with control of protein secretion (as syntaxin-2), amphoterin/high mobility group box-1 (HMGB1), which may link inflammation (as amphoterin) with regulation of gene expression (as HMGB1), and tissue transglutaminase, which affects delivery of and response to apoptotic signals by serving a related function on both sides of the plasma membrane. As it is notable that all three of these proteins have been reported to transit the plasma membrane through non-classical secretory mechanisms, we will also discuss why coordinated inside/outside functions may be found in some examples of proteins which transit the plasma membrane through non-classical mechanisms and how this relationship can be used to identify additional proteins that share these characteristics.« less
G23D: Online tool for mapping and visualization of genomic variants on 3D protein structures.
Solomon, Oz; Kunik, Vered; Simon, Amos; Kol, Nitzan; Barel, Ortal; Lev, Atar; Amariglio, Ninette; Somech, Raz; Rechavi, Gidi; Eyal, Eran
2016-08-26
Evaluation of the possible implications of genomic variants is an increasingly important task in the current high throughput sequencing era. Structural information however is still not routinely exploited during this evaluation process. The main reasons can be attributed to the partial structural coverage of the human proteome and the lack of tools which conveniently convert genomic positions, which are the frequent output of genomic pipelines, to proteins and structure coordinates. We present G23D, a tool for conversion of human genomic coordinates to protein coordinates and protein structures. G23D allows mapping of genomic positions/variants on evolutionary related (and not only identical) protein three dimensional (3D) structures as well as on theoretical models. By doing so it significantly extends the space of variants for which structural insight is feasible. To facilitate interpretation of the variant consequence, pathogenic variants, functional sites and polymorphism sites are displayed on protein sequence and structure diagrams alongside the input variants. G23D also provides modeling of the mutant structure, analysis of intra-protein contacts and instant access to functional predictions and predictions of thermo-stability changes. G23D is available at http://www.sheba-cancer.org.il/G23D . G23D extends the fraction of variants for which structural analysis is applicable and provides better and faster accessibility for structural data to biologists and geneticists who routinely work with genomic information.
Higashibata, Akira; Hashizume, Toko; Nemoto, Kanako; Higashitani, Nahoko; Etheridge, Timothy; Mori, Chihiro; Harada, Shunsuke; Sugimoto, Tomoko; Szewczyk, Nathaniel J; Baba, Shoji A; Mogami, Yoshihiro; Fukui, Keiji; Higashitani, Atsushi
2016-01-01
Although muscle atrophy is a serious problem during spaceflight, little is known about the sequence of molecular events leading to atrophy in response to microgravity. We carried out a spaceflight experiment using Caenorhabditis elegans onboard the Japanese Experiment Module of the International Space Station. Worms were synchronously cultured in liquid media with bacterial food for 4 days under microgravity or on a 1-G centrifuge. Worms were visually observed for health and movement and then frozen. Upon return, we analyzed global gene and protein expression using DNA microarrays and mass spectrometry. Body length and fat accumulation were also analyzed. We found that in worms grown from the L1 larval stage to adulthood under microgravity, both gene and protein expression levels for muscular thick filaments, cytoskeletal elements, and mitochondrial metabolic enzymes decreased relative to parallel cultures on the 1-G centrifuge (95% confidence interval (P⩽0.05)). In addition, altered movement and decreased body length and fat accumulation were observed in the microgravity-cultured worms relative to the 1-G cultured worms. These results suggest protein expression changes that may account for the progressive muscular atrophy observed in astronauts. PMID:28725720
Sequence-based prediction of protein-binding sites in DNA: comparative study of two SVM models.
Park, Byungkyu; Im, Jinyong; Tuvshinjargal, Narankhuu; Lee, Wook; Han, Kyungsook
2014-11-01
As many structures of protein-DNA complexes have been known in the past years, several computational methods have been developed to predict DNA-binding sites in proteins. However, its inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One of the reasons is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another reason is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein-DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. One SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the other SVM model predicts protein-binding nucleotides using both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the SVM model that uses DNA sequence data only predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. With an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure 66.3% and a MCC of 0.324. Another SVM model that uses both DNA and protein sequences achieved an accuracy of 69.6%, an F-measure of 69.6%, and a MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. With an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5% and a MCC of 0.329. Both in cross-validation and independent testing, the second SVM model that used both DNA and protein sequence data showed better performance than the first model that used DNA sequence data. To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Shotgun Protein Sequencing with Meta-contig Assembly*
Guthals, Adrian; Clauser, Karl R.; Bandeira, Nuno
2012-01-01
Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal to noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings. PMID:22798278
Shotgun protein sequencing with meta-contig assembly.
Guthals, Adrian; Clauser, Karl R; Bandeira, Nuno
2012-10-01
Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal to noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings.
Deshmukh, Lalit; Schwieters, Charles D; Grishaev, Alexander; Clore, G Marius
2016-06-03
Nucleic-acid-related events in the HIV-1 replication cycle are mediated by nucleocapsid, a small protein comprising two zinc knuckles connected by a short flexible linker and flanked by disordered termini. Combining experimental NMR residual dipolar couplings, solution X-ray scattering and protein engineering with ensemble simulated annealing, we obtain a quantitative description of the configurational space sampled by the two zinc knuckles, the linker and disordered termini in the absence of nucleic acids. We first compute the conformational ensemble (with an optimal size of three members) of an engineered nucleocapsid construct lacking the N- and C-termini that satisfies the experimental restraints, and then validate this ensemble, as well as characterize the disordered termini, using the experimental data from the full-length nucleocapsid construct. The experimental and computational strategy is generally applicable to multidomain proteins. Differential flexibility within the linker results in asymmetric motion of the zinc knuckles which may explain their functionally distinct roles despite high sequence identity. One of the configurations (populated at a level of ≈40 %) closely resembles that observed in various ligand-bound forms, providing evidence for conformational selection and a mechanistic link between protein dynamics and function. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Tins, B; Cassar-Pullicino, V; Haddaway, M; Nachtrab, U
2012-01-01
Objectives The bulk of spinal imaging is still performed with conventional two-dimensional sequences. This study assesses the suitability of three-dimensional sampling perfection with application-optimised contrasts using a different flip angle evolutions (SPACE) sequence for routine spinal imaging. Methods 62 MRI examinations of the spine were evaluated by 2 examiners in consensus for the depiction of anatomy and presence of artefact. We noted pathologies that might be missed using the SPACE sequence only or the SPACE and a sagittal T1 weighted sequence. The reference standards were sagittal and axial T1 weighted and T2 weighted sequences. At a later date the evaluation was repeated by one of the original examiners and an additional examiner. Results There was good agreement of the single evaluations and consensus evaluation for the conventional sequences: κ>0.8, confidence interval (CI)>0.6–1.0. For the SPACE sequence, depiction of anatomy was very good for 84% of cases, with high interobserver agreement, but there was poor interobserver agreement for other cases. For artefact assessment of SPACE, κ=0.92, CI=0.92–1.0. The SPACE sequence was superior to conventional sequences for depiction of anatomy and artefact resistance. The SPACE sequence occasionally missed bone marrow oedema. In conjunction with sagittal T1 weighted sequences, no abnormality was missed. The isotropic SPACE sequence was superior to conventional sequences in imaging difficult anatomy such as in scoliosis and spondylolysis. Conclusion The SPACE sequence allows excellent assessment of anatomy owing to high spatial resolution and resistance to artefact. The sensitivity for bone marrow abnormalities is limited. PMID:22374284
Tins, B; Cassar-Pullicino, V; Haddaway, M; Nachtrab, U
2012-08-01
The bulk of spinal imaging is still performed with conventional two-dimensional sequences. This study assesses the suitability of three-dimensional sampling perfection with application-optimised contrasts using a different flip angle evolutions (SPACE) sequence for routine spinal imaging. 62 MRI examinations of the spine were evaluated by 2 examiners in consensus for the depiction of anatomy and presence of artefact. We noted pathologies that might be missed using the SPACE sequence only or the SPACE and a sagittal T(1) weighted sequence. The reference standards were sagittal and axial T(1) weighted and T(2) weighted sequences. At a later date the evaluation was repeated by one of the original examiners and an additional examiner. There was good agreement of the single evaluations and consensus evaluation for the conventional sequences: κ>0.8, confidence interval (CI)>0.6-1.0. For the SPACE sequence, depiction of anatomy was very good for 84% of cases, with high interobserver agreement, but there was poor interobserver agreement for other cases. For artefact assessment of SPACE, κ=0.92, CI=0.92-1.0. The SPACE sequence was superior to conventional sequences for depiction of anatomy and artefact resistance. The SPACE sequence occasionally missed bone marrow oedema. In conjunction with sagittal T(1) weighted sequences, no abnormality was missed. The isotropic SPACE sequence was superior to conventional sequences in imaging difficult anatomy such as in scoliosis and spondylolysis. The SPACE sequence allows excellent assessment of anatomy owing to high spatial resolution and resistance to artefact. The sensitivity for bone marrow abnormalities is limited.
NASA Astrophysics Data System (ADS)
Sethaphong, Latsavongsakda
This work examines smart material properties of rational self-assembly and molecular recognition found in nano-biosystems. Exploiting the sequence and structural information encoded within nucleic acids and proteins will permit programmed synthesis of nanomaterials and help create molecular machines that may carry out new roles involving chemical catalysis and bioenergy. Responsive to different ionic environments thru self-reorgnization, nucleic acids (NA) are nature's signature smart material; organisms such as viruses and bacteria use features of NAs to react to their environment and orchestrate their lifecycle. Furthermore, nucleic acid systems (both RNA and DNA) are currently exploited as scaffolds; recent applications have been showcased to build bioelectronics and biotemplated nanostructures via directed assembly of multidimensional nanoelectronic devices 1. Since the most stable and rudimentary structure of nucleic acids is the helical duplex, these were modeled in order to examine the influence of the microenvironment, sequence, and cation-dependent perturbations of their canonical forms. Due to their negatively charged phosphate backbone, NA's rely on counterions to overcome the inherent repulsive forces that arise from the assembly of two complementary strands. As a realistic model system, we chose the HIV-TAR helix (PDB ID: 397D) to study specific sequence motifs on cation sequestration. At physiologically relevant concentrations of sodium and potassium ions, we observed sequence based effects where purine stretches were adept in retaining high residency cations. The transitional space between adenine and guanosine nucleotides (ApG step) in a sequence proved the most favorable. This work was the first to directly show these subtle interactions of sequence based cationic sequestration and may be useful for controlling metallization of nucleic acids in conductive nanowires. Extending the study further, we explored the degree to which the structure of NA duplexes alone interacted with cations distinct from a specific sequence. Under physiologically relevant conditions, a duplex of RNA polyguanine-polycitidine was highly responsive and able to sequester cations to the middle of the purine stretches. The least responsive structure was a DNA polyadenine-polythymine duplex. A random sequence DNA duplex contorted into an RNA-like helix resulted in cationic dynamics similar to RNA systems. These studies showed that cation diffusive binding events in nucleic acid duplex structures are sequence specific and heavily influenced by structural aspects helical forms to account for much of the differences observed. Although structural information in nucleic acids is encoded within their sequence, linking amino acid sequence to protein structure is murkier; the structural information within proteins is encoded by the folding process itself: a complex phenomenon driven toward the equilibrium state of the active conformation. Upwards of two thirds of a protein's sequence can be substituted with similar amino acids without significantly perturbing its function; conserved residues of about 10% seem to be vital; since evolutionary selection pressure in proteins operates 3-dimenionally, a linear sequence is partially informative. We explored this problem by folding de-novo the cytosolic portion of the membrane protein, cellulose synthase, CESA1 from upland cotton, Gossypium hirsutum (Ghcesa1). The cytoplasmic region was generated by homology modeling and refined with molecular dynamics. These mutations impair local structural flexibility which likely results in cellulose that is produced at a lower rate and is less crystalline. Additional modeling of fragments of cellulose synthases from the model plant, Arabidopsis thaliana, offered novel insights into the function of conserved cytosolic domains within plant cellulose synthases. Transport mechanisms related to the transmembrane region revealed significant differences between plants and a bacterial complex. These studies generated possible mutations that may allow for the creation of new synthases and identified other avenues of research in order to develop technologies that may alter the crystallinity and other useful properties of cellulose. 1. Karplus, K., SAM-T08, HMM-based protein structure prediction. Nucleic Acids Research, 2009. 37: p. W492-W497.
NASA Astrophysics Data System (ADS)
Drake, J.; Mass, T.; Haramaty, L.; Zelzion, U.; Bhattacharya, D.; Falkowski, P. G.
2012-12-01
Carbonate formation by biological organisms is catalyzed by a set of proteins. In corals, the proteins form a subset of a poorly characterized skeletal organic matrix (SOM). This matrix is not simply cells occluded in the mineral, but is instead a suite of biomolecules secreted from cells for the purpose of nucleation and/or scaffolding. However, the mechanism(s) for SOM's role in biomineral formation remain to be elucidated, in part because, for many organisms including stony corals, the organic molecules have yet to be characterized much less modeled. In an effort to understand the calcification process, we sequenced the SOM protein complex in the zooxanthellate coral, Stylophora pistillata, by liquid chromatography-tandem mass spectrometry. Our analysis reveals several 'framework' proteins as well as three highly acidic proteins (proteins that contain >30% aspartic and glutamic acids). The SOM framework proteins show sequence homology with other stony corals as well as with calcite biomineralizers. Several of these proteins exhibit calcium-binding domains, while others are likely involved in attachment of the coral calicoblastic layer to the newly formed skeleton substrate. We have begun to express and purify the framework proteins to (1) confirm and visualize their presence in the extracted SOM and in intact skeleton by antibody staining and immunolocalization, and (2) test their interaction with the highly acidic SOM proteins that may direct aragonite nucleation. This work is the first comprehensive proteomic analysis of coral SOM. Together with our genomic work investigating highly acidic SOM candidates (Mass et al. 2012 AGU Fall Meeting abstract), this will allow us to construct a three-dimensional model of the coral calcifying space to better understand the mechanisms of coral biomineralization.
Evaluation of pleural and pericardial effusions by magnetic resonance imaging.
Tscholakoff, D; Sechtem, U; de Geer, G; Schmidt, H; Higgins, C B
1987-08-01
MR examinations of 36 patients with pleural and/or pericardial effusions were retrospectively evaluated. The purpose of this study was to determine of MR imaging is capable of differentiating between pleural and pericardial effusions of different compositions using standard electrocardiogram (ECG)-gated and non-gated spin echo pulse sequences. Additional data was obtained from experimental pleural effusions in 10 dogs. The results of this study indicate that old hemorrhages into the pleural or pericardial space can be differentiated from other pleural or pericardial effusions. However, further differentiation between transudates, exudates and sanguinous effusions is not possible on MR images acquired with standard spin echo pulse sequences. Respiratory and cardiac motion are responsible for signal loss, particularly on first echo images. This was documented in experiments in dogs with induced effusions of known composition; "negative" T2 values consistent with fluid motion during imaging sequences were observed in 80% of cases. However, postmortem studies of the dogs with experimental effusions showed differences between effusions with low protein concentrations and higher protein concentrations. We conclude from our study that characterization of pleural and pericardial effusions on standard ECG-gated and non-gated MR examinations is limited to the positive identification of hemorrhage. Motion of the fluid due to cardiac and respiratory activity causes artifactual and unpredictable changes in intensity values negating the more subtle differences in intensity associated with increasing protein content.
Visualizing and Clustering Protein Similarity Networks: Sequences, Structures, and Functions.
Mai, Te-Lun; Hu, Geng-Ming; Chen, Chi-Ming
2016-07-01
Research in the recent decade has demonstrated the usefulness of protein network knowledge in furthering the study of molecular evolution of proteins, understanding the robustness of cells to perturbation, and annotating new protein functions. In this study, we aimed to provide a general clustering approach to visualize the sequence-structure-function relationship of protein networks, and investigate possible causes for inconsistency in the protein classifications based on sequences, structures, and functions. Such visualization of protein networks could facilitate our understanding of the overall relationship among proteins and help researchers comprehend various protein databases. As a demonstration, we clustered 1437 enzymes by their sequences and structures using the minimum span clustering (MSC) method. The general structure of this protein network was delineated at two clustering resolutions, and the second level MSC clustering was found to be highly similar to existing enzyme classifications. The clustering of these enzymes based on sequence, structure, and function information is consistent with each other. For proteases, the Jaccard's similarity coefficient is 0.86 between sequence and function classifications, 0.82 between sequence and structure classifications, and 0.78 between structure and function classifications. From our clustering results, we discussed possible examples of divergent evolution and convergent evolution of enzymes. Our clustering approach provides a panoramic view of the sequence-structure-function network of proteins, helps visualize the relation between related proteins intuitively, and is useful in predicting the structure and function of newly determined protein sequences.
MIPS: a database for protein sequences, homology data and yeast genome information.
Mewes, H W; Albermann, K; Heumann, K; Liebl, S; Pfeiffer, F
1997-01-01
The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database (,). MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/ ) MIPS permits internet access to sequence databases, homology data and to yeast genome information. (i) Sequence similarity results from the FASTA program () are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure () developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome (), the functional classification of yeast genes (FunCat) and its graphical display, the 'Genome Browser' (). A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request. PMID:9016498
BLAST and FASTA similarity searching for multiple sequence alignment.
Pearson, William R
2014-01-01
BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA sequences based on excess sequence similarity. If two sequences share much more similarity than expected by chance, the simplest explanation for the excess similarity is common ancestry-homology. The most effective similarity searches compare protein sequences, rather than DNA sequences, for sequences that encode proteins, and use expectation values, rather than percent identity, to infer homology. The BLAST and FASTA packages of sequence comparison programs provide programs for comparing protein and DNA sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to protein databases routinely allow evolutionary look back times from 1 to 2 billion years; DNA:DNA searches are 5-10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be downloaded and installed on local computers. With local installation, target databases can be customized for the sequence data being characterized. With today's very large protein databases, search sensitivity can also be improved by searching smaller comprehensive databases, for example, a complete protein set from an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies target for distant evolutionary relationships; for comparisons involving short domains or queries, or searches that seek relatively close homologs (e.g. mouse-human), shallower scoring matrices will be more effective. Both BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein sequences that diverged more than 2 billion years ago.
Characterization and Prediction of Protein Phosphorylation Hotspots in Arabidopsis thaliana.
Christian, Jan-Ole; Braginets, Rostyslav; Schulze, Waltraud X; Walther, Dirk
2012-01-01
The regulation of protein function by modulating the surface charge status via sequence-locally enriched phosphorylation sites (P-sites) in so called phosphorylation "hotspots" has gained increased attention in recent years. We set out to identify P-hotspots in the model plant Arabidopsis thaliana. We analyzed the spacing of experimentally detected P-sites within peptide-covered regions along Arabidopsis protein sequences as available from the PhosPhAt database. Confirming earlier reports (Schweiger and Linial, 2010), we found that, indeed, P-sites tend to cluster and that distributions between serine and threonine P-sites to their respected closest next P-site differ significantly from those for tyrosine P-sites. The ability to predict P-hotspots by applying available computational P-site prediction programs that focus on identifying single P-sites was observed to be severely compromised by the inevitable interference of nearby P-sites. We devised a new approach, named HotSPotter, for the prediction of phosphorylation hotspots. HotSPotter is based primarily on local amino acid compositional preferences rather than sequence position-specific motifs and uses support vector machines as the underlying classification engine. HotSPotter correctly identified experimentally determined phosphorylation hotspots in A. thaliana with high accuracy. Applied to the Arabidopsis proteome, HotSPotter-predicted 13,677 candidate P-hotspots in 9,599 proteins corresponding to 7,847 unique genes. Hotspot containing proteins are involved predominantly in signaling processes confirming the surmised modulating role of hotspots in signaling and interaction events. Our study provides new bioinformatics means to identify phosphorylation hotspots and lays the basis for further investigating novel candidate P-hotspots. All phosphorylation hotspot annotations and predictions have been made available as part of the PhosPhAt database at http://phosphat.mpimp-golm.mpg.de.
Isolation and characterization of target sequences of the chicken CdxA homeobox gene.
Margalit, Y; Yarus, S; Shapira, E; Gruenbaum, Y; Fainsod, A
1993-01-01
The DNA binding specificity of the chicken homeodomain protein CDXA was studied. Using a CDXA-glutathione-S-transferase fusion protein, DNA fragments containing the binding site for this protein were isolated. The sources of DNA were oligonucleotides with random sequence and chicken genomic DNA. The DNA fragments isolated were sequenced and tested in DNA binding assays. Sequencing revealed that most DNA fragments are AT rich which is a common feature of homeodomain binding sites. By electrophoretic mobility shift assays it was shown that the different target sequences isolated bind to the CDXA protein with different affinities. The specific sequences bound by the CDXA protein in the genomic fragments isolated, were determined by DNase I footprinting. From the footprinted sequences, the CDXA consensus binding site was determined. The CDXA protein binds the consensus sequence A, A/T, T, A/T, A, T, A/G. The CAUDAL binding site in the ftz promoter is also included in this consensus sequence. When tested, some of the genomic target sequences were capable of enhancing the transcriptional activity of reporter plasmids when introduced into CDXA expressing cells. This study determined the DNA sequence specificity of the CDXA protein and it also shows that this protein can further activate transcription in cells in culture. Images PMID:7909943
Peptide library synthesis on spectrally encoded beads for multiplexed protein/peptide bioassays
NASA Astrophysics Data System (ADS)
Nguyen, Huy Q.; Brower, Kara; Harink, Björn; Baxter, Brian; Thorn, Kurt S.; Fordyce, Polly M.
2017-02-01
Protein-peptide interactions are essential for cellular responses. Despite their importance, these interactions remain largely uncharacterized due to experimental challenges associated with their measurement. Current techniques (e.g. surface plasmon resonance, fluorescence polarization, and isothermal calorimetry) either require large amounts of purified material or direct fluorescent labeling, making high-throughput measurements laborious and expensive. In this report, we present a new technology for measuring antibody-peptide interactions in vitro that leverages spectrally encoded beads for biological multiplexing. Specific peptide sequences are synthesized directly on encoded beads with a 1:1 relationship between peptide sequence and embedded code, thereby making it possible to track many peptide sequences throughout the course of an experiment within a single small volume. We demonstrate the potential of these bead-bound peptide libraries by: (1) creating a set of 46 peptides composed of 3 commonly used epitope tags (myc, FLAG, and HA) and single amino-acid scanning mutants; (2) incubating with a mixture of fluorescently-labeled antimyc, anti-FLAG, and anti-HA antibodies; and (3) imaging these bead-bound libraries to simultaneously identify the embedded spectral code (and thus the sequence of the associated peptide) and quantify the amount of each antibody bound. To our knowledge, these data demonstrate the first customized peptide library synthesized directly on spectrally encoded beads. While the implementation of the technology provided here is a high-affinity antibody/protein interaction with a small code space, we believe this platform can be broadly applicable to any range of peptide screening applications, with the capability to multiplex into libraries of hundreds to thousands of peptides in a single assay.
Bode, Manuela; Woellhaf, Michael W.; Bohnert, Maria; van der Laan, Martin; Sommer, Frederik; Jung, Martin; Zimmermann, Richard; Schroda, Michael; Herrmann, Johannes M.
2015-01-01
Members of the twin Cx9C protein family constitute the largest group of proteins in the intermembrane space (IMS) of mitochondria. Despite their conserved nature and their essential role in the biogenesis of the respiratory chain, the molecular function of twin Cx9C proteins is largely unknown. We performed a SILAC-based quantitative proteomic analysis to identify interaction partners of the conserved twin Cx9C protein Cox19. We found that Cox19 interacts in a dynamic manner with Cox11, a copper transfer protein that facilitates metalation of the Cu(B) center of subunit 1 of cytochrome c oxidase. The interaction with Cox11 is critical for the stable accumulation of Cox19 in mitochondria. Cox19 consists of a helical hairpin structure that forms a hydrophobic surface characterized by two highly conserved tyrosine-leucine dipeptides. These residues are essential for Cox19 function and its specific binding to a cysteine-containing sequence in Cox11. Our observations suggest that an oxidative modification of this cysteine residue of Cox11 stimulates Cox19 binding, pointing to a redox-regulated interplay of Cox19 and Cox11 that is critical for copper transfer in the IMS and thus for biogenesis of cytochrome c oxidase. PMID:25926683
Okura, Hiromichi; Takahashi, Tsuyoshi; Mihara, Hisakazu
2012-06-01
Successful approaches of de novo protein design suggest a great potential to create novel structural folds and to understand natural rules of protein folding. For these purposes, smaller and simpler de novo proteins have been developed. Here, we constructed smaller proteins by removing the terminal sequences from stable de novo vTAJ proteins and compared stabilities between mutant and original proteins. vTAJ proteins were screened from an α3β3 binary-patterned library which was designed with polar/ nonpolar periodicities of α-helix and β-sheet. vTAJ proteins have the additional terminal sequences due to the method of constructing the genetically repeated library sequences. By removing the parts of the sequences, we successfully obtained the stable smaller de novo protein mutants with fewer amino acid alphabets than the originals. However, these mutants showed the differences on ANS binding properties and stabilities against denaturant and pH change. The terminal sequences, which were designed just as flexible linkers not as secondary structure units, sufficiently affected these physicochemical details. This study showed implications for adjusting protein stabilities by designing N- and C-terminal sequences.
On the Origin of Protein Superfamilies and Superfolds
NASA Astrophysics Data System (ADS)
Magner, Abram; Szpankowski, Wojciech; Kihara, Daisuke
2015-02-01
Distributions of protein families and folds in genomes are highly skewed, having a small number of prevalent superfamiles/superfolds and a large number of families/folds of a small size. Why are the distributions of protein families and folds skewed? Why are there only a limited number of protein families? Here, we employ an information theoretic approach to investigate the protein sequence-structure relationship that leads to the skewed distributions. We consider that protein sequences and folds constitute an information theoretic channel and computed the most efficient distribution of sequences that code all protein folds. The identified distributions of sequences and folds are found to follow a power law, consistent with those observed for proteins in nature. Importantly, the skewed distributions of sequences and folds are suggested to have different origins: the skewed distribution of sequences is due to evolutionary pressure to achieve efficient coding of necessary folds, whereas that of folds is based on the thermodynamic stability of folds. The current study provides a new information theoretic framework for proteins that could be widely applied for understanding protein sequences, structures, functions, and interactions.
Optimizing high performance computing workflow for protein functional annotation.
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-09-10
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
Optimizing high performance computing workflow for protein functional annotation
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-01-01
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296
Metamorphic Proteins: Emergence of Dual Protein Folds from One Primary Sequence.
Lella, Muralikrishna; Mahalakshmi, Radhakrishnan
2017-06-20
Every amino acid exhibits a different propensity for distinct structural conformations. Hence, decoding how the primary amino acid sequence undergoes the transition to a defined secondary structure and its final three-dimensional fold is presently considered predictable with reasonable certainty. However, protein sequences that defy the first principles of secondary structure prediction (they attain two different folds) have recently been discovered. Such proteins, aptly named metamorphic proteins, decrease the conformational constraint by increasing flexibility in the secondary structure and thereby result in efficient functionality. In this review, we discuss the major factors driving the conformational switch related both to protein sequence and to structure using illustrative examples. We discuss the concept of an evolutionary transition in sequence and structure, the functional impact of the tertiary fold, and the pressure of intrinsic and external factors that give rise to metamorphic proteins. We mainly focus on the major components of protein architecture, namely, the α-helix and β-sheet segments, which are involved in conformational switching within the same or highly similar sequences. These chameleonic sequences are widespread in both cytosolic and membrane proteins, and these folds are equally important for protein structure and function. We discuss the implications of metamorphic proteins and chameleonic peptide sequences in de novo peptide design.
Marsella, Luca; Sirocco, Francesco; Trovato, Antonio; Seno, Flavio; Tosatto, Silvio C.E.
2009-01-01
Motivation: Proteins with solenoid repeats evolve more quickly than non-repetitive ones and their periodicity may be rapidly hidden at sequence level, while still evident in structure. In order to identify these repeats, we propose here a novel method based on a metric characterizing amino-acid properties (polarity, secondary structure, molecular volume, codon diversity, electric charge) using five previously derived numerical functions. Results: The five spectra of the candidate sequences coding for structural repeats, obtained by Discrete Fourier Transform (DFT), show common features allowing determination of repeat periodicity with excellent results. Moreover it is possible to introduce a phase space parameterized by two quantities related to the Fourier spectra which allow for a clear distinction between a non-homologous set of globular proteins and proteins with solenoid repeats. The DFT method is shown to be competitive with other state of the art methods in the detection of solenoid structures, while improving its performance especially in the identification of periodicities, since it is able to recognize the actual repeat length in most cases. Moreover it highlights the relevance of local structural propensities in determining solenoid repeats. Availability: A web tool implementing the algorithm presented in the article (REPETITA) is available with additional details on the data sets at the URL: http://protein.bio.unipd.it/repetita/. Contact: silvio.tosatto@unipd.it PMID:19478001
Philippe, Cécile; Krupovic, Mart; Jaomanjaka, Fety; Claisse, Olivier; Petrel, Melina; le Marrec, Claire
2018-01-16
The Gluconobacter phage GC1 is a novel member of the Tectiviridae family isolated from a juice sample collected during dry white wine making. The bacteriophage infects Gluconobacter cerinus , an acetic acid bacterium which represents a spoilage microorganism during wine making, mainly because it is able to produce ethyl alcohol and transform it into acetic acid. Transmission electron microscopy revealed tail-less icosahedral particles with a diameter of ~78 nm. The linear double-stranded DNA genome of GC1 (16,523 base pairs) contains terminal inverted repeats and carries 36 open reading frames, only a handful of which could be functionally annotated. These encode for the key proteins involved in DNA replication (protein-primed family B DNA polymerase) as well as in virion structure and assembly (major capsid protein, genome packaging ATPase (adenosine triphosphatase) and several minor capsid proteins). GC1 is the first tectivirus infecting an alphaproteobacterial host and is thus far the only temperate tectivirus of gram-negative bacteria. Based on distinctive sequence and life-style features, we propose that GC1 represents a new genus within the Tectiviridae , which we tentatively named " Gammatectivirus ". Furthermore, GC1 helps to bridge the gap in the sequence space between alphatectiviruses and betatectiviruses.
Soler, Miguel A; Rodriguez, Alex; Russo, Anna; Adedeji, Abimbola Feyisara; Dongmo Foumthuim, Cedrix J; Cantarutti, Cristina; Ambrosetti, Elena; Casalis, Loredana; Corazza, Alessandra; Scoles, Giacinto; Marasco, Daniela; Laio, Alessandro; Fortuna, Sara
2017-01-25
The oriented immobilization of proteins, key for the development of novel responsive biomaterials, relies on the availability of effective probes. These are generally provided by standard approaches based on in vivo maturation and in vitro selection of antibodies and/or aptamers. These techniques can suffer technical problems when a non-immunogenic epitope needs to be targeted. Here we propose a strategy to circumvent this issue by in silico design. In our method molecular binders, in the form of cyclic peptides, are computationally evolved by stochastically exploring their sequence and structure space to identify high-affinity peptides for a chosen epitope of a target globular protein: here a solvent-exposed site of β2-microglobulin (β2m). Designed sequences were screened by explicit solvent molecular dynamics simulations (MD) followed by experimental validation. Five candidates gave dose-response surface plasmon resonance signals with dissociation constants in the micromolar range. One of them was further analyzed by means of isothermal titration calorimetry, nuclear magnetic resonance, and 250 ns of MD. Atomic-force microscopy imaging showed that this peptide is able to immobilize β2m on a gold surface. In short, we have shown by a variety of experimental techniques that it is possible to capture a protein through an epitope of choice by computational design.
Lysøe, Erik; Harris, Linda J.; Walkowiak, Sean; Subramaniam, Rajagopal; Divon, Hege H.; Riiser, Even S.; Llorens, Carlos; Gabaldón, Toni; Kistler, H. Corby; Jonkers, Wilfried; Kolseth, Anna-Karin; Nielsen, Kristian F.; Thrane, Ulf; Frandsen, Rasmus J. N.
2014-01-01
Fusarium avenaceum is a fungus commonly isolated from soil and associated with a wide range of host plants. We present here three genome sequences of F. avenaceum, one isolated from barley in Finland and two from spring and winter wheat in Canada. The sizes of the three genomes range from 41.6–43.1 MB, with 13217–13445 predicted protein-coding genes. Whole-genome analysis showed that the three genomes are highly syntenic, and share>95% gene orthologs. Comparative analysis to other sequenced Fusaria shows that F. avenaceum has a very large potential for producing secondary metabolites, with between 75 and 80 key enzymes belonging to the polyketide, non-ribosomal peptide, terpene, alkaloid and indole-diterpene synthase classes. In addition to known metabolites from F. avenaceum, fuscofusarin and JM-47 were detected for the first time in this species. Many protein families are expanded in F. avenaceum, such as transcription factors, and proteins involved in redox reactions and signal transduction, suggesting evolutionary adaptation to a diverse and cosmopolitan ecology. We found that 20% of all predicted proteins were considered to be secreted, supporting a life in the extracellular space during interaction with plant hosts. PMID:25409087
Elman RNN based classification of proteins sequences on account of their mutual information.
Mishra, Pooja; Nath Pandey, Paras
2012-10-21
In the present work we have employed the method of estimating residue correlation within the protein sequences, by using the mutual information (MI) of adjacent residues, based on structural and solvent accessibility properties of amino acids. The long range correlation between nonadjacent residues is improved by constructing a mutual information vector (MIV) for a single protein sequence, like this each protein sequence is associated with its corresponding MIVs. These MIVs are given to Elman RNN to obtain the classification of protein sequences. The modeling power of MIV was shown to be significantly better, giving a new approach towards alignment free classification of protein sequences. We also conclude that sequence structural and solvent accessible property based MIVs are better predictor. Copyright © 2012 Elsevier Ltd. All rights reserved.
Widespread evidence of cooperative DNA binding by transcription factors in Drosophila development.
Kazemian, Majid; Pham, Hannah; Wolfe, Scot A; Brodsky, Michael H; Sinha, Saurabh
2013-09-01
Regulation of eukaryotic gene transcription is often combinatorial in nature, with multiple transcription factors (TFs) regulating common target genes, often through direct or indirect mutual interactions. Many individual examples of cooperative binding by directly interacting TFs have been identified, but it remains unclear how pervasive this mechanism is during animal development. Cooperative TF binding should be manifest in genomic sequences as biased arrangements of TF-binding sites. Here, we explore the extent and diversity of such arrangements related to gene regulation during Drosophila embryogenesis. We used the DNA-binding specificities of 322 TFs along with chromatin accessibility information to identify enriched spacing and orientation patterns of TF-binding site pairs. We developed a new statistical approach for this task, specifically designed to accurately assess inter-site spacing biases while accounting for the phenomenon of homotypic site clustering commonly observed in developmental regulatory regions. We observed a large number of short-range distance preferences between TF-binding site pairs, including examples where the preference depends on the relative orientation of the binding sites. To test whether these binding site patterns reflect physical interactions between the corresponding TFs, we analyzed 27 TF pairs whose binding sites exhibited short distance preferences. In vitro protein-protein binding experiments revealed that >65% of these TF pairs can directly interact with each other. For five pairs, we further demonstrate that they bind cooperatively to DNA if both sites are present with the preferred spacing. This study demonstrates how DNA-binding motifs can be used to produce a comprehensive map of sequence signatures for different mechanisms of combinatorial TF action.
Eisenberg, David; Marcotte, Edward M.; Pellegrini, Matteo; Thompson, Michael J.; Yeates, Todd O.
2002-10-15
A computational method system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B. Another method compares the genomic sequence of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method a combination of the above two methods is used to predict functional links.
Protein Information Resource: a community resource for expert annotation of protein data
Barker, Winona C.; Garavelli, John S.; Hou, Zhenglin; Huang, Hongzhan; Ledley, Robert S.; McGarvey, Peter B.; Mewes, Hans-Werner; Orcutt, Bruce C.; Pfeiffer, Friedhelm; Tsugita, Akira; Vinayaka, C. R.; Xiao, Chunlin; Yeh, Lai-Su L.; Wu, Cathy
2001-01-01
The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200 000 non-redundant protein sequences, which are classified into families and superfamilies and their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-International databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP. PMID:11125041
Gao, Yujuan; Wang, Sheng; Deng, Minghua; Xu, Jinbo
2018-05-08
Protein dihedral angles provide a detailed description of protein local conformation. Predicted dihedral angles can be used to narrow down the conformational space of the whole polypeptide chain significantly, thus aiding protein tertiary structure prediction. However, direct angle prediction from sequence alone is challenging. In this article, we present a novel method (named RaptorX-Angle) to predict real-valued angles by combining clustering and deep learning. Tested on a subset of PDB25 and the targets in the latest two Critical Assessment of protein Structure Prediction (CASP), our method outperforms the existing state-of-art method SPIDER2 in terms of Pearson Correlation Coefficient (PCC) and Mean Absolute Error (MAE). Our result also shows approximately linear relationship between the real prediction errors and our estimated bounds. That is, the real prediction error can be well approximated by our estimated bounds. Our study provides an alternative and more accurate prediction of dihedral angles, which may facilitate protein structure prediction and functional study.
Büssow, Konrad; Hoffmann, Steve; Sievert, Volker
2002-12-19
Functional genomics involves the parallel experimentation with large sets of proteins. This requires management of large sets of open reading frames as a prerequisite of the cloning and recombinant expression of these proteins. A Java program was developed for retrieval of protein and nucleic acid sequences and annotations from NCBI GenBank, using the XML sequence format. Annotations retrieved by ORFer include sequence name, organism and also the completeness of the sequence. The program has a graphical user interface, although it can be used in a non-interactive mode. For protein sequences, the program also extracts the open reading frame sequence, if available, and checks its correct translation. ORFer accepts user input in the form of single or lists of GenBank GI identifiers or accession numbers. It can be used to extract complete sets of open reading frames and protein sequences from any kind of GenBank sequence entry, including complete genomes or chromosomes. Sequences are either stored with their features in a relational database or can be exported as text files in Fasta or tabulator delimited format. The ORFer program is freely available at http://www.proteinstrukturfabrik.de/orfer. The ORFer program allows for fast retrieval of DNA sequences, protein sequences and their open reading frames and sequence annotations from GenBank. Furthermore, storage of sequences and features in a relational database is supported. Such a database can supplement a laboratory information system (LIMS) with appropriate sequence information.
DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe.
Wang, Tianmin; Mori, Hiroshi; Zhang, Chong; Kurokawa, Ken; Xing, Xin-Hui; Yamada, Takuji
2015-03-21
Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature-based enzyme functional prediction tool to assign Enzyme Commission (EC) digits. DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes. Our results offer preliminarily confirmation of the existence of the hypothesized huge number of "hidden enzymes" in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.
Protein Sequence Classification with Improved Extreme Learning Machine Algorithms
2014-01-01
Precisely classifying a protein sequence from a large biological protein sequences database plays an important role for developing competitive pharmacological products. Comparing the unseen sequence with all the identified protein sequences and returning the category index with the highest similarity scored protein, conventional methods are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using SLFNs. The recent efficient extreme learning machine (ELM) and its invariants are utilized as the training algorithms. The optimal pruned ELM is first employed for protein sequence classification in this paper. To further enhance the performance, the ensemble based SLFNs structure is constructed where multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the priority of the proposed algorithms. PMID:24795876
Evolution, substrate specificity and subfamily classification of glycoside hydrolase family 5 (GH5).
Aspeborg, Henrik; Coutinho, Pedro M; Wang, Yang; Brumer, Harry; Henrissat, Bernard
2012-09-20
The large Glycoside Hydrolase family 5 (GH5) groups together a wide range of enzymes acting on β-linked oligo- and polysaccharides, and glycoconjugates from a large spectrum of organisms. The long and complex evolution of this family of enzymes and its broad sequence diversity limits functional prediction. With the objective of improving the differentiation of enzyme specificities in a knowledge-based context, and to obtain new evolutionary insights, we present here a new, robust subfamily classification of family GH5. About 80% of the current sequences were assigned into 51 subfamilies in a global analysis of all publicly available GH5 sequences and associated biochemical data. Examination of subfamilies with catalytically-active members revealed that one third are monospecific (containing a single enzyme activity), although new functions may be discovered with biochemical characterization in the future. Furthermore, twenty subfamilies presently have no characterization whatsoever and many others have only limited structural and biochemical data. Mapping of functional knowledge onto the GH5 phylogenetic tree revealed that the sequence space of this historical and industrially important family is far from well dispersed, highlighting targets in need of further study. The analysis also uncovered a number of GH5 proteins which have lost their catalytic machinery, indicating evolution towards novel functions. Overall, the subfamily division of GH5 provides an actively curated resource for large-scale protein sequence annotation for glycogenomics; the subfamily assignments are openly accessible via the Carbohydrate-Active Enzyme database at http://www.cazy.org/GH5.html.
Ma, Jun; Wu, Kaiming; Zhao, Zhenxian; Miao, Rong; Xu, Zhe
2017-03-01
Esophageal squamous cell carcinoma is one of the most aggressive malignancies worldwide. Special AT-rich sequence binding protein 1 is a nuclear matrix attachment region binding protein which participates in higher order chromatin organization and tissue-specific gene expression. However, the role of special AT-rich sequence binding protein 1 in esophageal squamous cell carcinoma remains unknown. In this study, western blot and quantitative real-time polymerase chain reaction analysis were performed to identify differentially expressed special AT-rich sequence binding protein 1 in a series of esophageal squamous cell carcinoma tissue samples. The effects of special AT-rich sequence binding protein 1 silencing by two short-hairpin RNAs on cell proliferation, migration, and invasion were assessed by the CCK-8 assay and transwell assays in esophageal squamous cell carcinoma in vitro. Special AT-rich sequence binding protein 1 was significantly upregulated in esophageal squamous cell carcinoma tissue samples and cell lines. Silencing of special AT-rich sequence binding protein 1 inhibited the proliferation of KYSE450 and EC9706 cells which have a relatively high level of special AT-rich sequence binding protein 1, and the ability of migration and invasion of KYSE450 and EC9706 cells was distinctly suppressed. Special AT-rich sequence binding protein 1 could be a potential target for the treatment of esophageal squamous cell carcinoma and inhibition of special AT-rich sequence binding protein 1 may provide a new strategy for the prevention of esophageal squamous cell carcinoma invasion and metastasis.
Automatic Prediction of Protein 3D Structures by Probabilistic Multi-template Homology Modeling.
Meier, Armin; Söding, Johannes
2015-10-01
Homology modeling predicts the 3D structure of a query protein based on the sequence alignment with one or more template proteins of known structure. Its great importance for biological research is owed to its speed, simplicity, reliability and wide applicability, covering more than half of the residues in protein sequence space. Although multiple templates have been shown to generally increase model quality over single templates, the information from multiple templates has so far been combined using empirically motivated, heuristic approaches. We present here a rigorous statistical framework for multi-template homology modeling. First, we find that the query proteins' atomic distance restraints can be accurately described by two-component Gaussian mixtures. This insight allowed us to apply the standard laws of probability theory to combine restraints from multiple templates. Second, we derive theoretically optimal weights to correct for the redundancy among related templates. Third, a heuristic template selection strategy is proposed. We improve the average GDT-ha model quality score by 11% over single template modeling and by 6.5% over a conventional multi-template approach on a set of 1000 query proteins. Robustness with respect to wrong constraints is likewise improved. We have integrated our multi-template modeling approach with the popular MODELLER homology modeling software in our free HHpred server http://toolkit.tuebingen.mpg.de/hhpred and also offer open source software for running MODELLER with the new restraints at https://bitbucket.org/soedinglab/hh-suite.
Roca, Alberto I
2014-01-01
The 2013 BioVis Contest provided an opportunity to evaluate different paradigms for visualizing protein multiple sequence alignments. Such data sets are becoming extremely large and thus taxing current visualization paradigms. Sequence Logos represent consensus sequences but have limitations for protein alignments. As an alternative, ProfileGrids are a new protein sequence alignment visualization paradigm that represents an alignment as a color-coded matrix of the residue frequency occurring at every homologous position in the aligned protein family. The JProfileGrid software program was used to analyze the BioVis contest data sets to generate figures for comparison with the Sequence Logo reference images. The ProfileGrid representation allows for the clear and effective analysis of protein multiple sequence alignments. This includes both a general overview of the conservation and diversity sequence patterns as well as the interactive ability to query the details of the protein residue distributions in the alignment. The JProfileGrid software is free and available from http://www.ProfileGrid.org.
Construction of an agglutination tool: recombinant Fab fragments biotinylated in vitro.
Czerwinski, Marcin; Krop-Watorek, Anna; Wasniowska, Kazimiera; Smolarek, Dorota; Spitalnik, Steven L
2009-11-30
The pComb3H vector system is used for constructing and panning recombinant antibody libraries. It allows for expression of monovalent Fab fragments, either on the surface of M13 phage, or in the form of soluble proteins secreted into the periplasmic space of bacteria. We constructed a modified pComb3H vector containing cDNA encoding for a 23-amino acid fragment of the Escherichia coli biotin carboxy carrier protein (BCCP), which is an acceptor sequence for biotinylation. The vector was used to express the Fab fragment recognizing human glycophorin A. The purified Fab fragment containing this biotin acceptor sequence was effectively biotinylated in vitro using biotin ligase (BirA). The specificity and avidity of the biotinylated Fab fragments were similar to the previously produced, unmodified Fab fragments. An avidin-alkaline phosphatase conjugate was used to detect the recombinant Fab fragments, instead of secondary antibody. In addition, when biotinylated Fab fragments were mixed with avidin, red blood cells were directly agglutinated.
A topological approach for protein classification
Cang, Zixuan; Mu, Lin; Wu, Kedi; ...
2015-11-04
Here, protein function and dynamics are closely related to its sequence and structure. However, prediction of protein function and dynamics from its sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done through measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.
Dong, Zheng; Zhou, Hongyu; Tao, Peng
2018-02-01
PAS domains are widespread in archaea, bacteria, and eukaryota, and play important roles in various functions. In this study, we aim to explore functional evolutionary relationship among proteins in the PAS domain superfamily in view of the sequence-structure-dynamics-function relationship. We collected protein sequences and crystal structure data from RCSB Protein Data Bank of the PAS domain superfamily belonging to three biological functions (nucleotide binding, photoreceptor activity, and transferase activity). Protein sequences were aligned and then used to select sequence-conserved residues and build phylogenetic tree. Three-dimensional structure alignment was also applied to obtain structure-conserved residues. The protein dynamics were analyzed using elastic network model (ENM) and validated by molecular dynamics (MD) simulation. The result showed that the proteins with same function could be grouped by sequence similarity, and proteins in different functional groups displayed statistically significant difference in their vibrational patterns. Interestingly, in all three functional groups, conserved amino acid residues identified by sequence and structure conservation analysis generally have a lower fluctuation than other residues. In addition, the fluctuation of conserved residues in each biological function group was strongly correlated with the corresponding biological function. This research suggested a direct connection in which the protein sequences were related to various functions through structural dynamics. This is a new attempt to delineate functional evolution of proteins using the integrated information of sequence, structure, and dynamics. © 2017 The Protein Society.
Dissecting the relationship between protein structure and sequence variation
NASA Astrophysics Data System (ADS)
Shahmoradi, Amir; Wilke, Claus; Wilke Lab Team
2015-03-01
Over the past decade several independent works have shown that some structural properties of proteins are capable of predicting protein evolution. The strength and significance of these structure-sequence relations, however, appear to vary widely among different proteins, with absolute correlation strengths ranging from 0 . 1 to 0 . 8 . Here we present the results from a comprehensive search for the potential biophysical and structural determinants of protein evolution by studying more than 200 structural and evolutionary properties in a dataset of 209 monomeric enzymes. We discuss the main protein characteristics responsible for the general patterns of protein evolution, and identify sequence divergence as the main determinant of the strengths of virtually all structure-evolution relationships, explaining ~ 10 - 30 % of observed variation in sequence-structure relations. In addition to sequence divergence, we identify several protein structural properties that are moderately but significantly coupled with the strength of sequence-structure relations. In particular, proteins with more homogeneous back-bone hydrogen bond energies, large fractions of helical secondary structures and low fraction of beta sheets tend to have the strongest sequence-structure relation. BEACON-NSF center for the study of evolution in action.
Protein Interaction Profile Sequencing (PIP-seq).
Foley, Shawn W; Gregory, Brian D
2016-10-10
Every eukaryotic RNA transcript undergoes extensive post-transcriptional processing from the moment of transcription up through degradation. This regulation is performed by a distinct cohort of RNA-binding proteins which recognize their target transcript by both its primary sequence and secondary structure. Here, we describe protein interaction profile sequencing (PIP-seq), a technique that uses ribonuclease-based footprinting followed by high-throughput sequencing to globally assess both protein-bound RNA sequences and RNA secondary structure. PIP-seq utilizes single- and double-stranded RNA-specific nucleases in the absence of proteins to infer RNA secondary structure. These libraries are also compared to samples that undergo nuclease digestion in the presence of proteins in order to find enriched protein-bound sequences. Combined, these four libraries provide a comprehensive, transcriptome-wide view of RNA secondary structure and RNA protein interaction sites from a single experimental technique. © 2016 by John Wiley & Sons, Inc. Copyright © 2016 John Wiley & Sons, Inc.
Lu, Hui-Meng; Yin, Da-Chuan; Ye, Ya-Jing; Luo, Hui-Min; Geng, Li-Qiang; Li, Hai-Sheng; Guo, Wei-Hong; Shang, Peng
2009-01-01
As the most widely utilized technique to determine the 3-dimensional structure of protein molecules, X-ray crystallography can provide structure of the highest resolution among the developed techniques. The resolution obtained via X-ray crystallography is known to be influenced by many factors, such as the crystal quality, diffraction techniques, and X-ray sources, etc. In this paper, the authors found that the protein sequence could also be one of the factors. We extracted information of the resolution and the sequence of proteins from the Protein Data Bank (PDB), classified the proteins into different clusters according to the sequence similarity, and statistically analyzed the relationship between the sequence similarity and the best resolution obtained. The results showed that there was a pronounced correlation between the sequence similarity and the obtained resolution. These results indicate that protein structure itself is one variable that may affect resolution when X-ray crystallography is used.
Ridley, R G; Patel, H V; Gerber, G E; Morton, R C; Freeman, K B
1986-01-01
A cDNA clone spanning the entire amino acid sequence of the nuclear-encoded uncoupling protein of rat brown adipose tissue mitochondria has been isolated and sequenced. With the exception of the N-terminal methionine the deduced N-terminus of the newly synthesized uncoupling protein is identical to the N-terminal 30 amino acids of the native uncoupling protein as determined by protein sequencing. This proves that the protein contains no N-terminal mitochondrial targeting prepiece and that a targeting region must reside within the amino acid sequence of the mature protein. Images PMID:3012461
MIPS: a database for genomes and protein sequences.
Mewes, H W; Heumann, K; Kaps, A; Mayer, K; Pfeiffer, F; Stocker, S; Frishman, D
1999-01-01
The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried near Munich, Germany, develops and maintains genome oriented databases. It is commonplace that the amount of sequence data available increases rapidly, but not the capacity of qualified manual annotation at the sequence databases. Therefore, our strategy aims to cope with the data stream by the comprehensive application of analysis tools to sequences of complete genomes, the systematic classification of protein sequences and the active support of sequence analysis and functional genomics projects. This report describes the systematic and up-to-date analysis of genomes (PEDANT), a comprehensive database of the yeast genome (MYGD), a database reflecting the progress in sequencing the Arabidopsis thaliana genome (MATD), the database of assembled, annotated human EST clusters (MEST), and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). MIPS provides access through its WWW server (http://www.mips.biochem.mpg.de) to a spectrum of generic databases, including the above mentioned as well as a database of protein families (PROTFAM), the MITOP database, and the all-against-all FASTA database. PMID:9847138
Spaces of ideal convergent sequences.
Mursaleen, M; Sharma, Sunil K
2014-01-01
In the present paper, we introduce some sequence spaces using ideal convergence and Musielak-Orlicz function ℳ = (M(k)). We also examine some topological properties of the resulting sequence spaces.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cang, Zixuan; Mu, Lin; Wu, Kedi
Here, protein function and dynamics are closely related to its sequence and structure. However, prediction of protein function and dynamics from its sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done through measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.
Understanding protein evolution: from protein physics to Darwinian selection.
Zeldovich, Konstantin B; Shakhnovich, Eugene I
2008-01-01
Efforts in whole-genome sequencing and structural proteomics start to provide a global view of the protein universe, the set of existing protein structures and sequences. However, approaches based on the selection of individual sequences have not been entirely successful at the quantitative description of the distribution of structures and sequences in the protein universe because evolutionary pressure acts on the entire organism, rather than on a particular molecule. In parallel to this line of study, studies in population genetics and phenomenological molecular evolution established a mathematical framework to describe the changes in genome sequences in populations of organisms over time. Here, we review both microscopic (physics-based) and macroscopic (organism-level) models of protein-sequence evolution and demonstrate that bridging the two scales provides the most complete description of the protein universe starting from clearly defined, testable, and physiologically relevant assumptions.
ChemProt-2.0: visual navigation in a disease chemical biology database
Kim Kjærulff, Sonny; Wich, Louis; Kringelum, Jens; Jacobsen, Ulrik P.; Kouskoumvekaki, Irene; Audouze, Karine; Lund, Ole; Brunak, Søren; Oprea, Tudor I.; Taboureau, Olivier
2013-01-01
ChemProt-2.0 (http://www.cbs.dtu.dk/services/ChemProt-2.0) is a public available compilation of multiple chemical–protein annotation resources integrated with diseases and clinical outcomes information. The database has been updated to >1.15 million compounds with 5.32 millions bioactivity measurements for 15 290 proteins. Each protein is linked to quality-scored human protein–protein interactions data based on more than half a million interactions, for studying diseases and biological outcomes (diseases, pathways and GO terms) through protein complexes. In ChemProt-2.0, therapeutic effects as well as adverse drug reactions have been integrated allowing for suggesting proteins associated to clinical outcomes. New chemical structure fingerprints were computed based on the similarity ensemble approach. Protein sequence similarity search was also integrated to evaluate the promiscuity of proteins, which can help in the prediction of off-target effects. Finally, the database was integrated into a visual interface that enables navigation of the pharmacological space for small molecules. Filtering options were included in order to facilitate and to guide dynamic search of specific queries. PMID:23185041
Darré, Leonardo; Machado, Matías Rodrigo; Brandner, Astrid Febe; González, Humberto Carlos; Ferreira, Sebastián; Pantano, Sergio
2015-02-10
Modeling of macromolecular structures and interactions represents an important challenge for computational biology, involving different time and length scales. However, this task can be facilitated through the use of coarse-grained (CG) models, which reduce the number of degrees of freedom and allow efficient exploration of complex conformational spaces. This article presents a new CG protein model named SIRAH, developed to work with explicit solvent and to capture sequence, temperature, and ionic strength effects in a topologically unbiased manner. SIRAH is implemented in GROMACS, and interactions are calculated using a standard pairwise Hamiltonian for classical molecular dynamics simulations. We present a set of simulations that test the capability of SIRAH to produce a qualitatively correct solvation on different amino acids, hydrophilic/hydrophobic interactions, and long-range electrostatic recognition leading to spontaneous association of unstructured peptides and stable structures of single polypeptides and protein-protein complexes.
Trindade, Inês B.; Fonseca, Bruno M.; Matias, Pedro M.; Louro, Ricardo O.; Moe, Elin
2016-01-01
Siderophore-binding proteins (SIPs) perform a key role in iron acquisition in multiple organisms. In the genome of the marine bacterium Shewanella frigidimarina NCIMB 400, the gene tagged as SFRI_RS12295 encodes a protein from this family. Here, the cloning, expression, purification and crystallization of this protein are reported, together with its preliminary X-ray crystallographic analysis to 1.35 Å resolution. The SIP crystals belonged to the monoclinic space group P21, with unit-cell parameters a = 48.04, b = 78.31, c = 67.71 Å, α = 90, β = 99.94, γ = 90°, and are predicted to contain two molecules per asymmetric unit. Structure determination by molecular replacement and the use of previously determined ∼2 Å resolution SIP structures with ∼30% sequence identity as templates are ongoing. PMID:27599855
Time, space, and disorder in the expanding proteome universe.
Minde, David-Paul; Dunker, A Keith; Lilley, Kathryn S
2017-04-01
Proteins are highly dynamic entities. Their myriad functions require specific structures, but proteins' dynamic nature ranges all the way from the local mobility of their amino acid constituents to mobility within and well beyond single cells. A truly comprehensive view of the dynamic structural proteome includes: (i) alternative sequences, (ii) alternative conformations, (iii) alternative interactions with a range of biomolecules, (iv) cellular localizations, (v) alternative behaviors in different cell types. While these aspects have traditionally been explored one protein at a time, we highlight recently emerging global approaches that accelerate comprehensive insights into these facets of the dynamic nature of protein structure. Computational tools that integrate and expand on multiple orthogonal data types promise to enable the transition from a disjointed list of static snapshots to a structurally explicit understanding of the dynamics of cellular mechanisms. © 2017 The Authors. Proteomics Published by Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
Choice of High-Efficacy Strains for the Annual Influenza Vaccine
NASA Astrophysics Data System (ADS)
Deem, Michael
2005-03-01
We introduce a model of protein evolution to explain limitations in the immune system response to vaccination and disease [1]. The phenomenon of original antigenic sin, wherein vaccination creates memory sequences that can increase susceptibility to future exposures to the same disease, is explained as stemming from localization of the immune system response in antibody sequence space. This localization is a result of the roughness in sequence space of the evolved antibody affinity constant for antigen and is observed for diseases with high year-to-year mutation rates, such as influenza. We show that the order parameter within this theory correlates well with efficacies of the H3N2 influenza A component of the annual vaccine between 1971 and 2004 [2,3]. This new measure of antigenic distance predicts vaccine efficacy significantly more accurately than do current state-of-the-art phylogenetic sequence analyses or ferret antisera inhibition assays. We discuss how this new measure of antigenic distance may be used in the context of annual influenza vaccine design and monitoring of vaccine efficacy. 1) M. W. Deem and H. Y. Lee, Phys. Rev. Lett. 91 (2003) 068101. 2) E. T. Munoz and M. W. Deem,q-bio.BM/0408016. 3) V. Gupta, D. J. Earl, and M. W. Deem, ``Choice of High-Efficacy Strains for the Annual Influenza Vaccine,'' submitted.
Ołdziej, S; Czaplewski, C; Liwo, A; Chinchio, M; Nanias, M; Vila, J A; Khalili, M; Arnautova, Y A; Jagielska, A; Makowski, M; Schafroth, H D; Kaźmierkiewicz, R; Ripoll, D R; Pillardy, J; Saunders, J A; Kang, Y K; Gibson, K D; Scheraga, H A
2005-05-24
Recent improvements in the protein-structure prediction method developed in our laboratory, based on the thermodynamic hypothesis, are described. The conformational space is searched extensively at the united-residue level by using our physics-based UNRES energy function and the conformational space annealing method of global optimization. The lowest-energy coarse-grained structures are then converted to an all-atom representation and energy-minimized with the ECEPP/3 force field. The procedure was assessed in two recent blind tests of protein-structure prediction. During the first blind test, we predicted large fragments of alpha and alpha+beta proteins [60-70 residues with C(alpha) rms deviation (rmsd) <6 A]. However, for alpha+beta proteins, significant topological errors occurred despite low rmsd values. In the second exercise, we predicted whole structures of five proteins (two alpha and three alpha+beta, with sizes of 53-235 residues) with remarkably good accuracy. In particular, for the genomic target TM0487 (a 102-residue alpha+beta protein from Thermotoga maritima), we predicted the complete, topologically correct structure with 7.3-A C(alpha) rmsd. So far this protein is the largest alpha+beta protein predicted based solely on the amino acid sequence and a physics-based potential-energy function and search procedure. For target T0198, a phosphate transport system regulator PhoU from T. maritima (a 235-residue mainly alpha-helical protein), we predicted the topology of the whole six-helix bundle correctly within 8 A rmsd, except the 32 C-terminal residues, most of which form a beta-hairpin. These and other examples described in this work demonstrate significant progress in physics-based protein-structure prediction.
Process for Encapsulating Protein Crystals
NASA Technical Reports Server (NTRS)
Morrison, Dennis R.; Mosier, Benjamin
2003-01-01
A process for growing protein crystals encapsulated within membranes has been invented. This process begins with the encapsulation of a nearly saturated aqueous protein solution inside semipermeable membranes to form microcapsules. The encapsulation is effected by use of special formulations of a dissolved protein and a surfactant in an aqueous first liquid phase, which is placed into contact with a second, immiscible liquid phase that contains one or more polymers that are insoluble in the first phase. The second phase becomes formed into the semipermeable membranes that surround microglobules of the first phase, thereby forming the microcapsules. Once formed, the microcapsules are then dehydrated osmotically by exposure to a concentrated salt or polymer solution. The dehydration forms supersaturated solutions inside the microcapsules, thereby enabling nucleation and growth of protein crystals inside the microcapsules. By suitable formulation of the polymer or salt solution and of other physical and chemical parameters, one can control the rate of transport of water out of the microcapsules through the membranes and thereby create physicochemical conditions that favor the growth, within each microcapsule, of one or a few large crystals suitable for analysis by x-ray diffraction. The membrane polymer can be formulated to consist of low-molecular-weight molecules that do not interfere with the x-ray diffraction analysis of the encapsulated crystals. During dehydration, an electrostatic field can be applied to exert additional control over the rate of dehydration. This protein-crystal-encapsulation process is expected to constitute the basis of protein-growth experiments to be performed on the space shuttle and the International Space Station. As envisioned, the experiments would involve the exposure of immiscible liquids to each other in sequences of steps under microgravitational conditions. The experiments are expected to contribute to knowledge of the precise conditions under which protein crystals form. By enhancing the ability to grow crystals suitable for x-ray diffraction analysis, this knowledge can be expected to benefit not only the space program but also medicine and the pharmaceutical industry.
Tanaka, Junko; Doi, Nobuhide; Takashima, Hideaki; Yanagawa, Hiroshi
2010-01-01
Screening of functional proteins from a random-sequence library has been used to evolve novel proteins in the field of evolutionary protein engineering. However, random-sequence proteins consisting of the 20 natural amino acids tend to aggregate, and the occurrence rate of functional proteins in a random-sequence library is low. From the viewpoint of the origin of life, it has been proposed that primordial proteins consisted of a limited set of amino acids that could have been abundantly formed early during chemical evolution. We have previously found that members of a random-sequence protein library constructed with five primitive amino acids show high solubility (Doi et al., Protein Eng Des Sel 2005;18:279–284). Although such a library is expected to be appropriate for finding functional proteins, the functionality may be limited, because they have no positively charged amino acid. Here, we constructed three libraries of 120-amino acid, random-sequence proteins using alphabets of 5, 12, and 20 amino acids by preselection using mRNA display (to eliminate sequences containing stop codons and frameshifts) and characterized and compared the structural properties of random-sequence proteins arbitrarily chosen from these libraries. We found that random-sequence proteins constructed with the 12-member alphabet (including five primitive amino acids and positively charged amino acids) have higher solubility than those constructed with the 20-member alphabet, though other biophysical properties are very similar in the two libraries. Thus, a library of moderate complexity constructed from 12 amino acids may be a more appropriate resource for functional screening than one constructed from 20 amino acids. PMID:20162614
Diversity of the P2 protein among nontypeable Haemophilus influenzae isolates.
Bell, J; Grass, S; Jeanteur, D; Munson, R S
1994-01-01
The genes for outer membrane protein P2 of four nontypeable Haemophilus influenzae strains were cloned and sequenced. The derived amino acid sequences were compared with the outer membrane protein P2 sequence from H. influenzae type b MinnA and the sequences of P2 from three additional nontypeable H. influenzae strains. The sequences were 76 to 94% identical. The sequences had regions with considerable variability separated by regions which were highly conserved. The variable regions mapped to putative surface-exposed loops of the protein. PMID:8188390
Kimura, M; Kimura, J; Hatakeyama, T
1988-11-21
The complete amino acid sequences of ribosomal proteins S11 from the Gram-positive eubacterium Bacillus stearothermophilus and of S19 from the archaebacterium Halobacterium marismortui have been determined. A search for homologous sequences of these proteins revealed that they belong to the ribosomal protein S11 family. Homologous proteins have previously been sequenced from Escherichia coli as well as from chloroplast, yeast and mammalian ribosomes. A pairwise comparison of the amino acid sequences showed that Bacillus protein S11 shares 68% identical residues with S11 from Escherichia coli and a slightly lower homology (52%) with the homologous chloroplast protein. The halophilic protein S19 is more related to the eukaryotic (45-49%) than to the eubacterial counterparts (35%).
Prestat, Emmanuel; David, Maude M.; Hultman, Jenni; ...
2014-09-26
A new functional gene database, FOAM (Functional Ontology Assignments for Metagenomes), was developed to screen environmental metagenomic sequence datasets. FOAM provides a new functional ontology dedicated to classify gene functions relevant to environmental microorganisms based on Hidden Markov Models (HMMs). Sets of aligned protein sequences (i.e. ‘profiles’) were tailored to a large group of target KEGG Orthologs (KOs) from which HMMs were trained. The alignments were checked and curated to make them specific to the targeted KO. Within this process, sequence profiles were enriched with the most abundant sequences available to maximize the yield of accurate classifier models. An associatedmore » functional ontology was built to describe the functional groups and hierarchy. FOAM allows the user to select the target search space before HMM-based comparison steps and to easily organize the results into different functional categories and subcategories. FOAM is publicly available at http://portal.nersc.gov/project/m1317/FOAM/.« less
Yefremova, Yelena; Al-Majdoub, Mahmoud; Opuni, Kwabena F M; Koy, Cornelia; Cui, Weidong; Yan, Yuetian; Gross, Michael L; Glocker, Michael O
2015-03-01
Mass spectrometric de-novo sequencing was applied to review the amino acid sequence of a commercially available recombinant protein G´ with great scientific and economic importance. Substantial deviations to the published amino acid sequence (Uniprot Q54181) were found by the presence of 46 additional amino acids at the N-terminus, including a so-called "His-tag" as well as an N-terminal partial α-N-gluconoylation and α-N-phosphogluconoylation, respectively. The unexpected amino acid sequence of the commercial protein G' comprised 241 amino acids and resulted in a molecular mass of 25,998.9 ± 0.2 Da for the unmodified protein. Due to the higher mass that is caused by its extended amino acid sequence compared with the original protein G' (185 amino acids), we named this protein "protein G'e." By means of mass spectrometric peptide mapping, the suggested amino acid sequence, as well as the N-terminal partial α-N-gluconoylations, was confirmed with 100% sequence coverage. After the protein G'e sequence was determined, we were able to determine the expression vector pET-28b from Novagen with the Xho I restriction enzyme cleavage site as the best option that was used for cloning and expressing the recombinant protein G'e in E. coli. A dissociation constant (K(d)) value of 9.4 nM for protein G'e was determined thermophoretically, showing that the N-terminal flanking sequence extension did not cause significant changes in the binding affinity to immunoglobulins.
Odronitz, Florian; Kollmar, Martin
2006-11-29
Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. Up to date there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database driven online working environment for the analysis of manually annotated protein sequences and their relationship. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows to conveniently browse the database and to compile tabular and graphical summaries of its content. We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that is generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein.
Rodriguez-Rivas, Juan; Marsili, Simone; Juan, David; Valencia, Alfonso
2016-12-27
Protein-protein interactions are fundamental for the proper functioning of the cell. As a result, protein interaction surfaces are subject to strong evolutionary constraints. Recent developments have shown that residue coevolution provides accurate predictions of heterodimeric protein interfaces from sequence information. So far these approaches have been limited to the analysis of families of prokaryotic complexes for which large multiple sequence alignments of homologous sequences can be compiled. We explore the hypothesis that coevolution points to structurally conserved contacts at protein-protein interfaces, which can be reliably projected to homologous complexes with distantly related sequences. We introduce a domain-centered protocol to study the interplay between residue coevolution and structural conservation of protein-protein interfaces. We show that sequence-based coevolutionary analysis systematically identifies residue contacts at prokaryotic interfaces that are structurally conserved at the interface of their eukaryotic counterparts. In turn, this allows the prediction of conserved contacts at eukaryotic protein-protein interfaces with high confidence using solely mutational patterns extracted from prokaryotic genomes. Even in the context of high divergence in sequence (the twilight zone), where standard homology modeling of protein complexes is unreliable, our approach provides sequence-based accurate information about specific details of protein interactions at the residue level. Selected examples of the application of prokaryotic coevolutionary analysis to the prediction of eukaryotic interfaces further illustrate the potential of this approach.
Sequence repeats and protein structure
NASA Astrophysics Data System (ADS)
Hoang, Trinh X.; Trovato, Antonio; Seno, Flavio; Banavar, Jayanth R.; Maritan, Amos
2012-11-01
Repeats are frequently found in known protein sequences. The level of sequence conservation in tandem repeats correlates with their propensities to be intrinsically disordered. We employ a coarse-grained model of a protein with a two-letter amino acid alphabet, hydrophobic (H) and polar (P), to examine the sequence-structure relationship in the realm of repeated sequences. A fraction of repeated sequences comprises a distinct class of bad folders, whose folding temperatures are much lower than those of random sequences. Imperfection in sequence repetition improves the folding properties of the bad folders while deteriorating those of the good folders. Our results may explain why nature has utilized repeated sequences for their versatility and especially to design functional proteins that are intrinsically unstructured at physiological temperatures.
Rapid Identification of Sequences for Orphan Enzymes to Power Accurate Protein Annotation
Ojha, Sunil; Watson, Douglas S.; Bomar, Martha G.; Galande, Amit K.; Shearer, Alexander G.
2013-01-01
The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the “back catalog” of enzymology – “orphan enzymes,” those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme “back catalog” is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology’s “back catalog” another powerful tool to drive accurate genome annotation. PMID:24386392
Rapid identification of sequences for orphan enzymes to power accurate protein annotation.
Ramkissoon, Kevin R; Miller, Jennifer K; Ojha, Sunil; Watson, Douglas S; Bomar, Martha G; Galande, Amit K; Shearer, Alexander G
2013-01-01
The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.
Cheng, Yali; Avis, Tyler J; Bolduc, Sébastien; Zhao, Yingyi; Anguenot, Raphaël; Neveu, Bertrand; Labbé, Caroline; Belzile, François; Bélanger, Richard R
2008-12-01
Secretion of recombinant proteins aims to reproduce the correct posttranslational modifications of the expressed protein while simplifying its recovery. In this study, secretion signal sequences from an abundantly secreted 34-kDa protein (P34) from Pseudozyma flocculosa were cloned. The efficiency of these sequences in the secretion of recombinant green fluorescent protein (GFP) was investigated in two Pseudozyma species and compared with other secretion signal sequences, from S. cerevisiae and Pseudozyma spp. The results indicate that various secretion signal sequences were functional and that the P34 signal peptide was the most effective secretion signal sequence in both P. flocculosa and P. antarctica. The cells correctly processed the secretion signal sequences, including P34 signal peptide, and mature GFP was recovered from the culture medium. This is the first report of functional secretion signal sequences in P. flocculosa. These sequences can be used to test the secretion of other recombinant proteins and for studying the secretion pathway in P. flocculosa and P. antarctica.
Modeling repetitive, non‐globular proteins
Basu, Koli; Campbell, Robert L.; Guo, Shuaiqi; Sun, Tianjun
2016-01-01
Abstract While ab initio modeling of protein structures is not routine, certain types of proteins are more straightforward to model than others. Proteins with short repetitive sequences typically exhibit repetitive structures. These repetitive sequences can be more amenable to modeling if some information is known about the predominant secondary structure or other key features of the protein sequence. We have successfully built models of a number of repetitive structures with novel folds using knowledge of the consensus sequence within the sequence repeat and an understanding of the likely secondary structures that these may adopt. Our methods for achieving this success are reviewed here. PMID:26914323
Evaluating the protein coding potential of exonized transposable element sequences
Piriyapongsa, Jittima; Rutledge, Mark T; Patel, Sanil; Borodovsky, Mark; Jordan, I King
2007-01-01
Background Transposable element (TE) sequences, once thought to be merely selfish or parasitic members of the genomic community, have been shown to contribute a wide variety of functional sequences to their host genomes. Analysis of complete genome sequences have turned up numerous cases where TE sequences have been incorporated as exons into mRNAs, and it is widely assumed that such 'exonized' TEs encode protein sequences. However, the extent to which TE-derived sequences actually encode proteins is unknown and a matter of some controversy. We have tried to address this outstanding issue from two perspectives: i-by evaluating ascertainment biases related to the search methods used to uncover TE-derived protein coding sequences (CDS) and ii-through a probabilistic codon-frequency based analysis of the protein coding potential of TE-derived exons. Results We compared the ability of three classes of sequence similarity search methods to detect TE-derived sequences among data sets of experimentally characterized proteins: 1-a profile-based hidden Markov model (HMM) approach, 2-BLAST methods and 3-RepeatMasker. Profile based methods are more sensitive and more selective than the other methods evaluated. However, the application of profile-based search methods to the detection of TE-derived sequences among well-curated experimentally characterized protein data sets did not turn up many more cases than had been previously detected and nowhere near as many cases as recent genome-wide searches have. We observed that the different search methods used were complementary in the sense that they yielded largely non-overlapping sets of hits and differed in their ability to recover known cases of TE-derived CDS. The probabilistic analysis of TE-derived exon sequences indicates that these sequences have low protein coding potential on average. In particular, non-autonomous TEs that do not encode protein sequences, such as Alu elements, are frequently exonized but unlikely to encode protein sequences. Conclusion The exaptation of the numerous TE sequences found in exons as bona fide protein coding sequences may prove to be far less common than has been suggested by the analysis of complete genomes. We hypothesize that many exonized TE sequences actually function as post-transcriptional regulators of gene expression, rather than coding sequences, which may act through a variety of double stranded RNA related regulatory pathways. Indeed, their relatively high copy numbers and similarity to sequences dispersed throughout the genome suggests that exonized TE sequences could serve as master regulators with a wide scope of regulatory influence. Reviewers: This article was reviewed by Itai Yanai, Kateryna D. Makova, Melissa Wilson (nominated by Kateryna D. Makova) and Cedric Feschotte (nominated by John M. Logsdon Jr.). PMID:18036258
Buck, Patrick M.; Kumar, Sandeep; Singh, Satish K.
2013-01-01
The various roles that aggregation prone regions (APRs) are capable of playing in proteins are investigated here via comprehensive analyses of multiple non-redundant datasets containing randomly generated amino acid sequences, monomeric proteins, intrinsically disordered proteins (IDPs) and catalytic residues. Results from this study indicate that the aggregation propensities of monomeric protein sequences have been minimized compared to random sequences with uniform and natural amino acid compositions, as observed by a lower average aggregation propensity and fewer APRs that are shorter in length and more often punctuated by gate-keeper residues. However, evidence for evolutionary selective pressure to disrupt these sequence regions among homologous proteins is inconsistent. APRs are less conserved than average sequence identity among closely related homologues (≥80% sequence identity with a parent) but APRs are more conserved than average sequence identity among homologues that have at least 50% sequence identity with a parent. Structural analyses of APRs indicate that APRs are three times more likely to contain ordered versus disordered residues and that APRs frequently contribute more towards stabilizing proteins than equal length segments from the same protein. Catalytic residues and APRs were also found to be in structural contact significantly more often than expected by random chance. Our findings suggest that proteins have evolved by optimizing their risk of aggregation for cellular environments by both minimizing aggregation prone regions and by conserving those that are important for folding and function. In many cases, these sequence optimizations are insufficient to develop recombinant proteins into commercial products. Rational design strategies aimed at improving protein solubility for biotechnological purposes should carefully evaluate the contributions made by candidate APRs, targeted for disruption, towards protein structure and activity. PMID:24146608
2014-01-01
Background The 2013 BioVis Contest provided an opportunity to evaluate different paradigms for visualizing protein multiple sequence alignments. Such data sets are becoming extremely large and thus taxing current visualization paradigms. Sequence Logos represent consensus sequences but have limitations for protein alignments. As an alternative, ProfileGrids are a new protein sequence alignment visualization paradigm that represents an alignment as a color-coded matrix of the residue frequency occurring at every homologous position in the aligned protein family. Results The JProfileGrid software program was used to analyze the BioVis contest data sets to generate figures for comparison with the Sequence Logo reference images. Conclusions The ProfileGrid representation allows for the clear and effective analysis of protein multiple sequence alignments. This includes both a general overview of the conservation and diversity sequence patterns as well as the interactive ability to query the details of the protein residue distributions in the alignment. The JProfileGrid software is free and available from http://www.ProfileGrid.org. PMID:25237393
Defining functional distance using manifold embeddings of gene ontology annotations
Lerman, Gilad; Shakhnovich, Boris E.
2007-01-01
Although rigorous measures of similarity for sequence and structure are now well established, the problem of defining functional relationships has been particularly daunting. Here, we present several manifold embedding techniques to compute distances between Gene Ontology (GO) functional annotations and consequently estimate functional distances between protein domains. To evaluate accuracy, we correlate the functional distance to the well established measures of sequence, structural, and phylogenetic similarities. Finally, we show that manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space. We show how functional distances place structure–function relationships in biological context resulting in insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized and applied to a wide array of biologically relevant investigations, such as accuracy of annotation transference, the relationship between sequence, structure, and function, or coherence of expression modules. PMID:17595300
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum.
VanBuren, Robert; Bryant, Doug; Edger, Patrick P; Tang, Haibao; Burgess, Diane; Challabathula, Dinakar; Spittle, Kristi; Hall, Richard; Gu, Jenny; Lyons, Eric; Freeling, Michael; Bartels, Dorothea; Ten Hallers, Boudewijn; Hastie, Alex; Michael, Todd P; Mockler, Todd C
2015-11-26
Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE). Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a 'near-complete' draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.
NASA Astrophysics Data System (ADS)
Marchetti, Paul S.; Bhattacharyya, Lokesh; Ellis, Paul D.; Brewer, C. Fred
Solid-state 113Cd NMR spectroscopy of static powder samples of 113Cd-substituted metalloproteins, parvalbumin, concanavalin A, and pea and lentil lectins, was carried out. Cross polarization followed by application of a train of uniformly spaced π pulses was employed to investigate the origin of residual cadmium NMR linewidths observed previously in these proteins. Fourier transformation of the resulting spin-echo train yielded spectra consisting of uniformly spaced lines having linewidths of the order of 1-2 ppm. The observed linewidths were not influenced by temperature as low as -50°C or by extent of protein hydration. Since the echo-train pulse sequence is able to eliminate inhomogeneous but not homogeneous contributions to the linewidths, there is a predominant inhomogeneous contribution to cadmium linewidths in the protein CP/MAS spectra. However, significant changes in spectral intensities were observed with change in temperature and extent of protein hydration. These intensity changes are attributed for parvalbumin and concanavalin A to changes in cross-polarization efficiency with temperature and hydration. For pea and lentil lectins, this effect is attributed to the elimination of static disorder at the pea and lentil S2 metal-ion sites due to sugar binding.
NASA Astrophysics Data System (ADS)
Miao, Xijiang; Mukhopadhyay, Rishi; Valafar, Homayoun
2008-10-01
Advances in NMR instrumentation and pulse sequence design have resulted in easier acquisition of Residual Dipolar Coupling (RDC) data. However, computational and theoretical analysis of this type of data has continued to challenge the international community of investigators because of their complexity and rich information content. Contemporary use of RDC data has required a-priori assignment, which significantly increases the overall cost of structural analysis. This article introduces a novel algorithm that utilizes unassigned RDC data acquired from multiple alignment media ( nD-RDC, n ⩾ 3) for simultaneous extraction of the relative order tensor matrices and reconstruction of the interacting vectors in space. Estimation of the relative order tensors and reconstruction of the interacting vectors can be invaluable in a number of endeavors. An example application has been presented where the reconstructed vectors have been used to quantify the fitness of a template protein structure to the unknown protein structure. This work has other important direct applications such as verification of the novelty of an unknown protein and validation of the accuracy of an available protein structure model in drug design. More importantly, the presented work has the potential to bridge the gap between experimental and computational methods of structure determination.
NASA Technical Reports Server (NTRS)
Gatlin, L. L.
1974-01-01
Concepts of information theory are applied to examine various proteins in terms of their redundancy in natural originators such as animals and plants. The Monte Carlo method is used to derive information parameters for random protein sequences. Real protein sequence parameters are compared with the standard parameters of protein sequences having a specific length. The tendency of a chain to contain some amino acids more frequently than others and the tendency of a chain to contain certain amino acid pairs more frequently than other pairs are used as randomness measures of individual protein sequences. Non-periodic proteins are generally found to have random Shannon redundancies except in cases of constraints due to short chain length and genetic codes. Redundant characteristics of highly periodic proteins are discussed. A degree of periodicity parameter is derived.
The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.
Yooseph, Shibu; Sutton, Granger; Rusch, Douglas B; Halpern, Aaron L; Williamson, Shannon J; Remington, Karin; Eisen, Jonathan A; Heidelberg, Karla B; Manning, Gerard; Li, Weizhong; Jaroszewski, Lukasz; Cieplak, Piotr; Miller, Christopher S; Li, Huiying; Mashiyama, Susan T; Joachimiak, Marcin P; van Belle, Christopher; Chandonia, John-Marc; Soergel, David A; Zhai, Yufeng; Natarajan, Kannan; Lee, Shaun; Raphael, Benjamin J; Bafna, Vineet; Friedman, Robert; Brenner, Steven E; Godzik, Adam; Eisenberg, David; Dixon, Jack E; Taylor, Susan S; Strausberg, Robert L; Frazier, Marvin; Venter, J Craig
2007-03-01
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
Ohki, Taku; Mizuno, Nobuhiro; Shibata, Naoki; Takeo, Masahiro; Negoro, Seiji; Higuchi, Yoshiki
2005-01-01
To investigate the structure–function relationship between 6-aminohexanoate-dimer hydrolase (EII) from Arthrobacter sp. and a cryptic protein (EII′) which shows 88% sequence identity to EII, a hybrid protein (named Hyb-24) of EII and EII′ was overexpressed, purified and crystallized using the sitting-drop vapour-diffusion method with ammonium sulfate as a precipitant in MES buffer pH 6.5. The crystal belongs to space group P3121 or P3221, with unit-cell parameters a = b = 96.37, c = 113.09 Å. Diffraction data were collected from native and methylmercuric chloride derivative crystals to resolutions of 1.75 and 1.80 Å, respectively. PMID:16511198
Analysis of delay reducing and fuel saving sequencing and spacing algorithms for arrival traffic
NASA Technical Reports Server (NTRS)
Neuman, Frank; Erzberger, Heinz
1991-01-01
The air traffic control subsystem that performs sequencing and spacing is discussed. The function of the sequencing and spacing algorithms is to automatically plan the most efficient landing order and to assign optimally spaced landing times to all arrivals. Several algorithms are described and their statistical performance is examined. Sequencing brings order to an arrival sequence for aircraft. First-come-first-served sequencing (FCFS) establishes a fair order, based on estimated times of arrival, and determines proper separations. Because of the randomness of the arriving traffic, gaps will remain in the sequence of aircraft. Delays are reduced by time-advancing the leading aircraft of each group while still preserving the FCFS order. Tightly spaced groups of aircraft remain with a mix of heavy and large aircraft. Spacing requirements differ for different types of aircraft trailing each other. Traffic is reordered slightly to take advantage of this spacing criterion, thus shortening the groups and reducing average delays. For heavy traffic, delays for different traffic samples vary widely, even when the same set of statistical parameters is used to produce each sample. This report supersedes NASA TM-102795 on the same subject. It includes a new method of time-advance as well as an efficient method of sequencing and spacing for two dependent runways.
Can natural proteins designed with 'inverted' peptide sequences adopt native-like protein folds?
Sridhar, Settu; Guruprasad, Kunchur
2014-01-01
We have carried out a systematic computational analysis on a representative dataset of proteins of known three-dimensional structure, in order to evaluate whether it would possible to 'swap' certain short peptide sequences in naturally occurring proteins with their corresponding 'inverted' peptides and generate 'artificial' proteins that are predicted to retain native-like protein fold. The analysis of 3,967 representative proteins from the Protein Data Bank revealed 102,677 unique identical inverted peptide sequence pairs that vary in sequence length between 5-12 and 18 amino acid residues. Our analysis illustrates with examples that such 'artificial' proteins may be generated by identifying peptides with 'similar structural environment' and by using comparative protein modeling and validation studies. Our analysis suggests that natural proteins may be tolerant to accommodating such peptides.
Chang, Elizabeth; Pourmal, Sergei; Zhou, Chun; Kumar, Rupesh; Teplova, Marianna; Pavletich, Nikola P; Marians, Kenneth J; Erdjument-Bromage, Hediye
2016-07-01
In recent history, alternative approaches to Edman sequencing have been investigated, and to this end, the Association of Biomolecular Resource Facilities (ABRF) Protein Sequencing Research Group (PSRG) initiated studies in 2014 and 2015, looking into bottom-up and top-down N-terminal (Nt) dimethyl derivatization of standard quantities of intact proteins with the aim to determine Nt sequence information. We have expanded this initiative and used low picomole amounts of myoglobin to determine the efficiency of Nt-dimethylation. Application of this approach on protein domains, generated by limited proteolysis of overexpressed proteins, confirms that it is a universal labeling technique and is very sensitive when compared with Edman sequencing. Finally, we compared Edman sequencing and Nt-dimethylation of the same polypeptide fragments; results confirm that there is agreement in the identity of the Nt amino acid sequence between these 2 methods.
De novo protein sequencing by combining top-down and bottom-up tandem mass spectra.
Liu, Xiaowen; Dekker, Lennard J M; Wu, Si; Vanduijn, Martijn M; Luider, Theo M; Tolić, Nikola; Kou, Qiang; Dvorkin, Mikhail; Alexandrova, Sonya; Vyatkina, Kira; Paša-Tolić, Ljiljana; Pevzner, Pavel A
2014-07-03
There are two approaches for de novo protein sequencing: Edman degradation and mass spectrometry (MS). Existing MS-based methods characterize a novel protein by assembling tandem mass spectra of overlapping peptides generated from multiple proteolytic digestions of the protein. Because each tandem mass spectrum covers only a short peptide of the target protein, the key to high coverage protein sequencing is to find spectral pairs from overlapping peptides in order to assemble tandem mass spectra to long ones. However, overlapping regions of peptides may be too short to be confidently identified. High-resolution mass spectrometers have become accessible to many laboratories. These mass spectrometers are capable of analyzing molecules of large mass values, boosting the development of top-down MS. Top-down tandem mass spectra cover whole proteins. However, top-down tandem mass spectra, even combined, rarely provide full ion fragmentation coverage of a protein. We propose an algorithm, TBNovo, for de novo protein sequencing by combining top-down and bottom-up MS. In TBNovo, a top-down tandem mass spectrum is utilized as a scaffold, and bottom-up tandem mass spectra are aligned to the scaffold to increase sequence coverage. Experiments on data sets of two proteins showed that TBNovo achieved high sequence coverage and high sequence accuracy.
A stochastic context free grammar based framework for analysis of protein sequences
Dyrka, Witold; Nebel, Jean-Christophe
2009-01-01
Background In the last decade, there have been many applications of formal language theory in bioinformatics such as RNA structure prediction and detection of patterns in DNA. However, in the field of proteomics, the size of the protein alphabet and the complexity of relationship between amino acids have mainly limited the application of formal language theory to the production of grammars whose expressive power is not higher than stochastic regular grammars. However, these grammars, like other state of the art methods, cannot cover any higher-order dependencies such as nested and crossing relationships that are common in proteins. In order to overcome some of these limitations, we propose a Stochastic Context Free Grammar based framework for the analysis of protein sequences where grammars are induced using a genetic algorithm. Results This framework was implemented in a system aiming at the production of binding site descriptors. These descriptors not only allow detection of protein regions that are involved in these sites, but also provide insight in their structure. Grammars were induced using quantitative properties of amino acids to deal with the size of the protein alphabet. Moreover, we imposed some structural constraints on grammars to reduce the extent of the rule search space. Finally, grammars based on different properties were combined to convey as much information as possible. Evaluation was performed on sites of various sizes and complexity described either by PROSITE patterns, domain profiles or a set of patterns. Results show the produced binding site descriptors are human-readable and, hence, highlight biologically meaningful features. Moreover, they achieve good accuracy in both annotation and detection. In addition, findings suggest that, unlike current state-of-the-art methods, our system may be particularly suited to deal with patterns shared by non-homologous proteins. Conclusion A new Stochastic Context Free Grammar based framework has been introduced allowing the production of binding site descriptors for analysis of protein sequences. Experiments have shown that not only is this new approach valid, but produces human-readable descriptors for binding sites which have been beyond the capability of current machine learning techniques. PMID:19814800
Ho, Michelle L; Adler, Benjamin A; Torre, Michael L; Silberg, Jonathan J; Suh, Junghae
2013-12-20
Adeno-associated virus (AAV) recombination can result in chimeric capsid protein subunits whose ability to assemble into an oligomeric capsid, package a genome, and transduce cells depends on the inheritance of sequence from different AAV parents. To develop quantitative design principles for guiding site-directed recombination of AAV capsids, we have examined how capsid structural perturbations predicted by the SCHEMA algorithm correlate with experimental measurements of disruption in seventeen chimeric capsid proteins. In our small chimera population, created by recombining AAV serotypes 2 and 4, we found that protection of viral genomes and cellular transduction were inversely related to calculated disruption of the capsid structure. Interestingly, however, we did not observe a correlation between genome packaging and calculated structural disruption; a majority of the chimeric capsid proteins formed at least partially assembled capsids and more than half packaged genomes, including those with the highest SCHEMA disruption. These results suggest that the sequence space accessed by recombination of divergent AAV serotypes is rich in capsid chimeras that assemble into 60-mer capsids and package viral genomes. Overall, the SCHEMA algorithm may be useful for delineating quantitative design principles to guide the creation of libraries enriched in genome-protecting virus nanoparticles that can effectively transduce cells. Such improvements to the virus design process may help advance not only gene therapy applications but also other bionanotechnologies dependent upon the development of viruses with new sequences and functions.
Ho, Michelle L.; Adler, Benjamin A.; Torre, Michael L.; Silberg, Jonathan J.; Suh, Junghae
2013-01-01
Adeno-associated virus (AAV) recombination can result in chimeric capsid protein subunits whose ability to assemble into an oligomeric capsid, package a genome, and transduce cells depends on the inheritance of sequence from different AAV parents. To develop quantitative design principles for guiding site-directed recombination of AAV capsids, we have examined how capsid structural perturbations predicted by the SCHEMA algorithm correlate with experimental measurements of disruption in seventeen chimeric capsid proteins. In our small chimera population, created by recombining AAV serotypes 2 and 4, we found that protection of viral genomes and cellular transduction were inversely related to calculated disruption of the capsid structure. Interestingly, however, we did not observe a correlation between genome packaging and calculated structural disruption; a majority of the chimeric capsid proteins formed at least partially assembled capsids and more than half packaged genomes, including those with the highest SCHEMA disruption. These results suggest that the sequence space accessed by recombination of divergent AAV serotypes is rich in capsid chimeras that assemble into 60-mer capsids and package viral genomes. Overall, the SCHEMA algorithm may be useful for delineating quantitative design principles to guide the creation of libraries enriched in genome-protecting virus nanoparticles that can effectively transduce cells. Such improvements to the virus design process may help advance not only gene therapy applications, but also other bionanotechnologies dependent upon the development of viruses with new sequences and functions. PMID:23899192
Odronitz, Florian; Kollmar, Martin
2006-01-01
Background Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. Up to date there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Description Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database driven online working environment for the analysis of manually annotated protein sequences and their relationship. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows to conveniently browse the database and to compile tabular and graphical summaries of its content. Conclusion We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that is generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein. PMID:17134497
Adaptability of Protein Structures to Enable Functional Interactions and Evolutionary Implications
Haliloglu, Turkan; Bahar, Ivet
2015-01-01
Several studies in recent years have drawn attention to the ability of proteins to adapt to intermolecular interactions by conformational changes along structure-encoded collective modes of motions. These so-called soft modes, primarily driven by entropic effects, facilitate, if not enable, functional interactions. They represent excursions on the conformational space along principal low-ascent directions/paths away from the original free energy minimum, and they are accessible to the protein even prior to protein-protein/ligand interactions. An emerging concept from these studies is the evolution of structures or modular domains to favor such modes of motion that will be recruited or integrated for enabling functional interactions. Structural dynamics, including the allosteric switches in conformation that are often stabilized upon formation of complexes and multimeric assemblies, emerge as key properties that are evolutionarily maintained to accomplish biological activities, consistent with the paradigm sequence → structure → dynamics → function where ‘dynamics’ bridges structure and function. PMID:26254902
A minimalist model protein with multiple folding funnels
Locker, C. Rebecca; Hernandez, Rigoberto
2001-01-01
Kinetic and structural studies of wild-type proteins such as prions and amyloidogenic proteins provide suggestive evidence that proteins may adopt multiple long-lived states in addition to the native state. All of these states differ structurally because they lie far apart in configuration space, but their stability is not necessarily caused by cooperative (nucleation) effects. In this study, a minimalist model protein is designed to exhibit multiple long-lived states to explore the dynamics of the corresponding wild-type proteins. The minimalist protein is modeled as a 27-monomer sequence confined to a cubic lattice with three different monomer types. An order parameter—the winding index—is introduced to characterize the extent of folding. The winding index has several advantages over other commonly used order parameters like the number of native contacts. It can distinguish between enantiomers, its calculation requires less computational time than the number of native contacts, and reduced-dimensional landscapes can be developed when the native state structure is not known a priori. The results for the designed model protein prove by existence that the rugged energy landscape picture of protein folding can be generalized to include protein “misfolding” into long-lived states. PMID:11470921
Arneth, Borros
2012-10-01
As possible mechanisms to explain the emergence of autoimmune diseases, the current author has suggested in earlier papers two new pathways: the "protein localization hypothesis" and the "protein traffic hypothesis". The "protein localization hypothesis" states that an autoimmune disease develops if a protein accumulates in a previously unoccupied compartment, that did not previously contain that protein. Similarly, the "protein traffic hypothesis" states that a sudden error within the transport of a certain protein leads to the emergence of an autoimmune disease. The current article discusses the usefulness of the different commercially available transgenic murine models of diabetes mellitus type 1 to confirm the aforementioned hypotheses. This discussion shows that several transgenic murine models of diabetes mellitus type 1 are in-line and confirm the aforementioned hypotheses. Furthermore, these hypotheses are additionally inline with the occurrence of several newly discovered protein sequences, the so-called trepitope sequences. These sequences modulate the immune response to certain proteins. The current study analyzed to what extent the hypotheses are supported by the occurrence of these new sequences. Thereby the occurrence of the trepitope sequences provides additional evidence supporting the aforementioned hypotheses. Both the "protein localization hypothesis" and the "protein traffic hypothesis" have the potential to lead to new causal therapy concepts. The "protein localization hypothesis" and the "protein traffic hypothesis" provide conceptional explanations for the diabetes mouse models as well as for the newly discovered trepitope sequences. Copyright © 2012 Elsevier Ltd. All rights reserved.
Rational Protein Engineering Guided by Deep Mutational Scanning
Shin, HyeonSeok; Cho, Byung-Kwan
2015-01-01
Sequence–function relationship in a protein is commonly determined by the three-dimensional protein structure followed by various biochemical experiments. However, with the explosive increase in the number of genome sequences, facilitated by recent advances in sequencing technology, the gap between protein sequences available and three-dimensional structures is rapidly widening. A recently developed method termed deep mutational scanning explores the functional phenotype of thousands of mutants via massive sequencing. Coupled with a highly efficient screening system, this approach assesses the phenotypic changes made by the substitution of each amino acid sequence that constitutes a protein. Such an informational resource provides the functional role of each amino acid sequence, thereby providing sufficient rationale for selecting target residues for protein engineering. Here, we discuss the current applications of deep mutational scanning and consider experimental design. PMID:26404267
Computational Tools and Algorithms for Designing Customized Synthetic Genes
Gould, Nathan; Hendy, Oliver; Papamichail, Dimitris
2014-01-01
Advances in DNA synthesis have enabled the construction of artificial genes, gene circuits, and genomes of bacterial scale. Freedom in de novo design of synthetic constructs provides significant power in studying the impact of mutations in sequence features, and verifying hypotheses on the functional information that is encoded in nucleic and amino acids. To aid this goal, a large number of software tools of variable sophistication have been implemented, enabling the design of synthetic genes for sequence optimization based on rationally defined properties. The first generation of tools dealt predominantly with singular objectives such as codon usage optimization and unique restriction site incorporation. Recent years have seen the emergence of sequence design tools that aim to evolve sequences toward combinations of objectives. The design of optimal protein-coding sequences adhering to multiple objectives is computationally hard, and most tools rely on heuristics to sample the vast sequence design space. In this review, we study some of the algorithmic issues behind gene optimization and the approaches that different tools have adopted to redesign genes and optimize desired coding features. We utilize test cases to demonstrate the efficiency of each approach, as well as identify their strengths and limitations. PMID:25340050
Analysis of genomic sequences by Chaos Game Representation.
Almeida, J S; Carriço, J A; Maretzek, A; Noble, P A; Fletcher, M
2001-05-01
Chaos Game Representation (CGR) is an iterative mapping technique that processes sequences of units, such as nucleotides in a DNA sequence or amino acids in a protein, in order to find the coordinates for their position in a continuous space. This distribution of positions has two properties: it is unique, and the source sequence can be recovered from the coordinates such that distance between positions measures similarity between the corresponding sequences. The possibility of using the latter property to identify succession schemes have been entirely overlooked in previous studies which raises the possibility that CGR may be upgraded from a mere representation technique to a sequence modeling tool. The distribution of positions in the CGR plane were shown to be a generalization of Markov chain probability tables that accommodates non-integer orders. Therefore, Markov models are particular cases of CGR models rather than the reverse, as currently accepted. In addition, the CGR generalization has both practical (computational efficiency) and fundamental (scale independence) advantages. These results are illustrated by using Escherichia coli K-12 as a test data-set, in particular, the genes thrA, thrB and thrC of the threonine operon.
Winnowing sequences from a database search.
Berman, P; Zhang, Z; Wolf, Y I; Koonin, E V; Miller, W
2000-01-01
In database searches for sequence similarity, matches to a distinct sequence region (e.g., protein domain) are frequently obscured by numerous matches to another region of the same sequence. In order to cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of "tolerable redundancy." An algorithm is developed to solve the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.
Chameleon sequences in neurodegenerative diseases.
Bahramali, Golnaz; Goliaei, Bahram; Minuchehr, Zarrin; Salari, Ali
2016-03-25
Chameleon sequences can adopt either alpha helix sheet or a coil conformation. Defining chameleon sequences in PDB (Protein Data Bank) may yield to an insight on defining peptides and proteins responsible in neurodegeneration. In this research, we benefitted from the large PDB and performed a sequence analysis on Chameleons, where we developed an algorithm to extract peptide segments with identical sequences, but different structures. In order to find new chameleon sequences, we extracted a set of 8315 non-redundant protein sequences from the PDB with an identity less than 25%. Our data was classified to "helix to strand (HE)", "helix to coil (HC)" and "strand to coil (CE)" alterations. We also analyzed the occurrence of singlet and doublet amino acids and the solvent accessibility in the chameleon sequences; we then sorted out the proteins with the most number of chameleon sequences and named them Chameleon Flexible Proteins (CFPs) in our dataset. Our data revealed that Gly, Val, Ile, Tyr and Phe, are the major amino acids in Chameleons. We also found that there are proteins such as Insulin Degrading Enzyme IDE and GTP-binding nuclear protein Ran (RAN) with the most number of chameleons (640 and 405 respectively). These proteins have known roles in neurodegenerative diseases. Therefore it can be inferred that other CFP's can serve as key proteins in neurodegeneration, and a study on them can shed light on curing and preventing neurodegenerative diseases. Copyright © 2016 Elsevier Inc. All rights reserved.
Chameleon sequences in neurodegenerative diseases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bahramali, Golnaz; Goliaei, Bahram, E-mail: goliaei@ut.ac.ir; Minuchehr, Zarrin, E-mail: minuchehr@nigeb.ac.ir
2016-03-25
Chameleon sequences can adopt either alpha helix sheet or a coil conformation. Defining chameleon sequences in PDB (Protein Data Bank) may yield to an insight on defining peptides and proteins responsible in neurodegeneration. In this research, we benefitted from the large PDB and performed a sequence analysis on Chameleons, where we developed an algorithm to extract peptide segments with identical sequences, but different structures. In order to find new chameleon sequences, we extracted a set of 8315 non-redundant protein sequences from the PDB with an identity less than 25%. Our data was classified to “helix to strand (HE)”, “helix tomore » coil (HC)” and “strand to coil (CE)” alterations. We also analyzed the occurrence of singlet and doublet amino acids and the solvent accessibility in the chameleon sequences; we then sorted out the proteins with the most number of chameleon sequences and named them Chameleon Flexible Proteins (CFPs) in our dataset. Our data revealed that Gly, Val, Ile, Tyr and Phe, are the major amino acids in Chameleons. We also found that there are proteins such as Insulin Degrading Enzyme IDE and GTP-binding nuclear protein Ran (RAN) with the most number of chameleons (640 and 405 respectively). These proteins have known roles in neurodegenerative diseases. Therefore it can be inferred that other CFP's can serve as key proteins in neurodegeneration, and a study on them can shed light on curing and preventing neurodegenerative diseases.« less
Comparative analysis of ribosomal protein L5 sequences from bacteria of the genus Thermus.
Jahn, O; Hartmann, R K; Boeckh, T; Erdmann, V A
1991-06-01
The genes for the ribosomal 5S rRNA binding protein L5 have been cloned from three extremely thermophilic eubacteria, Thermus flavus, Thermus thermophilus HB8 and Thermus aquaticus (Jahn et al, submitted). Genes for protein L5 from the three Thermus strains display 95% G/C in third positions of codons. Amino acid sequences deduced from the DNA sequence were shown to be identical for T flavus and T thermophilus, although the corresponding DNA sequences differed by two T to C transitions in the T thermophilus gene. Protein L5 sequences from T flavus and T thermophilus are 95% homologous to L5 from T aquaticus and 56.5% homologous to the corresponding E coli sequence. The lowest degrees of homology were found between the T flavus/T thermophilus L5 proteins and those of yeast L16 (27.5%), Halobacterium marismortui (34.0%) and Methanococcus vannielii (36.6%). From sequence comparison it becomes clear that thermostability of Thermus L5 proteins is achieved by an increase in hydrophobic interactions and/or by restriction of steric flexibility due to the introduction of amino acids with branched aliphatic side chains such as leucine. Alignment of the nine protein sequences equivalent to Thermus L5 proteins led to identification of a conserved internal segment, rich in acidic amino acids, which shows homology to subsequences of E coli L18 and L25. The occurrence of conserved sequence elements in 5S rRNA binding proteins and ribosomal proteins in general is discussed in terms of evolution and function.
Possenti, Andrea; Vendruscolo, Michele; Camilloni, Carlo; Tiana, Guido
2018-05-23
Proteins employ the information stored in the genetic code and translated into their sequences to carry out well-defined functions in the cellular environment. The possibility to encode for such functions is controlled by the balance between the amount of information supplied by the sequence and that left after that the protein has folded into its structure. We study the amount of information necessary to specify the protein structure, providing an estimate that keeps into account the thermodynamic properties of protein folding. We thus show that the information remaining in the protein sequence after encoding for its structure (the 'information gap') is very close to what needed to encode for its function and interactions. Then, by predicting the information gap directly from the protein sequence, we show that it may be possible to use these insights from information theory to discriminate between ordered and disordered proteins, to identify unknown functions, and to optimize artificially-designed protein sequences. This article is protected by copyright. All rights reserved. © 2018 Wiley Periodicals, Inc.
FastRNABindR: Fast and Accurate Prediction of Protein-RNA Interface Residues.
El-Manzalawy, Yasser; Abbas, Mostafa; Malluhi, Qutaibah; Honavar, Vasant
2016-01-01
A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational efforts needed for generating PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that random sampled databases produce better PSSM profiles (in terms of the number of hits used to generate the profile and the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data as well as the accuracy of the machine learning classifier trained using these profiles). Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission. Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence based predictors of protein-protein and protein-DNA interfaces.
Nagano, Yukio; Furuhashi, Hirofumi; Inaba, Takehito; Sasaki, Yukiko
2001-01-01
Complementary DNA encoding a DNA-binding protein, designated PLATZ1 (plant AT-rich sequence- and zinc-binding protein 1), was isolated from peas. The amino acid sequence of the protein is similar to those of other uncharacterized proteins predicted from the genome sequences of higher plants. However, no paralogous sequences have been found outside the plant kingdom. Multiple alignments among these paralogous proteins show that several cysteine and histidine residues are invariant, suggesting that these proteins are a novel class of zinc-dependent DNA-binding proteins with two distantly located regions, C-x2-H-x11-C-x2-C-x(4–5)-C-x2-C-x(3–7)-H-x2-H and C-x2-C-x(10–11)-C-x3-C. In an electrophoretic mobility shift assay, the zinc chelator 1,10-o-phenanthroline inhibited DNA binding, and two distant zinc-binding regions were required for DNA binding. A protein blot with 65ZnCl2 showed that both regions are required for zinc-binding activity. The PLATZ1 protein non-specifically binds to A/T-rich sequences, including the upstream region of the pea GTPase pra2 and plastocyanin petE genes. Expression of the PLATZ1 repressed those of the reporter constructs containing the coding sequence of luciferase gene driven by the cauliflower mosaic virus (CaMV) 35S90 promoter fused to the tandem repeat of the A/T-rich sequences. These results indicate that PLATZ1 is a novel class of plant-specific zinc-dependent DNA-binding protein responsible for A/T-rich sequence-mediated transcriptional repression. PMID:11600698
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gourinath, S., E-mail: sgourinath@mail.jnu.ac.in; Padhan, Narendra; Alam, Neelima
2005-04-01
Calcium-binding protein-2 (EhCaBP2) crystals were grown using MPD as a precipitant. EhCaBP2 also crystallized in complex with strontium (replacing calcium) at similar conditions. Preliminary data for EhCaBP2 crystals in complex with an IQ motif are also reported. Calcium plays a pivotal role in the pathogenesis of amoebiasis, a major disease caused by Entamoeba histolytica. Two domains with four canonical EF-hand-containing calcium-binding proteins (CaBPs) have been identified from E. histolytica. Even though they have very high sequence similarity, these bind to different target proteins in a Ca{sup 2+}-dependent manner, leading to different functional pathways. Calcium-binding protein-2 (EhCaBP2) crystals were grown usingmore » MPD as a precipitant. The crystals belong to space group P2{sub 1}, with unit-cell parameters a = 111.74, b = 68.83, c = 113.25 Å, β = 116.7°. EhCaBP2 also crystallized in complex with strontium (replacing calcium) at similar conditions. The crystals belong to space group P2{sub 1}, with unit-cell parameters a = 69.18, b = 112.03, c = 93.42 Å, β = 92.8°. Preliminary data for EhCaBP2 crystals in complex with an IQ motif are also reported. This complex was crystallized with MPD and ethanol as precipitating agents. These crystals belong to space group P2{sub 1}, with unit-cell parameters a = 60.5, b = 69.86, c = 86.5 Å, β = 97.9°.« less
A space-efficient algorithm for local similarities.
Huang, X Q; Hardison, R C; Miller, W
1990-10-01
Existing dynamic-programming algorithms for identifying similar regions of two sequences require time and space proportional to the product of the sequence lengths. Often this space requirement is more limiting than the time requirement. We describe a dynamic-programming local-similarity algorithm that needs only space proportional to the sum of the sequence lengths. The method can also find repeats within a single long sequence. To illustrate the algorithm's potential, we discuss comparison of a 73,360 nucleotide sequence containing the human beta-like globin gene cluster and a corresponding 44,594 nucleotide sequence for rabbit, a problem well beyond the capabilities of other dynamic-programming software.
Sevy, Alexander M.; Jacobs, Tim M.; Crowe, James E.; Meiler, Jens
2015-01-01
Computational protein design has found great success in engineering proteins for thermodynamic stability, binding specificity, or enzymatic activity in a ‘single state’ design (SSD) paradigm. Multi-specificity design (MSD), on the other hand, involves considering the stability of multiple protein states simultaneously. We have developed a novel MSD algorithm, which we refer to as REstrained CONvergence in multi-specificity design (RECON). The algorithm allows each state to adopt its own sequence throughout the design process rather than enforcing a single sequence on all states. Convergence to a single sequence is encouraged through an incrementally increasing convergence restraint for corresponding positions. Compared to MSD algorithms that enforce (constrain) an identical sequence on all states the energy landscape is simplified, which accelerates the search drastically. As a result, RECON can readily be used in simulations with a flexible protein backbone. We have benchmarked RECON on two design tasks. First, we designed antibodies derived from a common germline gene against their diverse targets to assess recovery of the germline, polyspecific sequence. Second, we design “promiscuous”, polyspecific proteins against all binding partners and measure recovery of the native sequence. We show that RECON is able to efficiently recover native-like, biologically relevant sequences in this diverse set of protein complexes. PMID:26147100
Iwasaki, H; Shiba, T; Makino, K; Nakata, A; Shinagawa, H
1989-01-01
The ruvA and ruvB genes of Escherichia coli constitute an operon which belongs to the SOS regulon. Genetic evidence suggests that the products of the ruv operon are involved in DNA repair and recombination. To begin biochemical characterization of these proteins, we developed a plasmid system that overproduced RuvB protein to 20% of total cell protein. Starting from the overproducing system, we purified RuvB protein. The purified RuvB protein behaved like a monomer in gel filtration chromatography and had an apparent relative molecular mass of 38 kilodaltons in sodium dodecyl sulfate-polyacrylamide gel electrophoresis, which agrees with the value predicted from the DNA sequence. The amino acid sequence of the amino-terminal region of the purified protein was analyzed, and the sequence agreed with the one deduced from the DNA sequence. Since the deduced sequence of RuvB protein contained the consensus sequence for ATP-binding proteins, we examined the ATP-binding and ATPase activities of the purified RuvB protein. RuvB protein had a stronger affinity to ADP than to ATP and weak ATPase activity. The results suggest that the weak ATPase activity of RuvB protein is at least partly due to end product inhibition by ADP. Images PMID:2529252
Duan, Zhigui; Cao, Rui; Jiang, Liping; Liang, Songping
2013-01-14
In past years, spider venoms have attracted increasing attention due to their extraordinary chemical and pharmacological diversity. The recently popularized proteomic method highly improved our ability to analyze the proteins in the venom. However, the lack of information about isolated venom proteins sequences dramatically limits the ability to confidently identify venom proteins. In the present paper, the venom from Araneus ventricosus was analyzed using two complementary approaches: 2-DE/Shotgun-LC-MS/MS coupled to MASCOT search and 2-DE/Shotgun-LC-MS/MS coupled to manual de novo sequencing followed by local venom protein database (LVPD) search. The LVPD was constructed with toxin-like protein sequences obtained from the analysis of cDNA library from A. ventricosus venom glands. Our results indicate that a total of 130 toxin-like protein sequences were unambiguously identified by manual de novo sequencing coupled to LVPD search, accounting for 86.67% of all toxin-like proteins in LVPD. Thus manual de novo sequencing coupled to LVPD search was proved an extremely effective approach for the analysis of venom proteins. In addition, the approach displays impeccable advantage in validating mutant positions of isoforms from the same toxin-like family. Intriguingly, methyl esterifcation of glutamic acid was discovered for the first time in animal venom proteins by manual de novo sequencing. Crown Copyright © 2012. Published by Elsevier B.V. All rights reserved.
Hazes, Bart
2014-02-28
Protein-coding DNA sequences and their corresponding amino acid sequences are routinely used to study relationships between sequence, structure, function, and evolution. The rapidly growing size of sequence databases increases the power of such comparative analyses but it makes it more challenging to prepare high quality sequence data sets with control over redundancy, quality, completeness, formatting, and labeling. Software tools for some individual steps in this process exist but manual intervention remains a common and time consuming necessity. CDSbank is a database that stores both the protein-coding DNA sequence (CDS) and amino acid sequence for each protein annotated in Genbank. CDSbank also stores Genbank feature annotation, a flag to indicate incomplete 5' and 3' ends, full taxonomic data, and a heuristic to rank the scientific interest of each species. This rich information allows fully automated data set preparation with a level of sophistication that aims to meet or exceed manual processing. Defaults ensure ease of use for typical scenarios while allowing great flexibility when needed. Access is via a free web server at http://hazeslab.med.ualberta.ca/CDSbank/. CDSbank presents a user-friendly web server to download, filter, format, and name large sequence data sets. Common usage scenarios can be accessed via pre-programmed default choices, while optional sections give full control over the processing pipeline. Particular strengths are: extract protein-coding DNA sequences just as easily as amino acid sequences, full access to taxonomy for labeling and filtering, awareness of incomplete sequences, and the ability to take one protein sequence and extract all synonymous CDS or identical protein sequences in other species. Finally, CDSbank can also create labeled property files to, for instance, annotate or re-label phylogenetic trees.
Assigning protein functions by comparative genome analysis protein phylogenetic profiles
Pellegrini, Matteo; Marcotte, Edward M.; Thompson, Michael J.; Eisenberg, David; Grothe, Robert; Yeates, Todd O.
2003-05-13
A computational method system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B. Another method compares the genomic sequence of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method a combination of the above two methods is used to predict functional links.
PASS2: an automated database of protein alignments organised as structural superfamilies.
Bhaduri, Anirban; Pugalenthi, Ganesan; Sowdhamini, Ramanathan
2004-04-02
The functional selection and three-dimensional structural constraints of proteins in nature often relates to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins, provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated, structural alignments for evolutionary related, sequentially distant proteins. An automated and updated version of PASS2 is, in direct correspondence with SCOP 1.63, consisting of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalencies have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden markov models of superfamilies based on the alignments and useful links to other databases. Probabilistic models and sensitive position specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members both based on sequence and structural dissimilarities. Clustering of members allows us to understand diversification of the family members. The search engine has been improved for simpler browsing of the database. The database resolves alignments among the structural domains consisting of evolutionarily diverged set of sequences. Availability of reliable sequence alignments of distantly related proteins despite poor sequence identity and single-member superfamilies permit better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at http://www.ncbs.res.in/~faculty/mini/campass/pass2.html
Tan, Yen Hock; Huang, He; Kihara, Daisuke
2006-08-15
Aligning distantly related protein sequences is a long-standing problem in bioinformatics, and a key for successful protein structure prediction. Its importance is increasing recently in the context of structural genomics projects because more and more experimentally solved structures are available as templates for protein structure modeling. Toward this end, recent structure prediction methods employ profile-profile alignments, and various ways of aligning two profiles have been developed. More fundamentally, a better amino acid similarity matrix can improve a profile itself; thereby resulting in more accurate profile-profile alignments. Here we have developed novel amino acid similarity matrices from knowledge-based amino acid contact potentials. Contact potentials are used because the contact propensity to the other amino acids would be one of the most conserved features of each position of a protein structure. The derived amino acid similarity matrices are tested on benchmark alignments at three different levels, namely, the family, the superfamily, and the fold level. Compared to BLOSUM45 and the other existing matrices, the contact potential-based matrices perform comparably in the family level alignments, but clearly outperform in the fold level alignments. The contact potential-based matrices perform even better when suboptimal alignments are considered. Comparing the matrices themselves with each other revealed that the contact potential-based matrices are very different from BLOSUM45 and the other matrices, indicating that they are located in a different basin in the amino acid similarity matrix space.
Design and application of a data-independent precursor and product ion repository.
Thalassinos, Konstantinos; Vissers, Johannes P C; Tenzer, Stefan; Levin, Yishai; Thompson, J Will; Daniel, David; Mann, Darrin; DeLong, Mark R; Moseley, M Arthur; America, Antoine H; Ottens, Andrew K; Cavey, Greg S; Efstathiou, Georgios; Scrivens, James H; Langridge, James I; Geromanos, Scott J
2012-10-01
The functional design and application of a data-independent LC-MS precursor and product ion repository for protein identification, quantification, and validation is conceptually described. The ion repository was constructed from the sequence search results of a broad range of discovery experiments investigating various tissue types of two closely related mammalian species. The relative high degree of similarity in protein complement, ion detection, and peptide and protein identification allows for the analysis of normalized precursor and product ion intensity values, as well as standardized retention times, creating a multidimensional/orthogonal queryable, qualitative, and quantitative space. Peptide ion map selection for identification and quantification is primarily based on replication and limited variation. The information is stored in a relational database and is used to create peptide- and protein-specific fragment ion maps that can be queried in a targeted fashion against the raw or time aligned ion detections. These queries can be conducted either individually or as groups, where the latter affords pathway and molecular machinery analysis of the protein complement. The presented results also suggest that peptide ionization and fragmentation efficiencies are highly conserved between experiments and practically independent of the analyzed biological sample when using similar instrumentation. Moreover, the data illustrate only minor variation in ionization efficiency with amino acid sequence substitutions occurring between species. Finally, the data and the presented results illustrate how LC-MS performance metrics can be extracted and utilized to ensure optimal performance of the employed analytical workflows.
Prediction of Protein Structure by Template-Based Modeling Combined with the UNRES Force Field.
Krupa, Paweł; Mozolewska, Magdalena A; Joo, Keehyoung; Lee, Jooyoung; Czaplewski, Cezary; Liwo, Adam
2015-06-22
A new approach to the prediction of protein structures that uses distance and backbone virtual-bond dihedral angle restraints derived from template-based models and simulations with the united residue (UNRES) force field is proposed. The approach combines the accuracy and reliability of template-based methods for the segments of the target sequence with high similarity to those having known structures with the ability of UNRES to pack the domains correctly. Multiplexed replica-exchange molecular dynamics with restraints derived from template-based models of a given target, in which each restraint is weighted according to the accuracy of the prediction of the corresponding section of the molecule, is used to search the conformational space, and the weighted histogram analysis method and cluster analysis are applied to determine the families of the most probable conformations, from which candidate predictions are selected. To test the capability of the method to recover template-based models from restraints, five single-domain proteins with structures that have been well-predicted by template-based methods were used; it was found that the resulting structures were of the same quality as the best of the original models. To assess whether the new approach can improve template-based predictions with incorrectly predicted domain packing, four such targets were selected from the CASP10 targets; for three of them the new approach resulted in significantly better predictions compared with the original template-based models. The new approach can be used to predict the structures of proteins for which good templates can be found for sections of the sequence or an overall good template can be found for the entire sequence but the prediction quality is remarkably weaker in putative domain-linker regions.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Virador, V.; Reyes Grajeda, J; Blanco-Labra, A
The full-length cDNA sequence (P93622{_}VITVI) of polyphenol oxidase (PPO) cDNA from grape Vitis vinifera L., cv Grenache, was found to encode a translated protein of 607 amino acids with an expected molecular weight of ca. 67 kDa and a predicted pI of 6.83. The translated amino acid sequence was 99%, identical to that of a white grape berry PPO (1) (5 out of 607 amino acid potential sequence differences). The protein was purified from Grenache grape berries by using traditional methods, and it was crystallized with ammonium acetate by the hanging-drop vapor diffusion method. The crystals were orthorhombic, space groupmore » C2221. The structure was obtained at 2.2 {angstrom} resolution using synchrotron radiation using the 39 kDa isozyme of sweet potato PPO (PDB code: 1BT1) as a phase donor. The basic symmetry of the cell parameters (a, b, and c and {alpha}, {beta}, and {gamma}) as well as in the number of asymmetric units in the unit cell of the crystals of PPO, differed between the two proteins. The structures of the two enzymes are quite similar in overall fold, the location of the helix bundles at the core, and the active site in which three histidines bind each of the two catalytic copper ions, and one of the histidines is engaged in a thioether linkage with a cysteine residue. The possibility that the formation of the Cys-His thioether linkage constitutes the activation step is proposed. No evidence of phosphorylation or glycoslyation was found in the electron density map. The mass of the crystallized protein appears to be only 38.4 kDa, and the processing that occurs in the grape berry that leads to this smaller size is discussed.« less
Kojima, Shinya; Suzuki, Kazufumi; Hirata, Masami; Shinohara, Hiroyuki; Ueno, Eiko
2013-03-01
To assess the ability of magnetic resonance imaging (MRI) to depict the semicircular canals of the inner ear by comparing results from the sampling perfection with application-optimized contrasts by using different flip angle evolutions (SPACE) sequence with those from the true free induction with steady precession (TrueFISP) sequence. A 1.5-T MRI system was used to perform an in vivo study of 10 healthy volunteers and 17 patients. A three-point visual score was employed for assessing the depiction of the semicircular canals and facial and vestibulocochlear nerves and the contrast-to-noise ratio (CNR) was computed for the vestibule and pons on images with the SPACE and TrueFIPS sequences. There were no susceptibility artifact-related filling defects with the SPACE sequence. However, the TrueFISP sequence showed filling defects for at least one semicircular canal on both sides in seven cases for healthy subjects and in 10 cases for patients. The CNR with the SPACE sequence was significantly higher than with the TrueFISP sequence (P < 0.05). There was no statistically significant difference in depicting the facial and the vestibulocochlear nerves (P = 0.32). For the depiction of the semicircular canal, the SPACE sequence is superior to the TrueFISP sequence. Copyright © 2012 Wiley Periodicals, Inc.
Computational analysis of sequence selection mechanisms.
Meyerguz, Leonid; Grasso, Catherine; Kleinberg, Jon; Elber, Ron
2004-04-01
Mechanisms leading to gene variations are responsible for the diversity of species and are important components of the theory of evolution. One constraint on gene evolution is that of protein foldability; the three-dimensional shapes of proteins must be thermodynamically stable. We explore the impact of this constraint and calculate properties of foldable sequences using 3660 structures from the Protein Data Bank. We seek a selection function that receives sequences as input, and outputs survival probability based on sequence fitness to structure. We compute the number of sequences that match a particular protein structure with energy lower than the native sequence, the density of the number of sequences, the entropy, and the "selection" temperature. The mechanism of structure selection for sequences longer than 200 amino acids is approximately universal. For shorter sequences, it is not. We speculate on concrete evolutionary mechanisms that show this behavior.
Sequence-similar, structure-dissimilar protein pairs in the PDB.
Kosloff, Mickey; Kolodny, Rachel
2008-05-01
It is often assumed that in the Protein Data Bank (PDB), two proteins with similar sequences will also have similar structures. Accordingly, it has proved useful to develop subsets of the PDB from which "redundant" structures have been removed, based on a sequence-based criterion for similarity. Similarly, when predicting protein structure using homology modeling, if a template structure for modeling a target sequence is selected by sequence alone, this implicitly assumes that all sequence-similar templates are equivalent. Here, we show that this assumption is often not correct and that standard approaches to create subsets of the PDB can lead to the loss of structurally and functionally important information. We have carried out sequence-based structural superpositions and geometry-based structural alignments of a large number of protein pairs to determine the extent to which sequence similarity ensures structural similarity. We find many examples where two proteins that are similar in sequence have structures that differ significantly from one another. The source of the structural differences usually has a functional basis. The number of such proteins pairs that are identified and the magnitude of the dissimilarity depend on the approach that is used to calculate the differences; in particular sequence-based structure superpositioning will identify a larger number of structurally dissimilar pairs than geometry-based structural alignments. When two sequences can be aligned in a statistically meaningful way, sequence-based structural superpositioning provides a meaningful measure of structural differences. This approach and geometry-based structure alignments reveal somewhat different information and one or the other might be preferable in a given application. Our results suggest that in some cases, notably homology modeling, the common use of nonredundant datasets, culled from the PDB based on sequence, may mask important structural and functional information. We have established a data base of sequence-similar, structurally dissimilar protein pairs that will help address this problem (http://luna.bioc.columbia.edu/rachel/seqsimstrdiff.htm).
Computational biology of RNA interactions.
Dieterich, Christoph; Stadler, Peter F
2013-01-01
The biodiversity of the RNA world has been underestimated for decades. RNA molecules are key building blocks, sensors, and regulators of modern cells. The biological function of RNA molecules cannot be separated from their ability to bind to and interact with a wide space of chemical species, including small molecules, nucleic acids, and proteins. Computational chemists, physicists, and biologists have developed a rich tool set for modeling and predicting RNA interactions. These interactions are to some extent determined by the binding conformation of the RNA molecule. RNA binding conformations are approximated with often acceptable accuracy by sequence and secondary structure motifs. Secondary structure ensembles of a given RNA molecule can be efficiently computed in many relevant situations by employing a standard energy model for base pair interactions and dynamic programming techniques. The case of bi-molecular RNA-RNA interactions can be seen as an extension of this approach. However, unbiased transcriptome-wide scans for local RNA-RNA interactions are computationally challenging yet become efficient if the binding motif/mode is known and other external information can be used to confine the search space. Computational methods are less developed for proteins and small molecules, which bind to RNA with very high specificity. Binding descriptors of proteins are usually determined by in vitro high-throughput assays (e.g., microarrays or sequencing). Intriguingly, recent experimental advances, which are mostly based on light-induced cross-linking of binding partners, render in vivo binding patterns accessible yet require new computational methods for careful data interpretation. The grand challenge is to model the in vivo situation where a complex interplay of RNA binders competes for the same target RNA molecule. Evidently, bioinformaticians are just catching up with the impressive pace of these developments. Copyright © 2012 John Wiley & Sons, Ltd.
Vetting, Matthew W.; Al-Obaidi, Nawar; Zhao, Suwen; ...
2014-12-25
The rate at which genome sequencing data is accruing demands enhanced methods for functional annotation and metabolism discovery. Solute binding proteins (SBPs) facilitate the transport of the first reactant in a metabolic pathway, thereby constraining the regions of chemical space and the chemistries that must be considered for pathway reconstruction. Here in this paper, we describe high-throughput protein production and differential scanning fluorimetry platforms, which enabled the screening of 158 SBPs against a 189 component library specifically tailored for this class of proteins. Like all screening efforts, this approach is limited by the practical constraints imposed by construction of themore » library, i.e., we can study only those metabolites that are known to exist and which can be made in sufficient quantities for experimentation. To move beyond these inherent limitations, we illustrate the promise of crystallographic- and mass spectrometric-based approaches for the unbiased use of entire metabolomes as screening libraries. Together, our approaches identified 40 new SBP ligands, generated experiment-based annotations for 2084 SBPs in 71 isofunctional clusters, and defined numerous metabolic pathways, including novel catabolic pathways for the utilization of ethanolamine as sole nitrogen source and the use of D-Ala-D-Ala as sole carbon source. These efforts begin to define an integrated strategy for realizing the full value of amassing genome sequence data.« less
Carrizo, Maria E; Irazoqui, Fernando J; Lardone, Ricardo D; Nores, Gustavo A; Curtino, Juan A; Capaldi, Stefano; Perduca, Massimiliano; Monaco, Hugo L
2004-04-01
The lectin from the common edible mushroom Agaricus bisporus (ABL) belongs to the group of proteins that have the property of binding the Thomsen-Friedenreich antigen (T-antigen) selectively and with high affinity, but does not show any sequence similarity to the other proteins that share this property. The ABL sequence is instead similar to those of members of the saline-soluble fungal lectins, a protein family with pesticidal properties. The presence of different isoforms has been reported. It has been found that in order to be able to grow diffraction-quality crystals of the lectin, it is essential to separate the isoforms, which was performed by preparative isoelectric focusing. Using standard procedures, it was possible to crystallize the most basic of the forms by either vapour diffusion or equilibrium dialysis, but attempts to grow crystals of the other more acidic forms were unsuccessful. The ABL crystals belong to the orthorhombic space group C222(1), with unit-cell parameters a = 93.06, b = 98.16, c = 76.38 A, and diffract to a resolution of 2.2 A on a conventional source at room temperature. It is expected that the solution of this structure will yield further valuable information on the differences in the T-antigen-binding folds and will perhaps help to clarify the details of the ligand binding to the protein.
Goonesekere, Nalin Cw
2009-01-01
The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between proteins sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-blast, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
Sequencing proteins with transverse ionic transport in nanochannels.
Boynton, Paul; Di Ventra, Massimiliano
2016-05-03
De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer's Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique's potential for de novo protein sequencing.
Predicting membrane protein types by the LLDA algorithm.
Wang, Tong; Yang, Jie; Shen, Hong-Bin; Chou, Kuo-Chen
2008-01-01
Membrane proteins are generally classified into the following eight types: (1) type I transmembrane, (2) type II, (3) type III, (4) type IV, (5) multipass transmembrane, (6) lipid-chain-anchored membrane, (7) GPI-anchored membrane, and (8) peripheral membrane (K.C. Chou and H.B. Shen: BBRC, 2007, 360: 339-345). Knowing the type of an uncharacterized membrane protein often provides useful clues for finding its biological function and interaction process with other molecules in a biological system. With the explosion of protein sequences generated in the Post-Genomic Age, it is urgent to develop an automated method to deal with such a challenge. Recently, the PsePSSM (Pseudo Position-Specific Score Matrix) descriptor is proposed by Chou and Shen (Biochem. Biophys. Res. Comm. 2007, 360, 339-345) to represent a protein sample. The advantage of the PsePSSM descriptor is that it can combine the evolution information and sequence-correlated information. However, incorporating all these effects into a descriptor may cause the "high dimension disaster". To overcome such a problem, the fusion approach was adopted by Chou and Shen. Here, a completely different approach, the so-called LLDA (Local Linear Discriminant Analysis) is introduced to extract the key features from the high-dimensional PsePSSM space. The dimension-reduced descriptor vector thus obtained is a compact representation of the original high dimensional vector. Our jackknife and independent dataset test results indicate that it is very promising to use the LLDA approach to cope with complicated problems in biological systems, such as predicting the membrane protein type.
Discrete Haar transform and protein structure.
Morosetti, S
1997-12-01
The discrete Haar transform of the sequence of the backbone dihedral angles (phi and psi) was performed over a set of X-ray protein structures of high resolution from the Brookhaven Protein Data Bank. Afterwards, the new dihedral angles were calculated by the inverse transform, using a growing number of Haar functions, from the lower to the higher degree. New structures were obtained using these dihedral angles, with standard values for bond lengths and angles, and with omega = 0 degree. The reconstructed structures were compared with the experimental ones, and analyzed by visual inspection and statistical analysis. When half of the Haar coefficients were used, all the reconstructed structures were not yet collapsed to a tertiary folding, but they showed yet realized most of the secondary motifs. These results indicate a substantial separation of structural information in the space of Haar transform, with the secondary structural information mainly present in the Haar coefficients of lower degrees, and the tertiary one present in the higher degree coefficients. Because of this separation, the representation of the folded structures in the space of Haar transform seems a promising candidate to encompass the problem of premature convergence in genetic algorithms.
NASA Technical Reports Server (NTRS)
Nakayama, S.; Kretsinger, R. H.
1993-01-01
In the first report in this series we presented dendrograms based on 152 individual proteins of the EF-hand family. In the second we used sequences from 228 proteins, containing 835 domains, and showed that eight of the 29 subfamilies are congruent and that the EF-hand domains of the remaining 21 subfamilies have diverse evolutionary histories. In this study we have computed dendrograms within and among the EF-hand subfamilies using the encoding DNA sequences. In most instances the dendrograms based on protein and on DNA sequences are very similar. Significant differences between protein and DNA trees for calmodulin remain unexplained. In our fourth report we evaluate the sequences and the distribution of introns within the EF-hand family and conclude that exon shuffling did not play a significant role in its evolution.
Application of 2D graphic representation of protein sequence based on Huffman tree method.
Qi, Zhao-Hui; Feng, Jun; Qi, Xiao-Qin; Li, Ling
2012-05-01
Based on Huffman tree method, we propose a new 2D graphic representation of protein sequence. This representation can completely avoid loss of information in the transfer of data from a protein sequence to its graphic representation. The method consists of two parts. One is about the 0-1 codes of 20 amino acids by Huffman tree with amino acid frequency. The amino acid frequency is defined as the statistical number of an amino acid in the analyzed protein sequences. The other is about the 2D graphic representation of protein sequence based on the 0-1 codes. Then the applications of the method on ten ND5 genes and seven Escherichia coli strains are presented in detail. The results show that the proposed model may provide us with some new sights to understand the evolution patterns determined from protein sequences and complete genomes. Copyright © 2012 Elsevier Ltd. All rights reserved.
Ollikainen, Noah; Smith, Colin A.; Fraser, James S.; Kortemme, Tanja
2013-01-01
Sampling alternative conformations is key to understanding how proteins work and engineering them for new functions. However, accurately characterizing and modeling protein conformational ensembles remains experimentally and computationally challenging. These challenges must be met before protein conformational heterogeneity can be exploited in protein engineering and design. Here, as a stepping stone, we describe methods to detect alternative conformations in proteins and strategies to model these near-native conformational changes based on backrub-type Monte Carlo moves in Rosetta. We illustrate how Rosetta simulations that apply backrub moves improve modeling of point mutant side chain conformations, native side chain conformational heterogeneity, functional conformational changes, tolerated sequence space, protein interaction specificity, and amino acid co-variation across protein-protein interfaces. We include relevant Rosetta command lines and RosettaScripts to encourage the application of these types of simulations to other systems. Our work highlights that critical scoring and sampling improvements will be necessary to approximate conformational landscapes. Challenges for the future development of these methods include modeling conformational changes that propagate away from designed mutation sites and modulating backbone flexibility to predictively design functionally important conformational heterogeneity. PMID:23422426
Watanabe, K; Yoshioka, K; Ito, H; Ishigami, M; Takagi, K; Utsunomiya, S; Kobayashi, M; Kishimoto, H; Yano, M; Kakumu, S
1999-11-10
Hypervariable region 1 (HVR1) proteins of hepatitis C virus (HCV) have been reported to react broadly with sera of patients with HCV infection. However, the variability of the broad reactivity of individual HVR1 proteins has not been elucidated. We assessed the reactivity of 25 different HVR1 proteins (genotype 1b) with sera of 81 patients with HCV infection (genotype 1b) by Western blot. HVR1 proteins reacted with 2-60 sera. The number of sera reactive with each HVR1 protein significantly correlated with the number of amino acid residues identical to the consensus sequence defined by Puntoriero et al. (G. Puntoriero, A. Lahm, S. Zucchelli, B. B. Ercole, R. Tafi, M. Penzzanera, M. U. Mondelli, R. Cortese, A. Tramontano, G. Galfre', and A. Nicosia. 1998. EMBO J. 17, 3521-3533. ) (r = 0.561, P < 0.005). The most widely reactive HVR1 protein, 12-22, had a sequence similar to the consensus sequence. The peptide with C-terminal 13-amino-acids sequence of HVR1 protein 12-22 (NH2-CSFTSLFTPGPSQK) was injected into rabbits as an immunogen. The rabbit immune sera reacted with 9 of 25 HVR1 proteins of genotype 1b including HVR1 protein 12-22 and with 3 of 12 proteins of genotype 2a. These results indicate that the HVR1 protein broadly reactive with patients' sera has a sequence similar to the consensus sequence, can induce broadly reactive sera, and could be one of the candidate immunogens in a prophylactic vaccine against HCV. Copyright 1999 Academic Press.
The eukaryotic signal sequence, YGRL, targets the chlamydial inclusion
Kabeiseman, Emily J.; Cichos, Kyle H.; Moore, Elizabeth R.
2014-01-01
Understanding how host proteins are targeted to pathogen-specified organelles, like the chlamydial inclusion, is fundamentally important to understanding the biogenesis of these unique subcellular compartments and how they maintain autonomy within the cell. Syntaxin 6, which localizes to the chlamydial inclusion, contains an YGRL signal sequence. The YGRL functions to return syntaxin 6 to the trans-Golgi from the plasma membrane, and deletion of the YGRL signal sequence from syntaxin 6 also prevents the protein from localizing to the chlamydial inclusion. YGRL is one of three YXXL (YGRL, YQRL, and YKGL) signal sequences which target proteins to the trans-Golgi. We designed various constructs of eukaryotic proteins to test the specificity and propensity of YXXL sequences to target the inclusion. The YGRL signal sequence redirects proteins (e.g., Tgn38, furin, syntaxin 4) that normally do not localize to the chlamydial inclusion. Further, the requirement of the YGRL signal sequence for syntaxin 6 localization to inclusions formed by different species of Chlamydia is conserved. These data indicate that there is an inherent property of the chlamydial inclusion, which allows it to recognize the YGRL signal sequence. To examine whether this “inherent property” was protein or lipid in nature, we asked if deletion of the YGRL signal sequence from syntaxin 6 altered the ability of the protein to interact with proteins or lipids. Deletion or alteration of the YGRL from syntaxin 6 does not appreciably impact syntaxin 6-protein interactions, but does decrease syntaxin 6-lipid interactions. Intriguingly, data also demonstrate that YKGL or YQRL can successfully substitute for YGRL in localization of syntaxin 6 to the chlamydial inclusion. Importantly and for the first time, we are establishing that a eukaryotic signal sequence targets the chlamydial inclusion. PMID:25309881
Pang, Erli; Wu, Xiaomei; Lin, Kui
2016-06-01
Protein evolution plays an important role in the evolution of each genome. Because of their functional nature, in general, most of their parts or sites are differently constrained selectively, particularly by purifying selection. Most previous studies on protein evolution considered individual proteins in their entirety or compared protein-coding sequences with non-coding sequences. Less attention has been paid to the evolution of different parts within each protein of a given genome. To this end, based on PfamA annotation of all human proteins, each protein sequence can be split into two parts: domains or unassigned regions. Using this rationale, single nucleotide polymorphisms (SNPs) in protein-coding sequences from the 1000 Genomes Project were mapped according to two classifications: SNPs occurring within protein domains and those within unassigned regions. With these classifications, we found: the density of synonymous SNPs within domains is significantly greater than that of synonymous SNPs within unassigned regions; however, the density of non-synonymous SNPs shows the opposite pattern. We also found there are signatures of purifying selection on both the domain and unassigned regions. Furthermore, the selective strength on domains is significantly greater than that on unassigned regions. In addition, among all of the human protein sequences, there are 117 PfamA domains in which no SNPs are found. Our results highlight an important aspect of protein domains and may contribute to our understanding of protein evolution.
Lorenzo, J Ramiro; Alonso, Leonardo G; Sánchez, Ignacio E
2015-01-01
Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage "Protein and nucleic acid structure and sequence analysis".
AlignMe—a membrane protein sequence alignment web server
Stamm, Marcus; Staritzbichler, René; Khafizov, Kamil; Forrest, Lucy R.
2014-01-01
We present a web server for pair-wise alignment of membrane protein sequences, using the program AlignMe. The server makes available two operational modes of AlignMe: (i) sequence to sequence alignment, taking two sequences in fasta format as input, combining information about each sequence from multiple sources and producing a pair-wise alignment (PW mode); and (ii) alignment of two multiple sequence alignments to create family-averaged hydropathy profile alignments (HP mode). For the PW sequence alignment mode, four different optimized parameter sets are provided, each suited to pairs of sequences with a specific similarity level. These settings utilize different types of inputs: (position-specific) substitution matrices, secondary structure predictions and transmembrane propensities from transmembrane predictions or hydrophobicity scales. In the second (HP) mode, each input multiple sequence alignment is converted into a hydrophobicity profile averaged over the provided set of sequence homologs; the two profiles are then aligned. The HP mode enables qualitative comparison of transmembrane topologies (and therefore potentially of 3D folds) of two membrane proteins, which can be useful if the proteins have low sequence similarity. In summary, the AlignMe web server provides user-friendly access to a set of tools for analysis and comparison of membrane protein sequences. Access is available at http://www.bioinfo.mpg.de/AlignMe PMID:24753425
Tang, Lixia; Wang, Xiong; Ru, Beibei; Sun, Hengfei; Huang, Jian; Gao, Hui
2014-06-01
Recent computational and bioinformatics advances have enabled the efficient creation of novel biocatalysts by reducing amino acid variability at hot spot regions. To further expand the utility of this strategy, we present here a tool called Multi-site Degenerate Codon Analyzer (MDC-Analyzer) for the automated design of intelligent mutagenesis libraries that can completely cover user-defined randomized sequences, especially when multiple contiguous and/or adjacent sites are targeted. By initially defining an objective function, the possible optimal degenerate PCR primer profiles could be automatically explored using the heuristic approach of Greedy Best-First-Search. Compared to the previously developed DC-Analyzer, MDC-Analyzer allows for the existence of a small amount of undesired sequences as a tradeoff between the number of degenerate primers and the encoded library size while still providing all the benefits of DC-Analyzer with the ability to randomize multiple contiguous sites. MDC-Analyzer was validated using a series of randomly generated mutation schemes and experimental case studies on the evolution of halohydrin dehalogenase, which proved that the MDC methodology is more efficient than other methods and is particularly well-suited to exploring the sequence space of proteins using data-driven protein engineering strategies.
Specific minor groove solvation is a crucial determinant of DNA binding site recognition
Harris, Lydia-Ann; Williams, Loren Dean; Koudelka, Gerald B.
2014-01-01
The DNA sequence preferences of nearly all sequence specific DNA binding proteins are influenced by the identities of bases that are not directly contacted by protein. Discrimination between non-contacted base sequences is commonly based on the differential abilities of DNA sequences to allow narrowing of the DNA minor groove. However, the factors that govern the propensity of minor groove narrowing are not completely understood. Here we show that the differential abilities of various DNA sequences to support formation of a highly ordered and stable minor groove solvation network are a key determinant of non-contacted base recognition by a sequence-specific binding protein. In addition, disrupting the solvent network in the non-contacted region of the binding site alters the protein's ability to recognize contacted base sequences at positions 5–6 bases away. This observation suggests that DNA solvent interactions link contacted and non-contacted base recognition by the protein. PMID:25429976
MIPS: a database for protein sequences and complete genomes.
Mewes, H W; Hani, J; Pfeiffer, F; Frishman, D
1998-01-01
The MIPS group [Munich Information Center for Protein Sequences of the German National Center for Environment and Health (GSF)] at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, is involved in a number of data collection activities, including a comprehensive database of the yeast genome, a database reflecting the progress in sequencing the Arabidopsis thaliana genome, the systematic analysis of other small genomes and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). Through its WWW server (http://www.mips.biochem.mpg.de ) MIPS provides access to a variety of generic databases, including a database of protein families as well as automatically generated data by the systematic application of sequence analysis algorithms. The yeast genome sequence and its related information was also compiled on CD-ROM to provide dynamic interactive access to the 16 chromosomes of the first eukaryotic genome unraveled. PMID:9399795
Nakano, Shogo; Asano, Yasuhisa
2015-02-03
Development of software and methods for design of complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software, and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that accumulation of neutral mutations occurs during the process of S-HNLs evolution from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs.
NASA Astrophysics Data System (ADS)
Nakano, Shogo; Asano, Yasuhisa
2015-02-01
Development of software and methods for design of complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software, and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that accumulation of neutral mutations occurs during the process of S-HNLs evolution from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs.
Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard
2009-05-01
The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation.
Zhou, Carol L Ecale
2015-01-01
In order to better define regions of similarity among related protein structures, it is useful to identify the residue-residue correspondences among proteins. Few codes exist for constructing a one-to-many multiple sequence alignment derived from a set of structure or sequence alignments, and a need was evident for creating such a tool for combining pairwise structure alignments that would allow for insertion of gaps in the reference structure. This report describes a new Python code, CombAlign, which takes as input a set of pairwise sequence alignments (which may be structure based) and generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA). The use and utility of CombAlign was demonstrated by generating gapped MSSAs using sets of pairwise structure-based sequence alignments between structure models of the matrix protein (VP40) and pre-small/secreted glycoprotein (sGP) of Reston Ebolavirus and the corresponding proteins of several other filoviruses. The gapped MSSAs revealed structure-based residue-residue correspondences, which enabled identification of structurally similar versus differing regions in the Reston proteins compared to each of the other corresponding proteins. CombAlign is a new Python code that generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA) given a set of pairwise sequence alignments (which may be structure based). CombAlign has utility in assisting the user in distinguishing structurally conserved versus divergent regions on a reference protein structure relative to other closely related proteins. CombAlign was developed in Python 2.6, and the source code is available for download from the GitHub code repository.
Van Hove, B; Staudenmaier, H; Braun, V
1990-12-01
Citrate and iron have to enter only the periplasmic space in order to induce the citrate-dependent iron(III) transport system of Escherichia coli. The five transport genes fecABCDE form an operon and are transcribed from fecA to fecE. Two genes, termed fecI and fecR, that mediate induction by iron(III) dicitrate have been identified upstream of fecA. The fecI gene encodes a protein of 173 amino acids (molecular weight, 19,478); the fecR gene encodes a protein of 317 amino acids (molecular weight, 35,529). Chromosomal fecI::Mu d1 mutants were unable to grow with iron(III) dicitrate as the sole iron source and synthesized no FecA outer membrane receptor protein. Growth was restored by transformation with plasmids encoding fecI or fecI and fecR. FecA and beta-galactosidase syntheses under transcription control of the fecB gene (fecB::Mu d1) were constitutive in fecI transformants and were regulated by iron(III) dicitrate in fecI fecR transformants. The amino acid sequence of the FecI protein contains a region close to the carboxy-terminal end for which a helix-turn-helix motif is predicted, which is typical for DNA-binding regulatory proteins. The FecI protein was found in the membrane, and the FecR protein was found in the periplasmic fraction. It is proposed that the FecR protein is the sensor that recognizes iron(III) dicitrate in the periplasm. The FecI protein activates fec gene expression by binding to the fec operator region. In the absence of citrate, FecR inactivates FecI. The lack of sequence homologies to other transmembrane signaling proteins and the location of the two proteins suggest a new type of transmembrane control mechanism.
Matrix Transformations between Certain Sequence Spaces over the Non-Newtonian Complex Field
Efe, Hakan
2014-01-01
In some cases, the most general linear operator between two sequence spaces is given by an infinite matrix. So the theory of matrix transformations has always been of great interest in the study of sequence spaces. In the present paper, we introduce the matrix transformations in sequence spaces over the field ℂ* and characterize some classes of infinite matrices with respect to the non-Newtonian calculus. Also we give the necessary and sufficient conditions on an infinite matrix transforming one of the classical sets over ℂ* to another one. Furthermore, the concept for sequence-to-sequence and series-to-series methods of summability is given with some illustrated examples. PMID:25110740
A Novel Cylindrical Representation for Characterizing Intrinsic Properties of Protein Sequences.
Yu, Jia-Feng; Dou, Xiang-Hua; Wang, Hong-Bo; Sun, Xiao; Zhao, Hui-Ying; Wang, Ji-Hua
2015-06-22
The composition and sequence order of amino acid residues are the two most important characteristics to describe a protein sequence. Graphical representations facilitate visualization of biological sequences and produce biologically useful numerical descriptors. In this paper, we propose a novel cylindrical representation by placing the 20 amino acid residue types in a circle and sequence positions along the z axis. This representation allows visualization of the composition and sequence order of amino acids at the same time. Ten numerical descriptors and one weighted numerical descriptor have been developed to quantitatively describe intrinsic properties of protein sequences on the basis of the cylindrical model. Their applications to similarity/dissimilarity analysis of nine ND5 proteins indicated that these numerical descriptors are more effective than several classical numerical matrices. Thus, the cylindrical representation obtained here provides a new useful tool for visualizing and charactering protein sequences. An online server is available at http://biophy.dzu.edu.cn:8080/CNumD/input.jsp .
Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification.
Sinclair, Robert M; Ravantti, Janne J; Bamford, Dennis H
2017-04-15
Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allow them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. Copyright © 2017 Sinclair et al.
Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification
Sinclair, Robert M.; Ravantti, Janne J.
2017-01-01
ABSTRACT Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allow them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. PMID:28122979
Altahawi, Faysal F; Blount, Kevin J; Morley, Nicholas P; Raithel, Esther; Omar, Imran M
2017-01-01
To compare a faster, new, high-resolution accelerated 3D-fast-spin-echo (3D-FSE) acquisition sequence (CS-SPACE) to traditional 2D and high-resolution 3D sequences for knee 3-T magnetic resonance imaging (MRI). Twenty patients received knee MRIs that included routine 2D (T1, PD ± FS, T2-FS; 0.5 × 0.5 × 3 mm 3 ; ∼10 min), traditional 3D FSE (SPACE-PD-FS; 0.5 × 0.5 × 0.5 mm 3 ; ∼7.5 min), and accelerated 3D-FSE prototype (CS-SPACE-PD-FS; 0.5 × 0.5 × 0.5 mm 3 ; ∼5 min) acquisitions on a 3-T MRI system (Siemens MAGNETOM Skyra). Three musculoskeletal radiologists (MSKRs) prospectively and independently reviewed the studies with graded surveys comparing image and diagnostic quality. Tissue-specific signal-to-noise ratios (SNR) and contrast-to-noise ratios (CNR) were also compared. MSKR-perceived diagnostic quality of cartilage was significantly higher for CS-SPACE than for SPACE and 2D sequences (p < 0.001). Assessment of diagnostic quality of menisci and synovial fluid was higher for CS-SPACE than for SPACE (p < 0.001). CS-SPACE was not significantly different from SPACE but had lower assessments than 2D sequences for evaluation of bones, ligaments, muscles, and fat (p ≤ 0.004). 3D sequences had higher spatial resolution, but lower overall assessed contrast (p < 0.001). Overall image quality from CS-SPACE was assessed as higher than SPACE (p = 0.007), but lower than 2D sequences (p < 0.001). Compared to SPACE, CS-SPACE had higher fluid SNR and CNR against all other tissues (all p < 0.001). The CS-SPACE prototype allows for faster isotropic acquisitions of knee MRIs over currently used protocols. High fluid-to-cartilage CNR and higher spatial resolution over routine 2D sequences may present a valuable role for CS-SPACE in the evaluation of cartilage and menisci.
Predicting residue-wise contact orders in proteins by support vector regression.
Song, Jiangning; Burrage, Kevin
2006-10-03
The residue-wise contact order (RWCO) describes the sequence separations between the residues of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered as a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable important information to reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences. Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.
Cloning and expression of recombinant adhesive protein Mefp-1 of the blue mussel, Mytilus edulis
Silverman, Heather G.; Roberto, Francisco F.
2006-01-17
The present invention comprises a Mytilus edulis cDNA sequenc having a nucleotide sequence that encodes for the Mytilus edulis foot protein-1 (Mefp-1), an example of a mollusk foot protein. Mefp-1 is an integral component of the blue mussels' adhesive protein complex, which allows the mussel to attach to objects underwater. The isolation, purification and sequencing of the Mefp-1 gene will allow researchers to produce Mefp-1 protein using genetic engineering techniques. The discovery of Mefp-1 gene sequence will also allow scientists to better understand how the blue mussel creates its waterproof adhesive protein complex.
Determination of the sequences of protein-derived peptides and peptide mixtures by mass spectrometry
Morris, Howard R.; Williams, Dudley H.; Ambler, Richard P.
1971-01-01
Micro-quantities of protein-derived peptides have been converted into N-acetylated permethyl derivatives, and their sequences determined by low-resolution mass spectrometry without prior knowledge of their amino acid compositions or lengths. A new strategy is suggested for the mass spectrometric sequencing of oligopeptides or proteins, involving gel filtration of protein hydrolysates and subsequent sequence analysis of peptide mixtures. Finally, results are given that demonstrate for the first time the use of mass spectrometry for the analysis of a protein-derived peptide mixture, again without prior knowledge of the protein or components within the mixture. PMID:5158904
Kizaki, Seiichiro; Chandran, Anandhakumar; Sugiyama, Hiroshi
2016-03-02
Tet (ten-eleven translocation) family proteins have the ability to oxidize 5-methylcytosine (mC) to 5-hydroxymethylcytosine (hmC), 5-formylcytosine (fC), and 5-carboxycytosine (caC). However, the oxidation reaction of Tet is not understood completely. Evaluation of genomic-level epigenetic changes by Tet protein requires unbiased identification of the highly selective oxidation sites. In this study, we used high-throughput sequencing to investigate the sequence specificity of mC oxidation by Tet1. A 6.6×10(4) -member mC-containing random DNA-sequence library was constructed. The library was subjected to Tet-reactive pulldown followed by high-throughput sequencing. Analysis of the obtained sequence data identified the Tet1-reactive sequences. We identified mCpG as a highly reactive sequence of Tet1 protein. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Arndt, E; Scholzen, T; Krömer, W; Hatakeyama, T; Kimura, M
1991-06-01
Approximately 40 ribosomal proteins from each Halobacterium marismortui and Bacillus stearothermophilus have been sequenced either by direct protein sequence analysis or by DNA sequence analysis of the appropriate genes. The comparison of the amino acid sequences from the archaebacterium H marismortui with the available ribosomal proteins from the eubacterial and eukaryotic kingdoms revealed four different groups of proteins: 24 proteins are related to both eubacterial as well as eukaryotic proteins. Eleven proteins are exclusively related to eukaryotic counterparts. For three proteins only eubacterial relatives-and for another three proteins no counterpart-could be found. The similarities of the halobacterial ribosomal proteins are in general somewhat higher to their eukaryotic than to their eubacterial counterparts. The comparison of B stearothermophilus proteins with their E coli homologues showed that the proteins evolved at different rates. Some proteins are highly conserved with 64-76% identity, others are poorly conserved with only 25-34% identical amino acid residues.
HomPPI: a class of sequence homology based protein-protein interface prediction methods
2011-01-01
Background Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. Results We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the target can be reliably identified. The HomPPI web server is available at http://homppi.cs.iastate.edu/. Conclusions Sequence homology-based methods offer a class of computationally efficient and reliable approaches for predicting the protein-protein interface residues that participate in either obligate or transient interactions. For query proteins involved in transient interactions, the reliability of interface residue prediction can be improved by exploiting knowledge of putative interaction partners. PMID:21682895
Cloning and sequence analysis of a cDNA clone coding for the mouse GM2 activator protein.
Bellachioma, G; Stirling, J L; Orlacchio, A; Beccari, T
1993-01-01
A cDNA (1.1 kb) containing the complete coding sequence for the mouse GM2 activator protein was isolated from a mouse macrophage library using a cDNA for the human protein as a probe. There was a single ATG located 12 bp from the 5' end of the cDNA clone followed by an open reading frame of 579 bp. Northern blot analysis of mouse macrophage RNA showed that there was a single band with a mobility corresponding to a size of 2.3 kb. We deduce from this that the mouse mRNA, in common with the mRNA for the human GM2 activator protein, has a long 3' untranslated sequence of approx. 1.7 kb. Alignment of the mouse and human deduced amino acid sequences showed 68% identity overall and 75% identity for the sequence on the C-terminal side of the first 31 residues, which in the human GM2 activator protein contains the signal peptide. Hydropathicity plots showed great similarity between the mouse and human sequences even in regions of low sequence similarity. There is a single N-glycosylation site in the mouse GM2 activator protein sequence (Asn151-Phe-Thr) which differs in its location from the single site reported in the human GM2 activator protein sequence (Asn63-Val-Thr). Images Figure 1 PMID:7689829
Darville, Lancia N F; Merchant, Mark E; Maccha, Venkata; Siddavarapu, Vivekananda Reddy; Hasan, Azeem; Murray, Kermit K
2012-02-01
Mass spectrometry in conjunction with de novo sequencing was used to determine the amino acid sequence of a 35kDa lectin protein isolated from the serum of the American alligator that exhibits binding to mannose. The protein N-terminal sequence was determined using Edman degradation and enzymatic digestion with different proteases was used to generate peptide fragments for analysis by liquid chromatography tandem mass spectrometry (LC MS/MS). Separate analysis of the protein digests with multiple enzymes enhanced the protein sequence coverage. De novo sequencing was accomplished using MASCOT Distiller and PEAKS software and the sequences were searched against the NCBI database using MASCOT and BLAST to identify homologous peptides. MS analysis of the intact protein indicated that it is present primarily as monomer and dimer in vitro. The isolated 35kDa protein was ~98% sequenced and found to have 313 amino acids and nine cysteine residues and was identified as an alligator lectin. The alligator lectin sequence was aligned with other lectin sequences using DIALIGN and ClustalW software and was found to exhibit 58% and 59% similarity to both human and mouse intelectin-1. The alligator lectin exhibited strong binding affinities toward mannan and mannose as compared to other tested carbohydrates. Copyright © 2011 Elsevier Inc. All rights reserved.
Wang, Yu-Ling; Goh, King-Xiang; Wu, Wen-guey; Chen, Chun-Jung
2004-10-01
Cysteine-rich secretory proteins (CRISPs) play an important role in the innate immune system and are transcriptionally regulated by androgens in several tissues. The proteins are mostly found in the epididymis and granules of mammals, whilst a number of snake venoms also contain CRISP-family proteins. The natrin protein from the venom of Naja atra (Taiwan cobra), which belongs to a family of CRISPs and has a cysteine-rich C-terminal amino-acid sequence, has been purified using a three-stage chromatography procedure and crystals suitable for X-ray analysis have been obtained using the hanging-drop vapour-diffusion method. X-ray diffraction data were collected to 1.58 A resolution using synchrotron radiation; the crystals belong to space group C222(1), with unit-cell parameters a = 59.172, b = 65.038, c = 243.156 A. There are two protein molecules in the asymmetric unit and the Matthews coefficient is estimated to be 2.35 A3 Da(-1), corresponding to a solvent content of 47.60%.
NASA Technical Reports Server (NTRS)
Kretsinger, R. H.; Nakayama, S.
1993-01-01
In the previous three reports in this series we demonstrated that the EF-hand family of proteins evolved by a complex pattern of gene duplication, transposition, and splicing. The dendrograms based on exon sequences are nearly identical to those based on protein sequences for troponin C, the essential light chain myosin, the regulatory light chain, and calpain. This validates both the computational methods and the dendrograms for these subfamilies. The proposal of congruence for calmodulin, troponin C, essential light chain, and regulatory light chain was confirmed. There are, however, significant differences in the calmodulin dendrograms computed from DNA and from protein sequences. In this study we find that introns are distributed throughout the EF-hand domain and the interdomain regions. Further, dendrograms based on intron type and distribution bear little resemblance to those based on protein or on DNA sequences. We conclude that introns are inserted, and probably deleted, with relatively high frequency. Further, in the EF-hand family exons do not correspond to structural domains and exon shuffling played little if any role in the evolution of this widely distributed homolog family. Calmodulin has had a turbulent evolution. Its dendrograms based on protein sequence, exon sequence, 3'-tail sequence, intron sequences, and intron positions all show significant differences.
Reiz, Bela; Li, Liang
2010-09-01
Controlled hydrolysis of proteins to generate peptide ladders combined with mass spectrometric analysis of the resultant peptides can be used for protein sequencing. In this paper, two methods of improving the microwave-assisted protein hydrolysis process are described to enable rapid sequencing of proteins containing disulfide bonds and increase sequence coverage, respectively. It was demonstrated that proteins containing disulfide bonds could be sequenced by MS analysis by first performing hydrolysis for less than 2 min, followed by 1 h of reduction to release the peptides originally linked by disulfide bonds. It was shown that a strong base could be used as a catalyst for microwave-assisted protein hydrolysis, producing complementary sequence information to that generated by microwave-assisted acid hydrolysis. However, using either acid or base hydrolysis, amide bond breakages in small regions of the polypeptide chains of the model proteins (e.g., cytochrome c and lysozyme) were not detected. Dynamic light scattering measurement of the proteins solubilized in an acid or base indicated that protein-protein interaction or aggregation was not the cause of the failure to hydrolyze certain amide bonds. It was speculated that there were some unknown local structures that might play a role in preventing an acid or base from reacting with the peptide bonds therein. 2010 American Society for Mass Spectrometry. Published by Elsevier Inc. All rights reserved.
Comparative analysis of the prion protein gene sequences in African lion.
Wu, Chang-De; Pang, Wan-Yong; Zhao, De-Ming
2006-10-01
The prion protein gene of African lion (Panthera Leo) was first cloned and polymorphisms screened. The results suggest that the prion protein gene of eight African lions is highly homogenous. The amino acid sequences of the prion protein (PrP) of all samples tested were identical. Four single nucleotide polymorphisms (C42T, C81A, C420T, T600C) in the prion protein gene (Prnp) of African lion were found, but no amino acid substitutions. Sequence analysis showed that the higher homology is observed to felis catus AF003087 (96.7%) and to sheep number M31313.1 (96.2%) Genbank accessed. With respect to all the mammalian prion protein sequences compared, the African lion prion protein sequence has three amino acid substitutions. The homology might in turn affect the potential intermolecular interactions critical for cross species transmission of prion disease.
Experimental investigation of an RNA sequence space
NASA Technical Reports Server (NTRS)
Lee, Youn-Hyung; Dsouza, Lisa; Fox, George E.
1993-01-01
Modern rRNAs are the historic consequence of an ongoing evolutionary exploration of a sequence space. These extant sequences belong to a special subset of the sequence space that is comprised only of those primary sequences that can validly perform the biological function(s) required of the particular RNA. If it were possible to readily identify all such valid sequences, stochastic predictions could be made about the relative likelihood of various evolutionary pathways available to an RNA. Herein an experimental system which can assess whether a particular sequence is likely to have validity as a eubacterial 5S rRNA is described. A total of ten naturally occurring, and hence known to be valid, sequences and two point mutants of unknown validity were used to test the usefulness of the approach. Nine of the ten valid sequences tested positive whereas both mutants tested as clearly defective. The tenth valid sequence gave results that would be interpreted as reflecting a borderline status were the answer not known. These results demonstrate that it is possible to experimentally determine which sequences in local regions of the sequence space are potentially valid 5S rRNAs.
Basis for substrate recognition and distinction by matrix metalloproteinases
Ratnikov, Boris I.; Cieplak, Piotr; Gramatikoff, Kosi; Pierce, James; Eroshkin, Alexey; Igarashi, Yoshinobu; Kazanov, Marat; Sun, Qing; Godzik, Adam; Osterman, Andrei; Stec, Boguslaw; Strongin, Alex; Smith, Jeffrey W.
2014-01-01
Genomic sequencing and structural genomics produced a vast amount of sequence and structural data, creating an opportunity for structure–function analysis in silico [Radivojac P, et al. (2013) Nat Methods 10(3):221–227]. Unfortunately, only a few large experimental datasets exist to serve as benchmarks for function-related predictions. Furthermore, currently there are no reliable means to predict the extent of functional similarity among proteins. Here, we quantify structure–function relationships among three phylogenetic branches of the matrix metalloproteinase (MMP) family by comparing their cleavage efficiencies toward an extended set of phage peptide substrates that were selected from ∼64 million peptide sequences (i.e., a large unbiased representation of substrate space). The observed second-order rate constants [k(obs)] across the substrate space provide a distance measure of functional similarity among the MMPs. These functional distances directly correlate with MMP phylogenetic distance. There is also a remarkable and near-perfect correlation between the MMP substrate preference and sequence identity of 50–57 discontinuous residues surrounding the catalytic groove. We conclude that these residues represent the specificity-determining positions (SDPs) that allowed for the expansion of MMP proteolytic function during evolution. A transmutation of only a few selected SDPs proximal to the bound substrate peptide, and contributing the most to selectivity among the MMPs, is sufficient to enact a global change in the substrate preference of one MMP to that of another, indicating the potential for the rational and focused redesign of cleavage specificity in MMPs. PMID:25246591
A protein block based fold recognition method for the annotation of twilight zone sequences.
Suresh, V; Ganesan, K; Parthasarathy, S
2013-03-01
The description of protein backbone was recently improved with a group of structural fragments called Structural Alphabets instead of the regular three states (Helix, Sheet and Coil) secondary structure description. Protein Blocks is one of the Structural Alphabets used to describe each and every region of protein backbone including the coil. According to de Brevern (2000) the Protein Blocks has 16 structural fragments and each one has 5 residues in length. Protein Blocks fragments are highly informative among the available Structural Alphabets and it has been used for many applications. Here, we present a protein fold recognition method based on Protein Blocks for the annotation of twilight zone sequences. In our method, we align the predicted Protein Blocks of a query amino acid sequence with a library of assigned Protein Blocks of 953 known folds using the local pair-wise alignment. The alignment results with z-value ≥ 2.5 and P-value ≤ 0.08 are predicted as possible folds. Our method is able to recognize the possible folds for nearly 35.5% of the twilight zone sequences with their predicted Protein Block sequence obtained by pb_prediction, which is available at Protein Block Export server.
Meta sequence analysis of human blood peptides and their parent proteins.
Bowden, Peter; Pendrak, Voitek; Zhu, Peihong; Marshall, John G
2010-04-18
Sequence analysis of the blood peptides and their qualities will be key to understanding the mechanisms that contribute to error in LC-ESI-MS/MS. Analysis of peptides and their proteins at the level of sequences is much more direct and informative than the comparison of disparate accession numbers. A portable database of all blood peptide and protein sequences with descriptor fields and gene ontology terms might be useful for designing immunological or MRM assays from human blood. The results of twelve studies of human blood peptides and/or proteins identified by LC-MS/MS and correlated against a disparate array of genetic libraries were parsed and matched to proteins from the human ENSEMBL, SwissProt and RefSeq databases by SQL. The reported peptide and protein sequences were organized into an SQL database with full protein sequences and up to five unique peptides in order of prevalence along with the peptide count for each protein. Structured query language or BLAST was used to acquire descriptive information in current databases. Sampling error at the level of peptides is the largest source of disparity between groups. Chi Square analysis of peptide to protein distributions confirmed the significant agreement between groups on identified proteins. Copyright 2010. Published by Elsevier B.V.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, X.; Wilcox, G.L.
1993-12-31
We have implemented large scale back-propagation neural networks on a 544 node Connection Machine, CM-5, using the C language in MIMD mode. The program running on 512 processors performs backpropagation learning at 0.53 Gflops, which provides 76 million connection updates per second. We have applied the network to the prediction of protein tertiary structure from sequence information alone. A neural network with one hidden layer and 40 million connections is trained to learn the relationship between sequence and tertiary structure. The trained network yields predicted structures of some proteins on which it has not been trained given only their sequences.more » Presentation of the Fourier transform of the sequences accentuates periodicity in the sequence and yields good generalization with greatly increased training efficiency. Training simulations with a large, heterologous set of protein structures (111 proteins from CM-5 time) to solutions with under 2% RMS residual error within the training set (random responses give an RMS error of about 20%). Presentation of 15 sequences of related proteins in a testing set of 24 proteins yields predicted structures with less than 8% RMS residual error, indicating good apparent generalization.« less
3-base periodicity in coding DNA is affected by intercodon dinucleotides
Sánchez, Joaquín
2011-01-01
All coding DNAs exhibit 3-base periodicity (TBP), which may be defined as the tendency of nucleotides and higher order n-tuples, e.g. trinucleotides (triplets), to be preferentially spaced by 3, 6, 9 etc, bases, and we have proposed an association between TBP and clustering of same-phase triplets. We here investigated if TBP was affected by intercodon dinucleotide tendencies and whether clustering of same-phase triplets was involved. Under constant protein sequence intercodon dinucleotide frequencies depend on the distribution of synonymous codons. So, possible effects were revealed by randomly exchanging synonymous codons without altering protein sequences to subsequently document changes in TBP via frequency distribution of distances (FDD) of DNA triplets. A tripartite positive correlation was found between intercodon dinucleotide frequencies, clustering of same-phase triplets and TBP. So, intercodon C|A (where “|” indicates the boundary between codons) was more frequent in native human DNA than in the codon-shuffled sequences; higher C|A frequency occurred along with more frequent clustering of C|AN triplets (where N jointly represents A, C, G and T) and with intense CAN TBP. The opposite was found for C|G, which was less frequent in native than in shuffled sequences; lower C|G frequency occurred together with reduced clustering of C|GN triplets and with less intense CGN TBP. We hence propose that intercodon dinucleotides affect TBP via same-phase triplet clustering. A possible biological relevance of our findings is briefly discussed. PMID:21814388
NASA Astrophysics Data System (ADS)
Gao, Jie; Jiang, Li-Li; Xu, Zhen-Yuan
2009-10-01
A new chaos game representation of protein sequences based on the detailed hydrophobic-hydrophilic (HP) model has been proposed by Yu et al (Physica A 337 (2004) 171). A CGR-walk model is proposed based on the new CGR coordinates for the protein sequences from complete genomes in the present paper. The new CGR coordinates based on the detailed HP model are converted into a time series, and a long-memory ARFIMA(p, d, q) model is introduced into the protein sequence analysis. This model is applied to simulating real CGR-walk sequence data of twelve protein sequences. Remarkably long-range correlations are uncovered in the data and the results obtained from these models are reasonably consistent with those available from the ARFIMA(p, d, q) model.
Rodriguez-Rivas, Juan; Marsili, Simone; Juan, David; Valencia, Alfonso
2016-01-01
Protein–protein interactions are fundamental for the proper functioning of the cell. As a result, protein interaction surfaces are subject to strong evolutionary constraints. Recent developments have shown that residue coevolution provides accurate predictions of heterodimeric protein interfaces from sequence information. So far these approaches have been limited to the analysis of families of prokaryotic complexes for which large multiple sequence alignments of homologous sequences can be compiled. We explore the hypothesis that coevolution points to structurally conserved contacts at protein–protein interfaces, which can be reliably projected to homologous complexes with distantly related sequences. We introduce a domain-centered protocol to study the interplay between residue coevolution and structural conservation of protein–protein interfaces. We show that sequence-based coevolutionary analysis systematically identifies residue contacts at prokaryotic interfaces that are structurally conserved at the interface of their eukaryotic counterparts. In turn, this allows the prediction of conserved contacts at eukaryotic protein–protein interfaces with high confidence using solely mutational patterns extracted from prokaryotic genomes. Even in the context of high divergence in sequence (the twilight zone), where standard homology modeling of protein complexes is unreliable, our approach provides sequence-based accurate information about specific details of protein interactions at the residue level. Selected examples of the application of prokaryotic coevolutionary analysis to the prediction of eukaryotic interfaces further illustrate the potential of this approach. PMID:27965389
Bolla, J M; Dé, E; Dorez, A; Pagès, J M
2000-01-01
A novel pore-forming protein identified in Campylobacter was purified by ion-exchange chromatography and named Omp50 according to both its molecular mass and its outer membrane localization. We observed a pore-forming ability of Omp50 after re-incorporation into artificial membranes. The protein induced cation-selective channels with major conductance values of 50-60 pS in 1 M NaCl. N-terminal sequencing allowed us to identify the predicted coding sequence Cj1170c from the Campylobacter jejuni genome database as the corresponding gene in the NCTC 11168 genome sequence. The gene, designated omp50, consists of a 1425 bp open reading frame encoding a deduced 453-amino acid protein with a calculated pI of 5.81 and a molecular mass of 51169.2 Da. The protein possessed a 20-amino acid leader sequence. No significant similarity was found between Omp50 and porin protein sequences already determined. Moreover, the protein showed only weak sequence identity with the major outer-membrane protein (MOMP) of Campylobacter, correlating with the absence of antigenic cross-reactivity between these two proteins. Omp50 is expressed in C. jejuni and Campylobacter lari but not in Campylobacter coli. The gene, however, was detected in all three species by PCR. According to its conformation and functional properties, the protein would belong to the family of outer-membrane monomeric porins. PMID:11104668
Claudins reign: The claudin/EMP/PMP22/γ channel protein family in C. elegans.
Simske, Jeffrey S
2013-07-01
The claudin family of integral membrane proteins was identified as the major protein component of the tight junctions in all vertebrates. Since their identification, claudins, and their associated pfam00822 superfamily of proteins have been implicated in a wide variety of cellular processes. Claudin homologs have been identified in invertebrates as well, including Drosophila and C. elegans. Recent studies demonstrate that the C. elegans claudins, clc-1-clc- 5, and similar proteins in the greater PMP22/EMP/claudin/voltage-gated calcium channel γ subunit family, including nsy-4, and vab-9, while highly divergent at a sequence level from each other and from the vertebrate claudins, in many cases play roles similar to those traditionally assigned to their vertebrate homologs. These include regulating cell adhesion and passage of small molecules through the paracellular space, channel activity, protein aggregation, sensitivity to pore-forming toxins, intercellular signaling, cell fate specification and dynamic changes in cell morphology. Study of claudin superfamily proteins in C. elegans should continue to provide clues as to how claudin family protein function has been adapted to perform diverse functions at specialized cell-cell contacts in metazoans.
Zhou, Hang; Yang, Yang; Shen, Hong-Bin
2017-03-15
Protein subcellular localization prediction has been an important research topic in computational biology over the last decade. Various automatic methods have been proposed to predict locations for large scale protein datasets, where statistical machine learning algorithms are widely used for model construction. A key step in these predictors is encoding the amino acid sequences into feature vectors. Many studies have shown that features extracted from biological domains, such as gene ontology and functional domains, can be very useful for improving the prediction accuracy. However, domain knowledge usually results in redundant features and high-dimensional feature spaces, which may degenerate the performance of machine learning models. In this paper, we propose a new amino acid sequence-based human protein subcellular location prediction approach Hum-mPLoc 3.0, which covers 12 human subcellular localizations. The sequences are represented by multi-view complementary features, i.e. context vocabulary annotation-based gene ontology (GO) terms, peptide-based functional domains, and residue-based statistical features. To systematically reflect the structural hierarchy of the domain knowledge bases, we propose a novel feature representation protocol denoted as HCM (Hidden Correlation Modeling), which will create more compact and discriminative feature vectors by modeling the hidden correlations between annotation terms. Experimental results on four benchmark datasets show that HCM improves prediction accuracy by 5-11% and F 1 by 8-19% compared with conventional GO-based methods. A large-scale application of Hum-mPLoc 3.0 on the whole human proteome reveals proteins co-localization preferences in the cell. www.csbio.sjtu.edu.cn/bioinf/Hum-mPLoc3/. hbshen@sjtu.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Metagenomic ventures into outer sequence space.
Dutilh, Bas E
Sequencing DNA or RNA directly from the environment often results in many sequencing reads that have no homologs in the database. These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. There is a pressure on researchers to publish and move on, and the unknown sequences are often left for what they are, and conclusions drawn based on reads with annotated homologs. This can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crAssphage. The unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. However, it remains an open question, what is the actual size of biological sequence space? The de novo assembly of shotgun metagenomes is the most powerful tool to address this question.
Cysteine-containing peptide tag for site-specific conjugation of proteins
Backer, Marina V.; Backer, Joseph M.
2008-04-08
The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety bound to the targeting moiety; the biological conjugate having a covalent bond between the thiol group of SEQ ID NO:2 and a functional group in the binding moiety. The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety that comprises an adapter protein, the adapter protein having a thiol group; the biological conjugate having a disulfide bond between the thiol group of SEQ ID NO:2 and the thiol group of the adapter protein. The present invention is also directed to biological sequences employed in the above biological conjugates, as well as pharmaceutical preparations and methods using the above biological conjugates.
Cysteine-containing peptide tag for site-specific conjugation of proteins
Backer, Marina V.; Backer, Joseph M.
2010-10-05
The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety bound to the targeting moiety; the biological conjugate having a covalent bond between the thiol group of SEQ ID NO:2 and a functional group in the binding moiety. The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety that comprises an adapter protein, the adapter protein having a thiol group; the biological conjugate having a disulfide bond between the thiol group of SEQ ID NO:2 and the thiol group of the adapter protein. The present invention is also directed to biological sequences employed in the above biological conjugates, as well as pharmaceutical preparations and methods using the above biological conjugates.
Graph pyramids for protein function prediction
2015-01-01
Background Uncovering the hidden organizational characteristics and regularities among biological sequences is the key issue for detailed understanding of an underlying biological phenomenon. Thus pattern recognition from nucleic acid sequences is an important affair for protein function prediction. As proteins from the same family exhibit similar characteristics, homology based approaches predict protein functions via protein classification. But conventional classification approaches mostly rely on the global features by considering only strong protein similarity matches. This leads to significant loss of prediction accuracy. Methods Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers the local as well as the global features, by examining the interactions among 'weakly interacting proteins' in the PPS network and by using hierarchical graph analysis via the graph pyramid. Different underlying properties of the protein families are uncovered by operating the proposed graph based features at various pyramid levels. Results Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using graph pyramid helps to improve computational efficiency as well the protein classification accuracy. Quantitatively, among 14,086 test sequences, on an average the proposed method misclassified only 21.1 sequences whereas baseline BLAST score based global feature matching method misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. Thus it has achieved more than 96% protein classification accuracy using only 20% per class training data. PMID:26044522
Graph pyramids for protein function prediction.
Sandhan, Tushar; Yoo, Youngjun; Choi, Jin; Kim, Sun
2015-01-01
Uncovering the hidden organizational characteristics and regularities among biological sequences is the key issue for detailed understanding of an underlying biological phenomenon. Thus pattern recognition from nucleic acid sequences is an important affair for protein function prediction. As proteins from the same family exhibit similar characteristics, homology based approaches predict protein functions via protein classification. But conventional classification approaches mostly rely on the global features by considering only strong protein similarity matches. This leads to significant loss of prediction accuracy. Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers the local as well as the global features, by examining the interactions among 'weakly interacting proteins' in the PPS network and by using hierarchical graph analysis via the graph pyramid. Different underlying properties of the protein families are uncovered by operating the proposed graph based features at various pyramid levels. Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using graph pyramid helps to improve computational efficiency as well the protein classification accuracy. Quantitatively, among 14,086 test sequences, on an average the proposed method misclassified only 21.1 sequences whereas baseline BLAST score based global feature matching method misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. Thus it has achieved more than 96% protein classification accuracy using only 20% per class training data.
Protein Crystal Eco R1 Endonulease-DNA Complex
NASA Technical Reports Server (NTRS)
1998-01-01
Type II restriction enzymes, such as Eco R1 endonulease, present a unique advantage for the study of sequence-specific recognition because they leave a record of where they have been in the form of the cleaved ends of the DNA sites where they were bound. The differential behavior of a sequence -specific protein at sites of differing base sequence is the essence of the sequence-specificity; the core question is how do these proteins discriminate between different DNA sequences especially when the two sequences are very similar. Principal Investigator: Dan Carter/New Century Pharmaceuticals
Hao, Ge-Fei; Xu, Wei-Fang; Yang, Sheng-Gang; Yang, Guang-Fu
2015-01-01
Protein and peptide structure predictions are of paramount importance for understanding their functions, as well as the interactions with other molecules. However, the use of molecular simulation techniques to directly predict the peptide structure from the primary amino acid sequence is always hindered by the rough topology of the conformational space and the limited simulation time scale. We developed here a new strategy, named Multiple Simulated Annealing-Molecular Dynamics (MSA-MD) to identify the native states of a peptide and miniprotein. A cluster of near native structures could be obtained by using the MSA-MD method, which turned out to be significantly more efficient in reaching the native structure compared to continuous MD and conventional SA-MD simulation. PMID:26492886
Genomewide Function Conservation and Phylogeny in the Herpesviridae
Albà, M. Mar; Das, Rhiju; Orengo, Christine A.; Kellam, Paul
2001-01-01
The Herpesviridae are a large group of well-characterized double-stranded DNA viruses for which many complete genome sequences have been determined. We have extracted protein sequences from all predicted open reading frames of 19 herpesvirus genomes. Sequence comparison and protein sequence clustering methods have been used to construct herpesvirus protein homologous families. This resulted in 1692 proteins being clustered into 243 multiprotein families and 196 singleton proteins. Predicted functions were assigned to each homologous family based on genome annotation and published data and each family classified into seven broad functional groups. Phylogenetic profiles were constructed for each herpesvirus from the homologous protein families and used to determine conserved functions and genomewide phylogenetic trees. These trees agreed with molecular-sequence-derived trees and allowed greater insight into the phylogeny of ungulate and murine gammaherpesviruses. PMID:11156614
Determining protein function and interaction from genome analysis
Eisenberg, David; Marcotte, Edward M.; Thompson, Michael J.; Pellegrini, Matteo; Yeates, Todd O.
2004-08-03
A computational method system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B. Another method compares the genomic sequence of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method a combination of the above two methods is used to predict functional links.
Riffel, Philipp; Michaely, Henrik J; Morelli, John N; Paul, Dominik; Kannengiesser, Stephan; Schoenberg, Stefan O; Haneder, Stefan
2015-04-01
The purpose of this study was to evaluate the feasibility and technical quality of a zoomed three-dimensional (3D) turbo spin-echo (TSE) sampling perfection with application optimized contrasts using different flip-angle evolutions (SPACE) sequence of the lumbar spine. In this prospective feasibility study, nine volunteers underwent a 3-T magnetic resonance examination of the lumbar spine including 1) a conventional 3D T2-weighted (T2w) SPACE sequence with generalized autocalibrating partially parallel acquisition technique acceleration factor 2 and 2) a zoomed 3D T2w SPACE sequence with a reduced field of view (reduction factor 2). Images were evaluated with regard to image sharpness, signal homogeneity, and the presence of artifacts by two experienced radiologists. For quantitative analysis, signal-to-noise ratio (SNR) values were calculated. Image sharpness of anatomic structures was statistically significantly greater with zoomed SPACE (P < .0001), whereas the signal homogeneity was statistically significantly greater with conventional SPACE (cSPACE; P = .0003). There were no statistically significant differences in extent of artifacts. Acquisition times were 8:20 minutes for cSPACE and 6:30 minutes for zoomed SPACE. Readers 1 and 2 selected zSPACE as the preferred sequence in five of nine cases. In two of nine cases, both sequences were rated as equally preferred by both the readers. SNR values were statistically significantly greater with cSPACE. In comparison to a cSPACE sequences, zoomed SPACE imaging of the lumbar spine provides sharper images in conjunction with a 25% reduction in acquisition time. Copyright © 2015 AUR. Published by Elsevier Inc. All rights reserved.
An approach to large scale identification of non-obvious structural similarities between proteins
Cherkasov, Artem; Jones, Steven JM
2004-01-01
Background A new sequence independent bioinformatics approach allowing genome-wide search for proteins with similar three dimensional structures has been developed. By utilizing the numerical output of the sequence threading it establishes putative non-obvious structural similarities between proteins. When applied to the testing set of proteins with known three dimensional structures the developed approach was able to recognize structurally similar proteins with high accuracy. Results The method has been developed to identify pathogenic proteins with low sequence identity and high structural similarity to host analogues. Such protein structure relationships would be hypothesized to arise through convergent evolution or through ancient horizontal gene transfer events, now undetectable using current sequence alignment techniques. The pathogen proteins, which could mimic or interfere with host activities, would represent candidate virulence factors. The developed approach utilizes the numerical outputs from the sequence-structure threading. It identifies the potential structural similarity between a pair of proteins by correlating the threading scores of the corresponding two primary sequences against the library of the standard folds. This approach allowed up to 64% sensitivity and 99.9% specificity in distinguishing protein pairs with high structural similarity. Conclusion Preliminary results obtained by comparison of the genomes of Homo sapiens and several strains of Chlamydia trachomatis have demonstrated the potential usefulness of the method in the identification of bacterial proteins with known or potential roles in virulence. PMID:15147578
NASA Technical Reports Server (NTRS)
Funderburgh, J. L.; Funderburgh, M. L.; Brown, S. J.; Vergnes, J. P.; Hassell, J. R.; Mann, M. M.; Conrad, G. W.; Spooner, B. S. (Principal Investigator)
1993-01-01
Amino acid sequence from tryptic peptides of three different bovine corneal keratan sulfate proteoglycan (KSPG) core proteins (designated 37A, 37B, and 25) showed similarities to the sequence of a chicken KSPG core protein lumican. Bovine lumican cDNA was isolated from a bovine corneal expression library by screening with chicken lumican cDNA. The bovine cDNA codes for a 342-amino acid protein, M(r) 38,712, containing amino acid sequences identified in the 37B KSPG core protein. The bovine lumican is 68% identical to chicken lumican, with an 83% identity excluding the N-terminal 40 amino acids. Location of 6 cysteine and 4 consensus N-glycosylation sites in the bovine sequence were identical to those in chicken lumican. Bovine lumican had about 50% identity to bovine fibromodulin and 20% identity to bovine decorin and biglycan. About two-thirds of the lumican protein consists of a series of 10 amino acid leucine-rich repeats that occur in regions of calculated high beta-hydrophobic moment, suggesting that the leucine-rich repeats contribute to beta-sheet formation in these proteins. Sequences obtained from 37A and 25 core proteins were absent in bovine lumican, thus predicting a unique primary structure and separate mRNA for each of the three bovine KSPG core proteins.
NASA Astrophysics Data System (ADS)
Humpula, James F.; Ostrom, Peggy H.; Gandhi, Hasand; Strahler, John R.; Walker, Angela K.; Stafford, Thomas W.; Smith, James J.; Voorhies, Michael R.; George Corner, R.; Andrews, Phillip C.
2007-12-01
Ancient DNA sequences offer an extraordinary opportunity to unravel the evolutionary history of ancient organisms. Protein sequences offer another reservoir of genetic information that has recently become tractable through the application of mass spectrometric techniques. The extent to which ancient protein sequences resolve phylogenetic relationships, however, has not been explored. We determined the osteocalcin amino acid sequence from the bone of an extinct Camelid (21 ka, Camelops hesternus) excavated from Isleta Cave, New Mexico and three bones of extant camelids: bactrian camel ( Camelus bactrianus); dromedary camel ( Camelus dromedarius) and guanaco ( Llama guanacoe) for a diagenetic and phylogenetic assessment. There was no difference in sequence among the four taxa. Structural attributes observed in both modern and ancient osteocalcin include a post-translation modification, Hyp 9, deamidation of Gln 35 and Gln 39, and oxidation of Met 36. Carbamylation of the N-terminus in ancient osteocalcin may result in blockage and explain previous difficulties in sequencing ancient proteins via Edman degradation. A phylogenetic analysis using osteocalcin sequences of 25 vertebrate taxa was conducted to explore osteocalcin protein evolution and the utility of osteocalcin sequences for delineating phylogenetic relationships. The maximum likelihood tree closely reflected generally recognized taxonomic relationships. For example, maximum likelihood analysis recovered rodents, birds and, within hominins, the Homo-Pan-Gorilla trichotomy. Within Artiodactyla, character state analysis showed that a substitution of Pro 4 for His 4 defines the Capra-Ovis clade within Artiodactyla. Homoplasy in our analysis indicated that osteocalcin evolution is not a perfect indicator of species evolution. Limited sequence availability prevented assigning functional significance to sequence changes. Our preliminary analysis of osteocalcin evolution represents an initial step towards a complete character analysis aimed at determining the evolutionary history of this functionally significant protein. We emphasize that ancient protein sequencing and phylogenetic analyses using amino acid sequences must pay close attention to post-translational modifications, amino acid substitutions due to diagenetic alteration and the impacts of isobaric amino acids on mass shifts and sequence alignments.
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum
DOE Office of Scientific and Technical Information (OSTI.GOV)
VanBuren, Robert; Bryant, Doug; Edger, Patrick P.
Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly1. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE). Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetiummore » genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a ‘near-complete’ draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. As a result, the Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.« less
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum
VanBuren, Robert; Bryant, Doug; Edger, Patrick P.; ...
2015-11-11
Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly1. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE). Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetiummore » genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a ‘near-complete’ draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. As a result, the Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.« less
Vandenbol, M; Jauniaux, J C; Grenson, M
1989-11-15
The complete nucleotide (nt) sequence of the PUT4 gene, whose product is required for high-affinity proline active transport in the yeast Saccharomyces cerevisiae, is presented. The sequence contains a single long open reading frame of 1881 nt, encoding a polypeptide with a calculated Mr of 68,795. The predicted protein is strongly hydrophobic and exhibits six potential glycosylation sites. Its hydropathy profile suggests the presence of twelve membrane-spanning regions flanked by hydrophilic N- and C-terminal domains. The N terminus does not resemble signal sequences found in secreted proteins. These features are characteristic of integral membrane proteins catalyzing translocation of ligands across cellular membranes. Protein sequence comparisons indicate strong resemblance to the arginine and histidine permeases of S. cerevisiae, but no marked sequence similarity to the proline permease of Escherichia coli or to other known prokaryotic or eukaryotic transport proteins. The strong similarity between the three yeast amino acid permeases suggests a common ancestor for the three proteins.
Wittmann-Liebold, B; Geissler, A W; Lin, A; Wool, I G
1979-01-01
The sequence of the amino-terminal region of eleven rat liver ribosomal proteins--S4, S6, S8, L6, L7a, L18, L27, L30, L37a, and L39--was determined. The analysis confirmed the homogeneity of the proteins and suggests that they are unique, since no extensive common sequences were found. The N-terminal regions of the rat liver proteins were compared with amino acid sequences in Saccharomyces cerevisiae and in Escherichia coli ribosomal proteins. It seems likely that the proteins L37 from rat liver and Y55 from yeast ribosomes are homologous. It is possible that rat liver L7a or L37a or both are related to S cerevisiae Y44, although the similar sequences are at the amino-terminus of the rat liver proteins and in an internal region of Y44. A number of similarities in the sequences of rat liver and E coli ribosomal proteins have been found; however, it is not yet possible to say whether they connote a common ancestry.
Anisimov, Andrey P; Panfertsev, Evgeniy A; Svetoch, Tat'yana E; Dentovskaya, Svetlana V
2007-01-01
Sequencing of lcrV genes and comparison of the deduced amino acid sequences from ten Y. pestis strains belonging mostly to the group of atypical rhamnose-positive isolates (non-pestis subspecies or pestoides group) showed that the LcrV proteins analyzed could be classified into five sequence types. This classification was based on major amino acid polymorphisms among LcrV proteins in the four "hot points" of the protein sequences. Some additional minor polymorphisms were found throughout these sequence types. The "hot points" corresponded to amino acids 18 (Lys --> Asn), 72 (Lys --> Arg), 273 (Cys --> Ser), and 324-326 (Ser-Gly-Lys --> Arg) in the LcrV sequence of the reference Y. pestis strain CO92. One possible explanation for polymorphism in amino acid sequences of LcrV among different strains is that strain-specific variation resulted from adaptation of the plague pathogen to different rodent and lagomorph hosts.
Robustness of Reconstructed Ancestral Protein Functions to Statistical Uncertainty.
Eick, Geeta N; Bridgham, Jamie T; Anderson, Douglas P; Harms, Michael J; Thornton, Joseph W
2017-02-01
Hypotheses about the functions of ancient proteins and the effects of historical mutations on them are often tested using ancestral protein reconstruction (APR)-phylogenetic inference of ancestral sequences followed by synthesis and experimental characterization. Usually, some sequence sites are ambiguously reconstructed, with two or more statistically plausible states. The extent to which the inferred functions and mutational effects are robust to uncertainty about the ancestral sequence has not been studied systematically. To address this issue, we reconstructed ancestral proteins in three domain families that have different functions, architectures, and degrees of uncertainty; we then experimentally characterized the functional robustness of these proteins when uncertainty was incorporated using several approaches, including sampling amino acid states from the posterior distribution at each site and incorporating the alternative amino acid state at every ambiguous site in the sequence into a single "worst plausible case" protein. In every case, qualitative conclusions about the ancestral proteins' functions and the effects of key historical mutations were robust to sequence uncertainty, with similar functions observed even when scores of alternate amino acids were incorporated. There was some variation in quantitative descriptors of function among plausible sequences, suggesting that experimentally characterizing robustness is particularly important when quantitative estimates of ancient biochemical parameters are desired. The worst plausible case method appears to provide an efficient strategy for characterizing the functional robustness of ancestral proteins to large amounts of sequence uncertainty. Sampling from the posterior distribution sometimes produced artifactually nonfunctional proteins for sequences reconstructed with substantial ambiguity. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Liu, Bin; Wang, Xiaolong; Lin, Lei; Dong, Qiwen; Wang, Xuan
2008-12-01
Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences. In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments outputted by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods. The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, Top-n-gram is a good building block of the protein sequences and can be widely used in many tasks of the computational biology, such as the sequence alignment, the prediction of domain boundary, the designation of knowledge-based potentials and the prediction of protein binding sites.
Complete genome sequences of two novel European clade bovine foamy viruses from Germany and Poland.
Hechler, Torsten; Materniak, Magdalena; Kehl, Timo; Kuzmak, Jacek; Löchelt, Martin
2012-10-01
Bovine foamy virus (BFV), or bovine spumaretrovirus, is an infectious agent of cattle with no obvious disease association but high prevalence in its host. Here, we report two complete BFV sequences, BFV-Riems, isolated in 1978 in East Germany, and BFV100, isolated in 2005 in Poland. Both new BFV isolates share the overall genetic makeup of other foamy viruses (FV). Although isolated almost 25 years apart and propagated in either bovine (BFV-Riems) or nonbovine (BFV100) cells, both viruses are highly related, forming the European BFV clade. Despite clear differences, BFV-Riems and BFV100 are still very similar to BFV isolates from China and the United States, comprising the non-European BFV clade. The genomic sequences presented here confirm the concept of high sequence conservation across most of the FV genome. Analyses of cell culture-derived genomes reveal that proviral DNA may specifically lack introns in the env-bel coding region. The spacing of the splice sites in this region suggests that BFV has developed a novel mode to express a secretory but nonfunctional Env protein.
Complete Genome Sequences of Two Novel European Clade Bovine Foamy Viruses from Germany and Poland
Hechler, Torsten; Materniak, Magdalena; Kehl, Timo; Kuzmak, Jacek
2012-01-01
Bovine foamy virus (BFV), or bovine spumaretrovirus, is an infectious agent of cattle with no obvious disease association but high prevalence in its host. Here, we report two complete BFV sequences, BFV-Riems, isolated in 1978 in East Germany, and BFV100, isolated in 2005 in Poland. Both new BFV isolates share the overall genetic makeup of other foamy viruses (FV). Although isolated almost 25 years apart and propagated in either bovine (BFV-Riems) or nonbovine (BFV100) cells, both viruses are highly related, forming the European BFV clade. Despite clear differences, BFV-Riems and BFV100 are still very similar to BFV isolates from China and the United States, comprising the non-European BFV clade. The genomic sequences presented here confirm the concept of high sequence conservation across most of the FV genome. Analyses of cell culture-derived genomes reveal that proviral DNA may specifically lack introns in the env-bel coding region. The spacing of the splice sites in this region suggests that BFV has developed a novel mode to express a secretory but nonfunctional Env protein. PMID:22966195
Kim, Bo-Mi; Kang, Seunghyun; Ahn, Do-Hwan; Kim, Jin-Hyoung; Ahn, Inhye; Lee, Chi-Woo; Cho, Joo-Lae; Min, Gi-Sik; Park, Hyun
2017-01-01
Bathynellacea (Crustacea, Syncarida, Parabathynellidae) are subterranean aquatic crustaceans that typically inhabit freshwater interstitial spaces (e.g., groundwater) and are occasionally found in caves and even hot springs. In this study, we sequenced the whole transcriptome of Allobathynella bangokensis using RNA-seq. De novo sequence assembly produced 74,866 contigs including 28,934 BLAST hits. Overall, the gene sequences were most similar to those of the waterflea Daphnia pulex. In the A. bangokensis transcriptome, no opsin or related sequences were identified, and no contig aligned to the crustacean visual opsins and non-visual opsins (i.e. arthropsins, peropsins, and melaopsins), suggesting potential regressive adaptation to the dark environment. However, A. bangokensis expressed conserved gene family sets, such as heat shock proteins and those related to key innate immunity pathways and antioxidant defense systems, at the transcriptional level, suggesting that this species has evolved adaptations involving molecular mechanisms of homeostasis. The transcriptomic information of A. bangokensis will be useful for investigating molecular adaptations and response mechanisms to subterranean environmental conditions. PMID:28107438
Genome sequence of the phylogenetically isolated spirochete Leptonema illini type strain (3055T)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huntemann, Marcel; Stackebrandt, Erko; Held, Brittany
2013-01-01
Leptonema illini Hovind-Hougen 1979 is the type species of the genus Leptonema, family Leptospiraceae, phylum Spirochaetes. Organisms of this family have a Gram-negative-like cell enve- lope consisting of a cytoplasmic membrane and an outer membrane. The peptidoglycan layer is as- sociated with the cytoplasmic rather than the outer membrane. The two flagella of members of Leptospiraceae extend from the cytoplasmic membrane at the ends of the bacteria into the periplasmic space and are necessary for their motility. Here we describe the features of the L. illini type strain, together with the complete genome sequence, and annotation. This is the firstmore » genome sequence (finished at the level of Improved High Quality Draft) to be reported from of a member of the genus Leptonema and a representative of the third genus of the family Leptospiraceae for which complete or draft genome sequences are now available. The three scaffolds of the 4,522,760 bp draft genome sequence reported here, and its 4,230 protein-coding and 47 RNA genes are part of the Ge- nomic Encyclopedia of Bacteria and Archaea project.« less
Plant Seeds as Model Vectors for the Transfer of Life Through Space
NASA Astrophysics Data System (ADS)
Tepfer, David; Leach, Sydney
2006-12-01
We consider plant seeds as terrestrial models for a vectored life form that could protect biological information in space. Seeds consist of maternal tissue surrounding and protecting an embryo. Some seeds resist deleterious conditions found in space: ultra low vacuum, extreme temperatures and radiation, including intense UV light. In a receptive environment, seeds could liberate a viable embryo, viable higher cells or a viable free-living organism (an endosymbiont or endophyte). Even if viability is lost, seeds still contain functional macro and small molecules (DNA, RNA, proteins, amino acids, lipids, etc.) that could provide the chemical basis for starting or modifying life. The possible release of endophytes or endosymbionts from a seed-like space traveler suggests that multiple domains of life, defined in DNA sequence phylogenies, could be disseminated simultaneously from Earth. We consider the possibility of exospermia, the outward transfer of life, as well as introspermia, the inward transfer of life-both as a contemporary and ancient events.
Host Cell Virus Entry Mediated by Australian Bat Lyssavirus Envelope G glycoprotein
2013-10-24
39 Figure 7. Comparison of the amino acid sequences of Saccolaimus and Pteropus ABLV G mature protein... sequence analysis revealed that the PCR products were identical. Sequence comparisons of the ABLV N and other lyssavirus N proteins showed that ABLV...Saccolaimus flaviventris) (129). Nucleoprotein sequence comparisons revealed that the Saccolaimus N protein shared 96% amino acid homology with the Pteropus
Yerrapragada, Shaila; Shukla, Animesh; Hallsworth-Pepin, Kymberlie; Choi, Kwangmin; Wollam, Aye; Clifton, Sandra; Qin, Xiang; Muzny, Donna; Raghuraman, Sriram; Ashki, Haleh; Uzman, Akif; Highlander, Sarah K.; Fryszczyn, Bartlomiej G.; Fox, George E.; Tirumalai, Madhan R.; Liu, Yamei; Kim, Sun
2015-01-01
Tolypothrix sp. PCC 7601 is a freshwater filamentous cyanobacterium with complex responses to environmental conditions. Here, we present its 9.96-Mbp draft genome sequence, containing 10,065 putative protein-coding sequences, including 305 predicted two-component system proteins and 27 putative phytochrome-class photoreceptors, the most such proteins in any sequenced genome. PMID:25953173
Visual exploration of parameter influence on phylogenetic trees.
Hess, Martin; Bremm, Sebastian; Weissgraeber, Stephanie; Hamacher, Kay; Goesele, Michael; Wiemeyer, Josef; von Landesberger, Tatiana
2014-01-01
Evolutionary relationships between organisms are frequently derived as phylogenetic trees inferred from multiple sequence alignments (MSAs). The MSA parameter space is exponentially large, so tens of thousands of potential trees can emerge for each dataset. A proposed visual-analytics approach can reveal the parameters' impact on the trees. Given input trees created with different parameter settings, it hierarchically clusters the trees according to their structural similarity. The most important clusters of similar trees are shown together with their parameters. This view offers interactive parameter exploration and automatic identification of relevant parameters. Biologists applied this approach to real data of 16S ribosomal RNA and protein sequences of ion channels. It revealed which parameters affected the tree structures. This led to a more reliable selection of the best trees.
Ghosh, Pritha; Mathew, Oommen K; Sowdhamini, Ramanathan
2016-10-07
RNA-binding proteins (RBPs) interact with their cognate RNA(s) to form large biomolecular assemblies. They are versatile in their functionality and are involved in a myriad of processes inside the cell. RBPs with similar structural features and common biological functions are grouped together into families and superfamilies. It will be useful to obtain an early understanding and association of RNA-binding property of sequences of gene products. Here, we report a web server, RStrucFam, to predict the structure, type of cognate RNA(s) and function(s) of proteins, where possible, from mere sequence information. The web server employs Hidden Markov Model scan (hmmscan) to enable association to a back-end database of structural and sequence families. The database (HMMRBP) comprises of 437 HMMs of RBP families of known structure that have been generated using structure-based sequence alignments and 746 sequence-centric RBP family HMMs. The input protein sequence is associated with structural or sequence domain families, if structure or sequence signatures exist. In case of association of the protein with a family of known structures, output features like, multiple structure-based sequence alignment (MSSA) of the query with all others members of that family is provided. Further, cognate RNA partner(s) for that protein, Gene Ontology (GO) annotations, if any and a homology model of the protein can be obtained. The users can also browse through the database for details pertaining to each family, protein or RNA and their related information based on keyword search or RNA motif search. RStrucFam is a web server that exploits structurally conserved features of RBPs, derived from known family members and imprinted in mathematical profiles, to predict putative RBPs from sequence information. Proteins that fail to associate with such structure-centric families are further queried against the sequence-centric RBP family HMMs in the HMMRBP database. Further, all other essential information pertaining to an RBP, like overall function annotations, are provided. The web server can be accessed at the following link: http://caps.ncbs.res.in/rstrucfam .
Crone, T M; Kanugula, S; Pegg, A E
1995-08-01
Although the human O6-alkylguanine-DNA alkyltransferase (AGT) is very sensitive to inactivation by O6-benzylguanine (BG) or 2,4-diamino-6-benzyloxy-5-nitrosopyrimidine (5-nitroso-BP), the equivalent protein formed by the carboxyl terminal domain of the product of the Escherichia coli ada gene (Ada-C) is unaffected by these inhibitors. This difference is remarkable in view of the substantial similarity between these proteins (33% of the residues in the common sequence are identical) and is potentially very important since these inhibitors are under development as drugs to enhance the anti-tumor activity of alkylating agents. In order to understand the reason for the resistance of the Ada-C protein, we have made chimeras between Ada-C and AGT sequences and mutations in the Ada-C protein, expressed the altered proteins in an E. coli strain lacking endogenous alkyltransferase activity and tested the inactivation of the resulting proteins by BG or 5-nitroso-BP. Chimeric alkyltransferase proteins were made in which the residues on the amino side of the cysteine acceptor site came from Ada-C and the residues on the carboxyl side came from AGT and vice versa but these did not show sensitivity to BG suggesting that resistance is produced by residues in both segments of the protein. Analysis of the Ada-C mutant proteins revealed two sites for mutations that confer sensitivity to these inhibitors. One of these was tryptophan-336 and the other was residues lysine-314 and alanine-316. Thus, when the combined mutations of A316P/W336A were made in the Ada-C sequence, the protein was sensitive to inactivation by BG. This A316P/W336A mutant protein was even more sensitive to 5-nitroso-BP and the mutant proteins W336A, K314P/A316P and A316P could also be inhibited by this drug (in decreasing order of sensitivity) although the control Ada-C and a mutant R335S were not inhibited. These results provide strong support for the hypothesis that the resistance of the Ada-C alkyl-transferase is due to a steric effect limiting access to the active site. Insertion of proline residues at positions 314 and 316 and removal of the bulky tryptophan residue at position 336 increases the space available at the active site and permits these inhibitors to be effective.
On the relationship between residue structural environment and sequence conservation in proteins.
Liu, Jen-Wei; Lin, Jau-Ji; Cheng, Chih-Wen; Lin, Yu-Feng; Hwang, Jenn-Kang; Huang, Tsun-Tsao
2017-09-01
Residues that are crucial to protein function or structure are usually evolutionarily conserved. To identify the important residues in protein, sequence conservation is estimated, and current methods rely upon the unbiased collection of homologous sequences. Surprisingly, our previous studies have shown that the sequence conservation is closely correlated with the weighted contact number (WCN), a measure of packing density for residue's structural environment, calculated only based on the C α positions of a protein structure. Moreover, studies have shown that sequence conservation is correlated with environment-related structural properties calculated based on different protein substructures, such as a protein's all atoms, backbone atoms, side-chain atoms, or side-chain centroid. To know whether the C α atomic positions are adequate to show the relationship between residue environment and sequence conservation or not, here we compared C α atoms with other substructures in their contributions to the sequence conservation. Our results show that C α positions are substantially equivalent to the other substructures in calculations of various measures of residue environment. As a result, the overlapping contributions between C α atoms and the other substructures are high, yielding similar structure-conservation relationship. Take the WCN as an example, the average overlapping contribution to sequence conservation is 87% between C α and all-atom substructures. These results indicate that only C α atoms of a protein structure could reflect sequence conservation at the residue level. © 2017 Wiley Periodicals, Inc.
Accelerating Information Retrieval from Profile Hidden Markov Model Databases.
Tamimi, Ahmad; Ashhab, Yaqoub; Tamimi, Hashem
2016-01-01
Profile Hidden Markov Model (Profile-HMM) is an efficient statistical approach to represent protein families. Currently, several databases maintain valuable protein sequence information as profile-HMMs. There is an increasing interest to improve the efficiency of searching Profile-HMM databases to detect sequence-profile or profile-profile homology. However, most efforts to enhance searching efficiency have been focusing on improving the alignment algorithms. Although the performance of these algorithms is fairly acceptable, the growing size of these databases, as well as the increasing demand for using batch query searching approach, are strong motivations that call for further enhancement of information retrieval from profile-HMM databases. This work presents a heuristic method to accelerate the current profile-HMM homology searching approaches. The method works by cluster-based remodeling of the database to reduce the search space, rather than focusing on the alignment algorithms. Using different clustering techniques, 4284 TIGRFAMs profiles were clustered based on their similarities. A representative for each cluster was assigned. To enhance sensitivity, we proposed an extended step that allows overlapping among clusters. A validation benchmark of 6000 randomly selected protein sequences was used to query the clustered profiles. To evaluate the efficiency of our approach, speed and recall values were measured and compared with the sequential search approach. Using hierarchical, k-means, and connected component clustering techniques followed by the extended overlapping step, we obtained an average reduction in time of 41%, and an average recall of 96%. Our results demonstrate that representation of profile-HMMs using a clustering-based approach can significantly accelerate data retrieval from profile-HMM databases.
Ali, Safdar; Majid, Abdul; Javed, Syed Gibran; Sattar, Mohsin
2016-06-01
Early prediction of breast cancer is important for effective treatment and survival. We developed an effective Cost-Sensitive Classifier with GentleBoost Ensemble (Can-CSC-GBE) for the classification of breast cancer using protein amino acid features. In this work, first, discriminant information of the protein sequences related to breast tissue is extracted. Then, the physicochemical properties hydrophobicity and hydrophilicity of amino acids are employed to generate molecule descriptors in different feature spaces. For comparison, we obtained results by combining Cost-Sensitive learning with conventional ensemble of AdaBoostM1 and Bagging. The proposed Can-CSC-GBE system has effectively reduced the misclassification costs and thereby improved the overall classification performance. Our novel approach has highlighted promising results as compared to the state-of-the-art ensemble approaches. Copyright © 2016 Elsevier Ltd. All rights reserved.
Cuff, Alison L.; Sillitoe, Ian; Lewis, Tony; Clegg, Andrew B.; Rentzsch, Robert; Furnham, Nicholas; Pellegrini-Calace, Marialuisa; Jones, David; Thornton, Janet; Orengo, Christine A.
2011-01-01
CATH version 3.3 (class, architecture, topology, homology) contains 128 688 domains, 2386 homologous superfamilies and 1233 fold groups, and reflects a major focus on classifying structural genomics (SG) structures and transmembrane proteins, both of which are likely to add structural novelty to the database and therefore increase the coverage of protein fold space within CATH. For CATH version 3.4 we have significantly improved the presentation of sequence information and associated functional information for CATH superfamilies. The CATH superfamily pages now reflect both the functional and structural diversity within the superfamily and include structural alignments of close and distant relatives within the superfamily, annotated with functional information and details of conserved residues. A significantly more efficient search function for CATH has been established by implementing the search server Solr (http://lucene.apache.org/solr/). The CATH v3.4 webpages have been built using the Catalyst web framework. PMID:21097779
Protein Sequencing with Tandem Mass Spectrometry
NASA Astrophysics Data System (ADS)
Ziady, Assem G.; Kinter, Michael
The recent introduction of electrospray ionization techniques that are suitable for peptides and whole proteins has allowed for the design of mass spectrometric protocols that provide accurate sequence information for proteins. The advantages gained by these approaches over traditional Edman Degradation sequencing include faster analysis and femtomole, sometimes attomole, sensitivity. The ability to efficiently identify proteins has allowed investigators to conduct studies on their differential expression or modification in response to various treatments or disease states. In this chapter, we discuss the use of electrospray tandem mass spectrometry, a technique whereby protein-derived peptides are subjected to fragmentation in the gas phase, revealing sequence information for the protein. This powerful technique has been instrumental for the study of proteins and markers associated with various disorders, including heart disease, cancer, and cystic fibrosis. We use the study of protein expression in cystic fibrosis as an example.
Vector Development for the Expression of Foreign Proteins in the Vaccine Strain Brucella abortus S19
Comerci, Diego J.; Pollevick, Guido D.; Vigliocco, Ana M.; Frasch, Alberto C. C.; Ugalde, Rodolfo A.
1998-01-01
A vector for the expression of foreign antigens in the vaccine strain Brucella abortus S19 was developed by using a DNA fragment containing the regulatory sequences and the signal peptide of the Brucella bcsp31 gene. This fragment was cloned in broad-host-range plasmid pBBR4MCS, resulting in plasmid pBEV. As a reporter protein, a repetitive antigen of Trypanosoma cruzi was used. The recombinant fusion protein is stably expressed and secreted into the Brucella periplasmic space, inducing a good antibody response against the T. cruzi antigen. The expression of the repetitive antigen in Brucella neither altered its growth pattern nor generated a toxic or lethal effect during experimental infection. The application of this strategy for the generation of live recombinant vaccines and the tagging of B. abortus S19 vaccine is discussed. This is the first time that a recombinant protein has been expressed in the periplasm of brucellae. PMID:9673273
Chang, Shaojie; Song, Xiaomin; Yan, Ming; Zhou, Zhaocai; Wu, Fang; Gong, Weimin
2004-01-01
The proteins Spe31 and Spe32, named after their respective molecular weights of about 31 and 32 kDa, were purified simultaneously from the seeds of Pachyrrhizus erosus. They cannot be separated from each other by column chromatography. N-terminal sequence analysis indicated that they belonged to the papain family of cysteine proteases. An in-gel activity assay revealed that Spe31 possesses proteolytic activity while Spe32 only displays very weak activity for protein degradation. Both of them are glycoproteins as detected by the periodic acid and Schiff's reagent method. Crystals were obtained from the protein mixture by the hanging-drop vapour-diffusion method; they diffracted to a resolution of 2.61 A on an in-house X-ray source. The crystals belong to space group P4(1(3))2(1)2, with unit-cell parameters a = b = 61.96, c = 145.61 A. Gel electrophoresis under non-denaturing conditions showed that the protein crystallized was Spe31.
Programmable polyproteams built using twin peptide superglues
Veggiani, Gianluca; Nakamura, Tomohiko; Brenner, Michael D.; Yan, Jun; Robinson, Carol V.; Howarth, Mark
2016-01-01
Programmed connection of amino acids or nucleotides into chains introduced a revolution in control of biological function. Reacting proteins together is more complex because of the number of reactive groups and delicate stability. Here we achieved sequence-programmed irreversible connection of protein units, forming polyprotein teams by sequential amidation and transamidation. SpyTag peptide is engineered to spontaneously form an isopeptide bond with SpyCatcher protein. By engineering the adhesin RrgA from Streptococcus pneumoniae, we developed the peptide SnoopTag, which formed a spontaneous isopeptide bond to its protein partner SnoopCatcher with >99% yield and no cross-reaction to SpyTag/SpyCatcher. Solid-phase attachment followed by sequential SpyTag or SnoopTag reaction between building-blocks enabled iterative extension. Linear, branched, and combinatorial polyproteins were synthesized, identifying optimal combinations of ligands against death receptors and growth factor receptors for cancer cell death signal activation. This simple and modular route to programmable “polyproteams” should enable exploration of a new area of biological space. PMID:26787909
Programmable polyproteams built using twin peptide superglues.
Veggiani, Gianluca; Nakamura, Tomohiko; Brenner, Michael D; Gayet, Raphaël V; Yan, Jun; Robinson, Carol V; Howarth, Mark
2016-02-02
Programmed connection of amino acids or nucleotides into chains introduced a revolution in control of biological function. Reacting proteins together is more complex because of the number of reactive groups and delicate stability. Here we achieved sequence-programmed irreversible connection of protein units, forming polyprotein teams by sequential amidation and transamidation. SpyTag peptide is engineered to spontaneously form an isopeptide bond with SpyCatcher protein. By engineering the adhesin RrgA from Streptococcus pneumoniae, we developed the peptide SnoopTag, which formed a spontaneous isopeptide bond to its protein partner SnoopCatcher with >99% yield and no cross-reaction to SpyTag/SpyCatcher. Solid-phase attachment followed by sequential SpyTag or SnoopTag reaction between building-blocks enabled iterative extension. Linear, branched, and combinatorial polyproteins were synthesized, identifying optimal combinations of ligands against death receptors and growth factor receptors for cancer cell death signal activation. This simple and modular route to programmable "polyproteams" should enable exploration of a new area of biological space.
Bershtein, Shimon; Serohijos, Adrian W.R.; Shakhnovich, Eugene I.
2016-01-01
Bridging the gap between the molecular properties of proteins and organismal/population fitness is essential for understanding evolutionary processes. This task requires the integration of the several physical scales of biological organization, each defined by a distinct set of mechanisms and constraints, into a single unifying model. The molecular scale is dominated by the constraints imposed by the physico-chemical properties of proteins and their substrates, which give rise to trade-offs and epistatic (non-additive) effects of mutations. At the systems scale, biological networks modulate protein expression and can either buffer or enhance the fitness effects of mutations. The population scale is influenced by the mutational input, selection regimes, and stochastic changes affecting the size and structure of populations, which eventually determine the evolutionary fate of mutations. Here, we summarize the recent advances in theory, computer simulations, and experiments that advance our understanding of the links between various physical scales in biology. PMID:27810574
Petty, Tom J.; Nishimura, Taisuke; Emamzadah, Soheila; Gabus, Caroline; Paszkowski, Jerzy; Halazonetis, Thanos D.; Thore, Stéphane
2010-01-01
Of the known epigenetic control regulators found in plants, the Morpheus’ molecule 1 (MOM1) protein is atypical in that the deletion of MOM1 does not affect the level of epigenetic marks controlling the transcriptional status of the genome. A short 197-amino-acid fragment of the MOM1 protein sequence can complement MOM1 deletion when coupled to a nuclear localization signal, suggesting that this region contains a functional domain that compensates for the loss of the full-length protein. Numerous constructs centred on the highly conserved MOM1 motif 2 (CMM2) present in these 197 residues have been generated and expressed in Escherichia coli. Following purification and crystallization screening, diamond-shaped single crystals were obtained that diffracted to ∼3.2 Å resolution. They belonged to the trigonal space group P3121 (or P3221), with unit-cell parameters a = 85.64, c = 292.74 Å. Structure determination is ongoing. PMID:20693667
Petty, Tom J; Nishimura, Taisuke; Emamzadah, Soheila; Gabus, Caroline; Paszkowski, Jerzy; Halazonetis, Thanos D; Thore, Stéphane
2010-08-01
Of the known epigenetic control regulators found in plants, the Morpheus' molecule 1 (MOM1) protein is atypical in that the deletion of MOM1 does not affect the level of epigenetic marks controlling the transcriptional status of the genome. A short 197-amino-acid fragment of the MOM1 protein sequence can complement MOM1 deletion when coupled to a nuclear localization signal, suggesting that this region contains a functional domain that compensates for the loss of the full-length protein. Numerous constructs centred on the highly conserved MOM1 motif 2 (CMM2) present in these 197 residues have been generated and expressed in Escherichia coli. Following purification and crystallization screening, diamond-shaped single crystals were obtained that diffracted to approximately 3.2 A resolution. They belonged to the trigonal space group P3(1)21 (or P3(2)21), with unit-cell parameters a=85.64, c=292.74 A. Structure determination is ongoing.
Bershtein, Shimon; Serohijos, Adrian Wr; Shakhnovich, Eugene I
2017-02-01
Bridging the gap between the molecular properties of proteins and organismal/population fitness is essential for understanding evolutionary processes. This task requires the integration of the several physical scales of biological organization, each defined by a distinct set of mechanisms and constraints, into a single unifying model. The molecular scale is dominated by the constraints imposed by the physico-chemical properties of proteins and their substrates, which give rise to trade-offs and epistatic (non-additive) effects of mutations. At the systems scale, biological networks modulate protein expression and can either buffer or enhance the fitness effects of mutations. The population scale is influenced by the mutational input, selection regimes, and stochastic changes affecting the size and structure of populations, which eventually determine the evolutionary fate of mutations. Here, we summarize the recent advances in theory, computer simulations, and experiments that advance our understanding of the links between various physical scales in biology. Copyright © 2016 Elsevier Ltd. All rights reserved.
Overvoorde, P J; Chao, W S; Grimes, H D
1997-06-20
Photoaffinity labeling of a soybean cotyledon membrane fraction identified a sucrose-binding protein (SBP). Subsequent studies have shown that the SBP is a unique plasma membrane protein that mediates the linear uptake of sucrose in the presence of up to 30 mM external sucrose when ectopically expressed in yeast. Analysis of the SBP-deduced amino acid sequence indicates it lacks sequence similarity with other known transport proteins. Data presented here, however, indicate that the SBP shares significant sequence and structural homology with the vicilin-like seed storage proteins that organize into homotrimers. These similarities include a repeated sequence that forms the basis of the reiterated domain structure characteristic of the vicilin-like protein family. In addition, analytical ultracentrifugation and nonreducing SDS-polyacrylamide gel electrophoresis demonstrate that the SBP appears to be organized into oligomeric complexes with a Mr indicative of the existence of SBP homotrimers and homodimers. The structural similarity shared by the SBP and vicilin-like proteins provides a novel framework to explore the mechanistic basis of SBP-mediated sucrose uptake. Expression of the maize Glb protein (a vicilin-like protein closely related to the SBP) in yeast demonstrates that a closely related vicilin-like protein is unable to mediate sucrose uptake. Thus, despite sequence and structural similarities shared by the SBP and the vicilin-like protein family, the SBP is functionally divergent from other members of this group.
Metagenome assembly through clustering of next-generation sequencing data using protein sequences.
Sim, Mikang; Kim, Jaebum
2015-02-01
The study of environmental microbial communities, called metagenomics, has gained a lot of attention because of the recent advances in next-generation sequencing (NGS) technologies. Microbes play a critical role in changing their environments, and the mode of their effect can be solved by investigating metagenomes. However, the difficulty of metagenomes, such as the combination of multiple microbes and different species abundance, makes metagenome assembly tasks more challenging. In this paper, we developed a new metagenome assembly method by utilizing protein sequences, in addition to the NGS read sequences. Our method (i) builds read clusters by using mapping information against available protein sequences, and (ii) creates contig sequences by finding consensus sequences through probabilistic choices from the read clusters. By using simulated NGS read sequences from real microbial genome sequences, we evaluated our method in comparison with four existing assembly programs. We found that our method could generate relatively long and accurate metagenome assemblies, indicating that the idea of using protein sequences, as a guide for the assembly, is promising. Copyright © 2015 Elsevier B.V. All rights reserved.
Urra, Javier; Sandoval, Moisés; Cornejo, Isabel; Barros, L Felipe; Sepúlveda, Francisco V; Cid, L Pablo
2008-10-01
Extracellular pH, especially in relatively inaccessible microdomains between cells, affects transport membrane protein activity and might have an intercellular signaling role. We have developed a genetically encoded extracellular pH sensor capable of detecting pH changes in basolateral spaces of epithelial cells. It consists of a chimerical membrane protein displaying concatenated enhanced variants of cyan fluorescence protein (ECFP) and yellow fluorescence protein (EYFP) at the external aspect of the cell surface. The construct, termed pHCECSensor01, was targeted to basolateral membranes of Madin-Darby canine kidney (MDCK) cells by means of a sequence derived from the aquaporin AQP4. The fusion of pH-sensitive EYFP with pH-insensitive ECFP allows ratiometric pH measurements. The titration curve of pHCECSensor01 in vivo had a pK (a) value of 6.5 +/- 0.04. Only minor effects of extracellular chloride on pHCECSensor01 were observed around the physiological concentrations of this anion. In MDCK cells, the sensor was able to detect changes in pH secondary to H(+) efflux into the basolateral spaces elicited by an ammonium prepulse or lactate load. This genetically encoded sensor has the potential to serve as a noninvasive tool for monitoring changes in extracellular pH microdomains in epithelial and other tissues in vivo.
Martínez-Castilla, León P.; Rodríguez-Sotres, Rogelio
2010-01-01
Background Despite the remarkable progress of bioinformatics, how the primary structure of a protein leads to a three-dimensional fold, and in turn determines its function remains an elusive question. Alignments of sequences with known function can be used to identify proteins with the same or similar function with high success. However, identification of function-related and structure-related amino acid positions is only possible after a detailed study of every protein. Folding pattern diversity seems to be much narrower than sequence diversity, and the amino acid sequences of natural proteins have evolved under a selective pressure comprising structural and functional requirements acting in parallel. Principal Findings The approach described in this work begins by generating a large number of amino acid sequences using ROSETTA [Dantas G et al. (2003) J Mol Biol 332:449–460], a program with notable robustness in the assignment of amino acids to a known three-dimensional structure. The resulting sequence-sets showed no conservation of amino acids at active sites, or protein-protein interfaces. Hidden Markov models built from the resulting sequence sets were used to search sequence databases. Surprisingly, the models retrieved from the database sequences belonged to proteins with the same or a very similar function. Given an appropriate cutoff, the rate of false positives was zero. According to our results, this protocol, here referred to as Rd.HMM, detects fine structural details on the folding patterns, that seem to be tightly linked to the fitness of a structural framework for a specific biological function. Conclusion Because the sequence of the native protein used to create the Rd.HMM model was always amongst the top hits, the procedure is a reliable tool to score, very accurately, the quality and appropriateness of computer-modeled 3D-structures, without the need for spectroscopy data. However, Rd.HMM is very sensitive to the conformational features of the models' backbone. PMID:20830209
Sheridan, Robert P; Maiorov, Vladimir N; Holloway, M Katharine; Cornell, Wendy D; Gao, Ying-Duo
2010-11-22
One approach to estimating the "chemical tractability" of a candidate protein target where we know the atomic resolution structure is to examine the physical properties of potential binding sites. A number of other workers have addressed this issue. We characterize ~290,000 "pockets" from ~42,000 protein crystal structures in terms of a three parameter "pocket space": volume, buriedness, and hydrophobicity. A metric DLID (drug-like density) measures how likely a pocket is to bind a drug-like molecule. This is calculated from the count of other pockets in its local neighborhood in pocket space that contain drug-like cocrystallized ligands and the count of total pockets in the neighborhood. Surprisingly, despite being defined locally, a global trend in DLID can be predicted by a simple linear regression on log(volume), buriedness, and hydrophobicity. Two levels of simplification are necessary to relate the DLID of individual pockets to "targets": taking the best DLID per Protein Data Bank (PDB) entry (because any given crystal structure can have many pockets), and taking the median DLID over all PDB entries for the same target (because different crystal structures of the same protein can vary because of artifacts and real conformational changes). We can show that median DLIDs for targets that are detectably homologous in sequence are reasonably similar and that median DLIDs correlate with the "druggability" estimate of Cheng et al. (Nature Biotechnology 2007, 25, 71-75).
Dynamic programming algorithms for biological sequence comparison.
Pearson, W R; Miller, W
1992-01-01
Efficient dynamic programming algorithms are available for a broad class of protein and DNA sequence comparison problems. These algorithms require computer time proportional to the product of the lengths of the two sequences being compared [O(N2)] but require memory space proportional only to the sum of these lengths [O(N)]. Although the requirement for O(N2) time limits use of the algorithms to the largest computers when searching protein and DNA sequence databases, many other applications of these algorithms, such as calculation of distances for evolutionary trees and comparison of a new sequence to a library of sequence profiles, are well within the capabilities of desktop computers. In particular, the results of library searches with rapid searching programs, such as FASTA or BLAST, should be confirmed by performing a rigorous optimal alignment. Whereas rapid methods do not overlook significant sequence similarities, FASTA limits the number of gaps that can be inserted into an alignment, so that a rigorous alignment may extend the alignment substantially in some cases. BLAST does not allow gaps in the local regions that it reports; a calculation that allows gaps is very likely to extend the alignment substantially. Although a Monte Carlo evaluation of the statistical significance of a similarity score with a rigorous algorithm is much slower than the heuristic approach used by the RDF2 program, the dynamic programming approach should take less than 1 hr on a 386-based PC or desktop Unix workstation. For descriptive purposes, we have limited our discussion to methods for calculating similarity scores and distances that use gap penalties of the form g = rk. Nevertheless, programs for the more general case (g = q+rk) are readily available. Versions of these programs that run either on Unix workstations, IBM-PC class computers, or the Macintosh can be obtained from either of the authors.
Pruitt, Kim D.; Tatusova, Tatiana; Maglott, Donna R.
2005-01-01
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) provides a non-redundant collection of sequences representing genomic data, transcripts and proteins. Although the goal is to provide a comprehensive dataset representing the complete sequence information for any given species, the database pragmatically includes sequence data that are currently publicly available in the archival databases. The database incorporates data from over 2400 organisms and includes over one million proteins representing significant taxonomic diversity spanning prokaryotes, eukaryotes and viruses. Nucleotide and protein sequences are explicitly linked, and the sequences are linked to other resources including the NCBI Map Viewer and Gene. Sequences are annotated to include coding regions, conserved domains, variation, references, names, database cross-references, and other features using a combined approach of collaboration and other input from the scientific community, automated annotation, propagation from GenBank and curation by NCBI staff. PMID:15608248
Coarse-grained sequences for protein folding and design.
Brown, Scott; Fawzi, Nicolas J; Head-Gordon, Teresa
2003-09-16
We present the results of sequence design on our off-lattice minimalist model in which no specification of native-state tertiary contacts is needed. We start with a sequence that adopts a target topology and build on it through sequence mutation to produce new sequences that comprise distinct members within a target fold class. In this work, we use the alpha/beta ubiquitin fold class and design two new sequences that, when characterized through folding simulations, reproduce the differences in folding mechanism seen experimentally for proteins L and G. The primary implication of this work is that patterning of hydrophobic and hydrophilic residues is the physical origin for the success of relative contact-order descriptions of folding, and that these physics-based potentials provide a predictive connection between free energy landscapes and amino acid sequence (the original protein folding problem). We present results of the sequence mapping from a 20- to the three-letter code for determining a sequence that folds into the WW domain topology to illustrate future extensions to protein design.
Coarse-grained sequences for protein folding and design
Brown, Scott; Fawzi, Nicolas J.; Head-Gordon, Teresa
2003-01-01
We present the results of sequence design on our off-lattice minimalist model in which no specification of native-state tertiary contacts is needed. We start with a sequence that adopts a target topology and build on it through sequence mutation to produce new sequences that comprise distinct members within a target fold class. In this work, we use the α/β ubiquitin fold class and design two new sequences that, when characterized through folding simulations, reproduce the differences in folding mechanism seen experimentally for proteins L and G. The primary implication of this work is that patterning of hydrophobic and hydrophilic residues is the physical origin for the success of relative contact-order descriptions of folding, and that these physics-based potentials provide a predictive connection between free energy landscapes and amino acid sequence (the original protein folding problem). We present results of the sequence mapping from a 20- to the three-letter code for determining a sequence that folds into the WW domain topology to illustrate future extensions to protein design. PMID:12963815
Short Toxin-like Proteins Abound in Cnidaria Genomes
Tirosh, Yitshak; Linial, Itai; Askenazi, Manor; Linial, Michal
2012-01-01
Cnidaria is a rich phylum that includes thousands of marine species. In this study, we focused on Anthozoa and Hydrozoa that are represented by the Nematostella vectensis (Sea anemone) and Hydra magnipapillata genomes. We present a method for ranking the toxin-like candidates from complete proteomes of Cnidaria. Toxin-like functions were revealed using ClanTox, a statistical machine-learning predictor trained on ion channel inhibitors from venomous animals. Fundamental features that were emphasized in training ClanTox include cysteines and their spacing along the sequences. Among the 83,000 proteins derived from Cnidaria representatives, we found 170 candidates that fulfill the properties of toxin-like-proteins, the vast majority of which were previously unrecognized as toxins. An additional 394 short proteins exhibit characteristics of toxin-like proteins at a moderate degree of confidence. Remarkably, only 11% of the predicted toxin-like proteins were previously classified as toxins. Based on our prediction methodology and manual annotation, we inferred functions for over 400 of these proteins. Such functions include protease inhibitors, membrane pore formation, ion channel blockers and metal binding proteins. Many of the proteins belong to small families of paralogs. We conclude that the evolutionary expansion of toxin-like proteins in Cnidaria contributes to their fitness in the complex environment of the aquatic ecosystem. PMID:23202321
Short toxin-like proteins abound in Cnidaria genomes.
Tirosh, Yitshak; Linial, Itai; Askenazi, Manor; Linial, Michal
2012-11-16
Cnidaria is a rich phylum that includes thousands of marine species. In this study, we focused on Anthozoa and Hydrozoa that are represented by the Nematostella vectensis (Sea anemone) and Hydra magnipapillata genomes. We present a method for ranking the toxin-like candidates from complete proteomes of Cnidaria. Toxin-like functions were revealed using ClanTox, a statistical machine-learning predictor trained on ion channel inhibitors from venomous animals. Fundamental features that were emphasized in training ClanTox include cysteines and their spacing along the sequences. Among the 83,000 proteins derived from Cnidaria representatives, we found 170 candidates that fulfill the properties of toxin-like-proteins, the vast majority of which were previously unrecognized as toxins. An additional 394 short proteins exhibit characteristics of toxin-like proteins at a moderate degree of confidence. Remarkably, only 11% of the predicted toxin-like proteins were previously classified as toxins. Based on our prediction methodology and manual annotation, we inferred functions for over 400 of these proteins. Such functions include protease inhibitors, membrane pore formation, ion channel blockers and metal binding proteins. Many of the proteins belong to small families of paralogs. We conclude that the evolutionary expansion of toxin-like proteins in Cnidaria contributes to their fitness in the complex environment of the aquatic ecosystem.
Equally parsimonious pathways through an RNA sequence space are not equally likely
NASA Technical Reports Server (NTRS)
Lee, Y. H.; DSouza, L. M.; Fox, G. E.
1997-01-01
An experimental system for determining the potential ability of sequences resembling 5S ribosomal RNA (rRNA) to perform as functional 5S rRNAs in vivo in the Escherichia coli cellular environment was devised previously. Presumably, the only 5S rRNA sequences that would have been fixed by ancestral populations are ones that were functionally valid, and hence the actual historical paths taken through RNA sequence space during 5S rRNA evolution would have most likely utilized valid sequences. Herein, we examine the potential validity of all sequence intermediates along alternative equally parsimonious trajectories through RNA sequence space which connect two pairs of sequences that had previously been shown to behave as valid 5S rRNAs in E. coli. The first trajectory requires a total of four changes. The 14 sequence intermediates provide 24 apparently equally parsimonious paths by which the transition could occur. The second trajectory involves three changes, six intermediate sequences, and six potentially equally parsimonious paths. In total, only eight of the 20 sequence intermediates were found to be clearly invalid. As a consequence of the position of these invalid intermediates in the sequence space, seven of the 30 possible paths consisted of exclusively valid sequences. In several cases, the apparent validity/invalidity of the intermediate sequences could not be anticipated on the basis of current knowledge of the 5S rRNA structure. This suggests that the interdependencies in RNA sequence space may be more complex than currently appreciated. If ancestral sequences predicted by parsimony are to be regarded as actual historical sequences, then the present results would suggest that they should also satisfy a validity requirement and that, in at least limited cases, this conjecture can be tested experimentally.
NASA Astrophysics Data System (ADS)
Moreland, Blythe; Oman, Kenji; Curfman, John; Yan, Pearlly; Bundschuh, Ralf
Methyl-binding domain (MBD) protein pulldown experiments have been a valuable tool in measuring the levels of methylated CpG dinucleotides. Due to the frequent use of this technique, high-throughput sequencing data sets are available that allow a detailed quantitative characterization of the underlying interaction between methylated DNA and MBD proteins. Analyzing such data sets, we first found that two such proteins cannot bind closer to each other than 2 bp, consistent with structural models of the DNA-protein interaction. Second, the large amount of sequencing data allowed us to find rather weak but nevertheless clearly statistically significant sequence preferences for several bases around the required CpG. These results demonstrate that pulldown sequencing is a high-precision tool in characterizing DNA-protein interactions. This material is based upon work supported by the National Science Foundation under Grant No. DMR-1410172.
Lua, Rhonald C; Wilson, Stephen J; Konecki, Daniel M; Wilkins, Angela D; Venner, Eric; Morgan, Daniel H; Lichtarge, Olivier
2016-01-04
The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions
Abnousi, Armen; Broschat, Shira L.; Kalyanaraman, Ananth
2016-01-01
Background Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. Methods In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. Results We have compared NADDA with Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground-truth. On average NADDA shows comparable accuracy, more balanced sensitivity and specificity, and being alignment-free, is significantly faster. Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s for NADDA, ADDA, and MKDOM2, respectively, for a data set comprised of approximately 2500 sequences. PMID:27552220
Wolffe, E J; Gause, W C; Pelfrey, C M; Holland, S M; Steinberg, A D; August, J T
1990-01-05
We describe the isolation and sequencing of a cDNA encoding mouse Pgp-1. An oligonucleotide probe corresponding to the NH2-terminal sequence of the purified protein was synthesized by the polymerase chain reaction and used to screen a mouse macrophage lambda gt11 library. A cDNA clone with an insert of 1.2 kilobases was selected and sequenced. In Northern blot analysis, only cells expressing Pgp-1 contained mRNA species that hybridized with this Pgp-1 cDNA. The nucleotide sequence of the cDNA has a single open reading frame that yields a protein-coding sequence of 1076 base pairs followed by a 132-base pair 3'-untranslated sequence that includes a putative polyadenylation signal but no poly(A) tail. The translated sequence comprises a 13-amino acid signal peptide followed by a polypeptide core of 345 residues corresponding to an Mr of 37,800. Portions of the deduced amino acid sequence were identical to those obtained by amino acid sequence analysis from the purified glycoprotein, confirming that the cDNA encodes Pgp-1. The predicted structure of Pgp-1 includes an NH2-terminal extracellular domain (residues 14-265), a transmembrane domain (residues 266-286), and a cytoplasmic tail (residues 287-358). Portions of the mouse Pgp-1 sequence are highly similar to that of the human CD44 cell surface glycoprotein implicated in cell adhesion. The protein also shows sequence similarity to the proteoglycan tandem repeat sequences found in cartilage link protein and cartilage proteoglycan core protein which are thought to be involved in binding to hyaluronic acid.
DWARF – a data warehouse system for analyzing protein families
Fischer, Markus; Thai, Quan K; Grieb, Melanie; Pleiss, Jürgen
2006-01-01
Background The emerging field of integrative bioinformatics provides the tools to organize and systematically analyze vast amounts of highly diverse biological data and thus allows to gain a novel understanding of complex biological systems. The data warehouse DWARF applies integrative bioinformatics approaches to the analysis of large protein families. Description The data warehouse system DWARF integrates data on sequence, structure, and functional annotation for protein fold families. The underlying relational data model consists of three major sections representing entities related to the protein (biochemical function, source organism, classification to homologous families and superfamilies), the protein sequence (position-specific annotation, mutant information), and the protein structure (secondary structure information, superimposed tertiary structure). Tools for extracting, transforming and loading data from public available resources (ExPDB, GenBank, DSSP) are provided to populate the database. The data can be accessed by an interface for searching and browsing, and by analysis tools that operate on annotation, sequence, or structure. We applied DWARF to the family of α/β-hydrolases to host the Lipase Engineering database. Release 2.3 contains 6138 sequences and 167 experimentally determined protein structures, which are assigned to 37 superfamilies 103 homologous families. Conclusion DWARF has been designed for constructing databases of large structurally related protein families and for evaluating their sequence-structure-function relationships by a systematic analysis of sequence, structure and functional annotation. It has been applied to predict biochemical properties from sequence, and serves as a valuable tool for protein engineering. PMID:17094801
Hsing, Michael; Cherkasov, Artem
2008-06-25
Insertions and deletions (indels) represent a common type of sequence variations, which are less studied and pose many important biological questions. Recent research has shown that the presence of sizable indels in protein sequences may be indicative of protein essentiality and their role in protein interaction networks. Examples of utilization of indels for structure-based drug design have also been recently demonstrated. Nonetheless many structural and functional characteristics of indels remain less researched or unknown. We have created a web-based resource, Indel PDB, representing a structural database of insertions/deletions identified from the sequence alignments of highly similar proteins found in the Protein Data Bank (PDB). Indel PDB utilized large amounts of available structural information to characterize 1-, 2- and 3-dimensional features of indel sites. Indel PDB contains 117,266 non-redundant indel sites extracted from 11,294 indel-containing proteins. Unlike loop databases, Indel PDB features more indel sequences with secondary structures including alpha-helices and beta-sheets in addition to loops. The insertion fragments have been characterized by their sequences, lengths, locations, secondary structure composition, solvent accessibility, protein domain association and three dimensional structures. By utilizing the data available in Indel PDB, we have studied and presented here several sequence and structural features of indels. We anticipate that Indel PDB will not only enable future functional studies of indels, but will also assist protein modeling efforts and identification of indel-directed drug binding sites.
Sequence Determines Degree of Knottedness in a Coarse-Grained Protein Model
NASA Astrophysics Data System (ADS)
Wüst, Thomas; Reith, Daniel; Virnau, Peter
2015-01-01
Knots are abundant in globular homopolymers but rare in globular proteins. To shed new light on this long-standing conundrum, we study the influence of sequence on the formation of knots in proteins under native conditions within the framework of the hydrophobic-polar lattice protein model. By employing large-scale Wang-Landau simulations combined with suitable Monte Carlo trial moves we show that even though knots are still abundant on average, sequence introduces large variability in the degree of self-entanglements. Moreover, we are able to design sequences which are either almost always or almost never knotted. Our findings serve as proof of concept that the introduction of just one additional degree of freedom per monomer (in our case sequence) facilitates evolution towards a protein universe in which knots are rare.
Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier
2003-01-01
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human, and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene c;assifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.
The Protein Information Resource: an integrated public resource of functional annotation of proteins
Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.
2002-01-01
The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247
Analysis of secreted proteins from Aspergillus flavus.
Medina, Martha L; Haynes, Paul A; Breci, Linda; Francisco, Wilson A
2005-08-01
MS/MS techniques in proteomics make possible the identification of proteins from organisms with little or no genome sequence information available. Peptide sequences are obtained from tandem mass spectra by matching peptide mass and fragmentation information to protein sequence information from related organisms, including unannotated genome sequence data. This peptide identification data can then be grouped and reconstructed into protein data. In this study, we have used this approach to study protein secretion by Aspergillus flavus, a filamentous fungus for which very little genome sequence information is available. A. flavus is capable of degrading the flavonoid rutin (quercetin 3-O-glycoside), as the only source of carbon via an extracellular enzyme system. In this continuing study, a proteomic analysis was used to identify secreted proteins from A. flavus when grown on rutin. The growth media glucose and potato dextrose were used to identify differentially expressed secreted proteins. The secreted proteins were analyzed by 1- and 2-DE and MS/MS. A total of 51 unique A. flavus secreted proteins were identified from the three growth conditions. Ten proteins were unique to rutin-, five to glucose- and one to potato dextrose-grown A. flavus. Sixteen secreted proteins were common to all three media. Fourteen identifications were of hypothetical proteins or proteins of unknown functions. To our knowledge, this is the first extensive proteomic study conducted to identify the secreted proteins from a filamentous fungus.
Yerrapragada, Shaila; Shukla, Animesh; Hallsworth-Pepin, Kymberlie; Choi, Kwangmin; Wollam, Aye; Clifton, Sandra; Qin, Xiang; Muzny, Donna; Raghuraman, Sriram; Ashki, Haleh; Uzman, Akif; Highlander, Sarah K; Fryszczyn, Bartlomiej G; Fox, George E; Tirumalai, Madhan R; Liu, Yamei; Kim, Sun; Kehoe, David M; Weinstock, George M
2015-05-07
Tolypothrix sp. PCC 7601 is a freshwater filamentous cyanobacterium with complex responses to environmental conditions. Here, we present its 9.96-Mbp draft genome sequence, containing 10,065 putative protein-coding sequences, including 305 predicted two-component system proteins and 27 putative phytochrome-class photoreceptors, the most such proteins in any sequenced genome. Copyright © 2015 Yerrapragada et al.
Influence of Layup Sequence on the Surface Accuracy of Carbon Fiber Composite Space Mirrors
NASA Astrophysics Data System (ADS)
Yang, Zhiyong; Liu, Qingnian; Zhang, Boming; Xu, Liang; Tang, Zhanwen; Xie, Yongjie
2018-04-01
Layup sequence is directly related to stiffness and deformation resistance of the composite space mirror, and error caused by layup sequence can affect the surface precision of composite mirrors evidently. Variation of layup sequence with the same total thickness of composite space mirror changes surface form of the composite mirror, which is the focus of our study. In our research, the influence of varied quasi-isotropic stacking sequences and random angular deviation on the surface accuracy of composite space mirrors was investigated through finite element analyses (FEA). We established a simulation model for the studied concave mirror with 500 mm diameter, essential factors of layup sequences and random angular deviations on different plies were discussed. Five guiding findings were described in this study. Increasing total plies, optimizing stacking sequence and keeping consistency of ply alignment in ply placement are effective to improve surface accuracy of composite mirror.
A family of cellular proteins related to snake venom disintegrins.
Weskamp, G; Blobel, C P
1994-03-29
Disintegrins are short soluble integrin ligands that were initially identified in snake venom. A previously recognized cellular protein with a disintegrin domain was the guinea pig sperm protein PH-30, a protein implicated in sperm-egg membrane binding and fusion. Here we present peptide sequences that are characteristic for several cellular disintegrin-domain proteins. These peptide sequences were deduced from cDNA sequence tags that were generated by polymerase chain reaction from various mouse tissue and a mouse muscle cell line. Northern blot analysis with four sequence tags revealed distinct mRNA expression patterns. Evidently, cellular proteins containing a disintegrin domain define a superfamily of potential integrin ligands that are likely to function in important cell-cell and cell-matrix interactions.