Science.gov

Sample records for protein sequence comparison

  1. Protein sequence comparison and protein evolution

    SciTech Connect

    Pearson, W.R.

    1995-12-31

    This tutorial was one of eight tutorials selected to be presented at the Third International Conference on Intelligent Systems for Molecular Biology, which was held in the United Kingdom from July 16 to 19, 1995. This tutorial examines how the information conserved during the evolution of a protein molecule can be used to reliably infer homology, and thus a shared protein fold and possibly a shared active site or function. The authors start by reviewing a geological/evolutionary time scale. Next they look at the evolution of several protein families. During the tutorial, these families will be used to demonstrate that homologous protein ancestry can be inferred with confidence. They also examine different modes of protein evolution and consider some hypotheses that have been presented to explain the very earliest events in protein evolution. The next part of the tutorial will examine the technical aspects of protein sequence comparison. Both optimal and heuristic algorithms and their associated parameters that are used to characterize protein sequence similarities are discussed. Perhaps more importantly, they survey the statistics of local similarity scores, and how these statistics can be used both to improve the selectivity of a search and to evaluate the significance of a match. They then examine distantly related members of three protein families: the serine proteases, the glutathione transferases, and the G-protein-coupled receptors (GCRs). Finally, they discuss how sequence similarity can be used to examine internal repeated or mosaic structures in proteins.
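
    The role of local similarity score statistics mentioned in this abstract can be illustrated with a minimal sketch. It assumes the widely used Karlin-Altschul form for ungapped local alignments, E = K·m·n·exp(-λS); whether the tutorial uses exactly this form is an assumption. The function names and the values λ = 0.3176 and K = 0.134 are illustrative only and are not taken from the record.

```python
import math

# Illustrative parameters (assumed, not from the record above).
LAMBDA = 0.3176
K = 0.134


def expect_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance local matches scoring >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalised score, so that E = m * n * 2**(-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e_value):
    """Probability of at least one chance match, assuming Poisson counts."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for s in (40, 60, 80, 100):
        e = expect_value(s, query_len=300, db_len=1_000_000)
        print(s, round(bit_score(s), 1), e, p_value(e))
```

    Because E falls exponentially with the raw score, a threshold on E (rather than on the raw score) can be used both to filter a search and to judge a single match.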

  2. Comparison study on k-word statistical measures for protein: From sequence to 'sequence space'

    PubMed Central

    Dai, Qi; Wang, Tianming

    2008-01-01

    Background: Many proposed statistical measures can efficiently compare protein sequences to further infer protein structure, function and evolutionary information. They share the same idea of using k-word frequencies of protein sequences. Given a protein sequence, the information on its related protein sequences has not been used for protein sequence comparison until now. This paper proposed a scheme to construct a protein 'sequence space' associated with the protein sequences related to the given protein, and compared the performance of statistical measures with and without the information in the protein 'sequence space'. This paper also presented two statistical measures for proteins: gre.k (generalized relative entropy) and gsm.k (gapped similarity measure). Results: We tested statistical measures, with and without the protein 'sequence space', on three data sets. This not only offers a systematic and quantitative experimental assessment of these statistical measures, but also naturally complements the available comparison of statistical measures based on protein sequence. Moreover, we compared our statistical measures with alignment-based measures and the existing statistical measures. The experiments were grouped into two sets. The first one, performed via ROC (Receiver Operating Characteristic) analysis, aims at assessing the intrinsic ability of the statistical measures to discriminate and classify protein sequences. The second set of experiments aims at assessing how well our measure does in phylogenetic analysis. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of protein 'sequence space' and statistical measures were obtained. Conclusion: Alignment-based measures have a clear advantage when the data is highly redundant. Among the statistical measures, the novel gsm.k introduced by this article is the more efficient, followed by cos.k.
When the data becomes less redundant, gre.k proposed by us achieves a
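
    The shared idea of k-word frequencies described in this abstract can be sketched as follows. This is a plain cosine similarity over overlapping k-word counts, in the spirit of a cos.k-style measure; it is not the paper's gre.k or gsm.k, and the function names and default k are assumptions made for illustration.

```python
import math
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(seq_a, seq_b, k=2):
    """Cosine of the angle between the two k-word count vectors."""
    ca, cb = kword_counts(seq_a, k), kword_counts(seq_b, k)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sequence shorter than k has no words
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("MKTAYIAKQR", "MKTAYLAKQR", k=2))
```

    No alignment is computed, so the cost grows linearly with sequence length, which is the main appeal of k-word measures for large data sets.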

  3. Sequence-, structure-, and dynamics-based comparisons of structurally homologous CheY-like proteins

    PubMed Central

    He, Yi; Maisuradze, Gia G.; Yin, Yanping; Kachlishvili, Khatuna; Rackovsky, S.; Scheraga, Harold A.

    2017-01-01

    We recently introduced a physically based approach to sequence comparison, the property factor method (PFM). In the present work, we apply the PFM approach to the study of a challenging set of sequences—the bacterial chemotaxis protein CheY, the N-terminal receiver domain of the nitrogen regulation protein NT-NtrC, and the sporulation response regulator Spo0F. These are all response regulators involved in signal transduction. Despite functional similarity and structural homology, they exhibit low sequence identity. PFM sequence comparison demonstrates a statistically significant qualitative difference between the sequence of CheY and those of the other two proteins that is not found using conventional alignment methods. This difference is shown to be consonant with structural characteristics, using distance matrix comparisons. We also demonstrate that residues participating strongly in native contacts during unfolding are distributed differently in CheY than in the other two proteins. The PFM result is also in accord with dynamic simulation results of several types. Molecular dynamics simulations of all three proteins were carried out at several temperatures, and it is shown that the dynamics of CheY are predicted to differ from those of NT-NtrC and Spo0F. The predicted dynamic properties of the three proteins are in good agreement with experimentally determined B factors and with fluctuations predicted by the Gaussian network model. We pinpoint the differences between the PFM and traditional sequence comparisons and discuss the informatic basis for the ability of the PFM approach to detect physical differences between these sequences that are not apparent from traditional alignment-based comparison. PMID:28143938

  4. Protein sequence comparison and fold recognition: progress and good-practice benchmarking.

    PubMed

    Söding, Johannes; Remmert, Michael

    2011-06-01

    Protein sequence comparison methods have grown increasingly sensitive during the last decade and can often identify distantly related proteins sharing a common ancestor some 3 billion years ago. Although cellular function is not conserved so long, molecular functions and structures of protein domains often are. In combination with a domain-centered approach to function and structure prediction, modern remote homology detection methods have a great and largely underexploited potential for elucidating protein functions and evolution. Advances during the last few years include nonlinear scoring functions combining various sequence features, the use of sequence context information, and powerful new software packages. Since progress depends on realistically assessing new and existing methods and published benchmarks are often hard to compare, we propose 10 rules of good-practice benchmarking.

  5. A comparison of several similarity indices used in the classification of protein sequences: a multivariate analysis.

    PubMed Central

    Landès, C; Hénaut, A; Risler, J L

    1992-01-01

    The present work describes an attempt to identify reliable criteria which could be used as distance indices between protein sequences. Seven different criteria have been tested: i and ii) the scores of the alignments as given by the BESTFIT and the FASTA programs; iii) the ratio parameter, i.e. the BESTFIT score divided by the length of the aligned peptides; iv and v) the statistical significance (Z-scores) of the scores calculated by BESTFIT and FASTA, as obtained by comparison with shuffled sequences; vi) the Z-scores provided by the program RELATE, which performs a segment-by-segment comparison of 2 sequences; and vii) an original distance index calculated by the program DOCMA from all the pairwise dotplots between the sequences. These 7 criteria have been tested against the amino acid sequences of 39 globins and those of the 20 aminoacyl-tRNA synthetases from E. coli. The distances between the sequences were analyzed by multivariate analysis techniques. The results show that the distances calculated from the scores of the pairwise alignments are not adequately sensitive. The Z-score from RELATE is not selective enough and too demanding in computer time. Three criteria gave a classification consistent with the known similarities between the sequences in the sets, namely the Z-scores from BESTFIT and FASTA and the multiple dotplot comparison distance index from DOCMA. PMID:1641329
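
    The Z-score obtained "by comparison with shuffled sequences" can be sketched as below. The scoring function is a small Smith-Waterman local alignment with assumed match, mismatch and linear gap values; it stands in for BESTFIT or FASTA and does not reproduce either program. All names and defaults are hypothetical.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def shuffle_z_score(a, b, n_shuffles=200, seed=0):
    """(observed - mean) / stdev over scores against shuffled copies of b."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys order
        null_scores.append(local_score(a, "".join(letters)))
    sigma = statistics.stdev(null_scores)
    if sigma == 0:
        return 0.0  # degenerate null distribution; no meaningful Z
    return (observed - statistics.mean(null_scores)) / sigma


if __name__ == "__main__":
    print(shuffle_z_score("MKTAYIAKQRQISFVKSH", "MKTAYLAKQRQLSFVKSH"))
```

    Shuffling preserves amino acid composition, so the Z-score measures how much of the raw score comes from residue order rather than from composition or length alone.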

  6. iPBA: a tool for protein structure comparison using sequence alignment strategies

    PubMed Central

    Gelly, Jean-Christophe; Joseph, Agnel Praveen; Srinivasan, Narayanaswamy; de Brevern, Alexandre G.

    2011-01-01

    With the immense growth in the number of available protein structures, fast and accurate structure comparison has become essential. We propose an efficient method for structure comparison, based on a structural alphabet. Protein Blocks (PBs) is a widely used structural alphabet with 16 pentapeptide conformations that can fairly approximate a complete protein chain. Thus a 3D structure can be translated into a 1D sequence of PBs. With a simple Needleman–Wunsch approach and a raw PB substitution matrix, PB-based structural alignments were better than many popular methods. The iPBA web server presents an improved alignment approach using (i) specialized PB Substitution Matrices (SM) and (ii) anchor-based alignment methodology. With these developments, the quality of ∼88% of alignments was improved. iPBA alignments were also better than DALI, MUSTANG and GANGSTA+ in >80% of the cases. The web server is designed for both pairwise comparisons and database searches. Outputs are given as sequence alignment and superposed 3D structures displayed using PyMol and Jmol. A local alignment option for detecting sub-structural similarity is also embedded. As a fast and efficient ‘sequence-based’ structure comparison tool, we believe that it will be quite useful to the scientific community. iPBA can be accessed at http://www.dsimb.inserm.fr/dsimb_tools/ipba/. PMID:21586582
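
    The "simple Needleman–Wunsch approach" on 1D strings of Protein Blocks can be sketched as follows. The sketch assumes PBs are written as single lowercase letters and uses a plain match/mismatch score with a linear gap penalty in place of the specialized PB substitution matrices; it is not the iPBA implementation, and the anchor-based step is not shown.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Placeholder for a real PB substitution matrix lookup.
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("mmmmnopafkl", "mmmnopacfkl"))
```

    Swapping the `sub` helper for a lookup into a 16×16 PB substitution matrix would be the natural next step toward the approach the abstract describes.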

  7. Thermal adaptation analyzed by comparison of protein sequences from mesophilic and extremely thermophilic Methanococcus species

    NASA Technical Reports Server (NTRS)

    Haney, P. J.; Badger, J. H.; Buldak, G. L.; Reich, C. I.; Woese, C. R.; Olsen, G. J.

    1999-01-01

    The genome sequence of the extremely thermophilic archaeon Methanococcus jannaschii provides a wealth of data on proteins from a thermophile. In this paper, sequences of 115 proteins from M. jannaschii are compared with their homologs from mesophilic Methanococcus species. Although the growth temperatures of the mesophiles are about 50 degrees C below that of M. jannaschii, their genomic G+C contents are nearly identical. The properties most correlated with the proteins of the thermophile include higher residue volume, higher residue hydrophobicity, more charged amino acids (especially Glu, Arg, and Lys), and fewer uncharged polar residues (Ser, Thr, Asn, and Gln). These are recurring themes, with all trends applying to 83-92% of the proteins for which complete sequences were available. Nearly all of the amino acid replacements most significantly correlated with the temperature change are the same relatively conservative changes observed in all proteins, but in the case of the mesophile/thermophile comparison there is a directional bias. We identify 26 specific pairs of amino acids with a statistically significant (P < 0.01) preferred direction of replacement.

  9. Thermal adaptation analyzed by comparison of protein sequences from mesophilic and extremely thermophilic Methanococcus species

    PubMed Central

    Haney, Paul J.; Badger, Jonathan H.; Buldak, Gerald L.; Reich, Claudia I.; Woese, Carl R.; Olsen, Gary J.

    1999-01-01

    The genome sequence of the extremely thermophilic archaeon Methanococcus jannaschii provides a wealth of data on proteins from a thermophile. In this paper, sequences of 115 proteins from M. jannaschii are compared with their homologs from mesophilic Methanococcus species. Although the growth temperatures of the mesophiles are about 50°C below that of M. jannaschii, their genomic G+C contents are nearly identical. The properties most correlated with the proteins of the thermophile include higher residue volume, higher residue hydrophobicity, more charged amino acids (especially Glu, Arg, and Lys), and fewer uncharged polar residues (Ser, Thr, Asn, and Gln). These are recurring themes, with all trends applying to 83–92% of the proteins for which complete sequences were available. Nearly all of the amino acid replacements most significantly correlated with the temperature change are the same relatively conservative changes observed in all proteins, but in the case of the mesophile/thermophile comparison there is a directional bias. We identify 26 specific pairs of amino acids with a statistically significant (P < 0.01) preferred direction of replacement. PMID:10097079

  10. Thermal adaptation analyzed by comparison of protein sequences from mesophilic and extremely thermophilic Methanococcus species.

    PubMed

    Haney, P J; Badger, J H; Buldak, G L; Reich, C I; Woese, C R; Olsen, G J

    1999-03-30

    The genome sequence of the extremely thermophilic archaeon Methanococcus jannaschii provides a wealth of data on proteins from a thermophile. In this paper, sequences of 115 proteins from M. jannaschii are compared with their homologs from mesophilic Methanococcus species. Although the growth temperatures of the mesophiles are about 50 degrees C below that of M. jannaschii, their genomic G+C contents are nearly identical. The properties most correlated with the proteins of the thermophile include higher residue volume, higher residue hydrophobicity, more charged amino acids (especially Glu, Arg, and Lys), and fewer uncharged polar residues (Ser, Thr, Asn, and Gln). These are recurring themes, with all trends applying to 83-92% of the proteins for which complete sequences were available. Nearly all of the amino acid replacements most significantly correlated with the temperature change are the same relatively conservative changes observed in all proteins, but in the case of the mesophile/thermophile comparison there is a directional bias. We identify 26 specific pairs of amino acids with a statistically significant (P < 0.01) preferred direction of replacement.

  11. A statistical physics perspective on alignment-independent protein sequence comparison.

    PubMed

    Chattopadhyay, Amit K; Nasiev, Diar; Flower, Darren R

    2015-08-01

    Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-based similarity has performed poorly. Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from 'first passage probability distribution' to summarize statistics of ensemble averaged amino acid propensity values. In this article, we introduce and elaborate this approach. © The Author 2015. Published by Oxford University Press.

  12. Shotgun protein sequencing.

    SciTech Connect

    Faulon, Jean-Loup Michel; Heffelfinger, Grant S.

    2009-06-01

    A novel experimental and computational technique based on multiple enzymatic digestion of a protein or protein mixture that reconstructs protein sequences from sequences of overlapping peptides is described in this SAND report. This approach, analogous to shotgun sequencing of DNA, is to be used to sequence alternatively spliced proteins, to identify post-translational modifications, and to sequence genetically engineered proteins.

  13. Microsequence analysis of electroblotted proteins. II. Comparison of sequence performance on different types of PVDF membranes.

    PubMed

    Reim, D F; Speicher, D W

    1992-11-15

    The influence of different types of polyvinylidene difluoride (PVDF) membranes on gas phase sequence performance has been evaluated. These PVDF membranes have been classified as either high retention (Trans-Blot and ProBlott) or low retention membranes (Immobilon-P) based on their ability to bind proteins during electroblotting from gels. Initial yields, repetitive yields, and extraction efficiency of the anilinothiazolinone amino acid derivatives have been compared for several standard proteins that have been either electroblotted or loaded onto PVDF membranes by direct adsorption. These results show that the major differences in initial sequence yields between membranes arise from differences in the amount of protein actually transferred to the membrane rather than sequencer-related factors. In contrast to several previous observations from other laboratories, more tightly bound proteins do not sequence with lower initial yields and initial yields are not affected by the ratio of surface area to protein. The stronger binding on high retention PVDF membranes does not adversely affect recoveries of difficult to extract, or very hydrophobic, amino acid derivatives. Several amino acids, especially tryptophan, are actually recovered in dramatically higher yield on high retention membranes compared with either Immobilon or glass filters. At the same time, the protein and peptide binding properties of high retention membranes will frequently improve the repetitive yield by minimizing sample extraction during the sequencer cycle. Stronger protein binding together with improved electroblotting yields offer substantially improved sequence performance when high retention PVDF membranes are used.

  14. Large-Scale Sequence Comparison.

    PubMed

    Lal, Devi; Verma, Mansi

    2017-01-01

    There are millions of sequences deposited in genomic databases, and it is an important task to categorize them according to their structural and functional roles. Sequence comparison is a prerequisite for proper categorization of both DNA and protein sequences, and helps in assigning a putative or hypothetical structure and function to a given sequence. There are various methods available for comparing sequences, alignment being first and foremost for sequences with a small number of base pairs as well as for large-scale genome comparison. Various tools are available for performing pairwise large sequence comparison. The best known tools either perform global alignment or generate local alignments between the two sequences. In this chapter we first provide basic information regarding sequence comparison. This is followed by the description of the PAM and BLOSUM matrices that form the basis of sequence comparison. We also give a practical overview of currently available methods such as BLAST and FASTA, followed by a description and overview of tools available for genome comparison including LAGAN, MUMmer, BLASTZ, and AVID.

  15. Protein Sequence Comparison Based on Physicochemical Properties and the Position-Feature Energy Matrix

    PubMed Central

    Yu, Lulu; Zhang, Yusen; Gutman, Ivan; Shi, Yongtang; Dehmer, Matthias

    2017-01-01

    We develop a novel position-feature-based model for protein sequences by employing physicochemical properties of 20 amino acids and the measure of graph energy. The method puts the emphasis on sequence order information and describes local dynamic distributions of sequences, from which one can get a characteristic B-vector. Afterwards, we apply the relative entropy to the sequences representing B-vectors to measure their similarity/dissimilarity. The numerical results obtained in this study show that the proposed method leads to meaningful results compared with competitors such as Clustal W. PMID:28393857

  16. Protein Sequence Comparison Based on Physicochemical Properties and the Position-Feature Energy Matrix.

    PubMed

    Yu, Lulu; Zhang, Yusen; Gutman, Ivan; Shi, Yongtang; Dehmer, Matthias

    2017-04-10

    We develop a novel position-feature-based model for protein sequences by employing physicochemical properties of 20 amino acids and the measure of graph energy. The method puts the emphasis on sequence order information and describes local dynamic distributions of sequences, from which one can get a characteristic B-vector. Afterwards, we apply the relative entropy to the sequences representing B-vectors to measure their similarity/dissimilarity. The numerical results obtained in this study show that the proposed method leads to meaningful results compared with competitors such as Clustal W.

  17. Establishing homologies in protein sequences

    NASA Technical Reports Server (NTRS)

    Dayhoff, M. O.; Barker, W. C.; Hunt, L. T.

    1983-01-01

    Computer-based statistical techniques used to determine homologies between proteins occurring in different species are reviewed. The technique is based on comparison of two protein sequences, either by relating all segments of a given length in one sequence to all segments of the second or by finding the best alignment of the two sequences. Approaches discussed include selection using printed tabulations, identification of very similar sequences, and computer searches of a database. The use of the SEARCH, RELATE, and ALIGN programs (Dayhoff, 1979) is explained; sample data are presented in graphs, diagrams, and tables and the construction of scoring matrices is considered.

  19. Zucchini yellow mosaic virus: biological properties, detection procedures and comparison of coat protein gene sequences.

    PubMed

    Coutts, B A; Kehoe, M A; Webster, C G; Wylie, S J; Jones, R A C

    2011-12-01

    Between 2006 and 2010, 5324 samples from at least 34 weed, two cultivated legume and 11 native species were collected from three cucurbit-growing areas in tropical or subtropical Western Australia. Two new alternative hosts of zucchini yellow mosaic virus (ZYMV) were identified, the Australian native cucurbit Cucumis maderaspatanus, and the naturalised legume species Rhynchosia minima. Low-level (0.7%) seed transmission of ZYMV was found in seedlings grown from seed collected from zucchini (Cucurbita pepo) fruit infected with isolate Cvn-1. Seed transmission was absent in >9500 pumpkin (C. maxima and C. moschata) seedlings from fruit infected with isolate Knx-1. Leaf samples from symptomatic cucurbit plants collected from fields in five cucurbit-growing areas in four Australian states were tested for the presence of ZYMV. When 42 complete coat protein (CP) nucleotide (nt) sequences from the new ZYMV isolates obtained were compared to those of 101 complete CP nt sequences from five other continents, phylogenetic analysis of the 143 ZYMV sequences revealed three distinct groups (A, B and C), with four subgroups in A (I-IV) and two in B (I-II). The new Australian sequences grouped according to collection location, fitting within A-I, A-II and B-II. The 16 new sequences from one isolated location in tropical northern Western Australia all grouped into subgroup B-II, which contained no other isolates. In contrast, the three sequences from the Northern Territory fitted into A-II with 94.6-99.0% nt identities with isolates from the United States, Iran, China and Japan. The 23 new sequences from the central west coast and two east coast locations all fitted into A-I, with 95.9-98.9% nt identities to sequences from Europe and Japan. These findings suggest that (i) there have been at least three separate ZYMV introductions into Australia and (ii) there are few changes to local isolate CP sequences following their establishment in remote growing areas.
Isolates from A-I and B

  20. Relating Promoter Sequences to the Proteins that Bind to Them: A Comparison Study.

    NASA Astrophysics Data System (ADS)

    Glass, Kimberly

    2007-03-01

    Chromatin Immunoprecipitation (ChIP-on-chip) microarray data reveal that the proteins H3K9dimethyl and RNA-Polymerase II are exclusive regarding their binding to the promoter region of genes. When comparing the base pair sequences of the promoters that bind to Pol2 versus H3K9, striking differences appear. The mononucleotides have fundamentally different behaviors in each group. In addition, motifs that cluster before the transcriptional start site also generally have a strong enrichment in one group compared to the other. Using this knowledge, a model can be developed that allows one to calculate a probability that a promoter will bind to either H3K9 or Pol2 based on its base pair sequence.

  1. Sequence comparison and phylogenetic analysis by the Maximum Likelihood method of ribosome-inactivating proteins from angiosperms.

    PubMed

    Di Maro, Antimo; Citores, Lucía; Russo, Rosita; Iglesias, Rosario; Ferreras, José Miguel

    2014-08-01

    Ribosome-inactivating proteins (RIPs) from angiosperms are rRNA N-glycosidases that have been proposed as defence proteins against viruses and fungi. They have been classified as type 1 RIPs, consisting of single-chain proteins, and type 2 RIPs, consisting of an A chain with RIP properties covalently linked to a B chain with lectin properties. In this work we have carried out a broad search of RIP sequence data banks from angiosperms in order to study their main structural characteristics and phylogenetic evolution. The comparison of the sequences revealed the presence, outside of the active site, of a novel structure that might be involved in the internal protein dynamics linked to enzyme catalysis. The B chains also presented another conserved structure that might function either in supporting the beta-trefoil structure or in the communication between both sugar-binding sites. A systematic phylogenetic analysis of RIP sequences revealed that the most primitive type 1 RIPs were similar to those of present-day monocots (Poaceae and Asparagaceae). The primitive RIPs evolved to the dicot type 1 related RIPs (like those from Caryophyllales, Lamiales and Euphorbiales). The gene of a type 1 RIP related to the present-day Euphorbiaceae type 1 RIPs fused with a double beta-trefoil lectin gene similar to the present-day Cucurbitaceae lectins to generate the type 2 RIPs, and finally this gene underwent deletions rendering either type 1 RIPs (like those from Cucurbitaceae, Rosaceae and Iridaceae) or lectins without an A chain (like those from Adoxaceae).

  2. Sequence Comparison and Phylogeny of Nucleotide Sequence of Coat Protein and Nucleic Acid Binding Protein of a Distinct Isolate of Shallot virus X from India.

    PubMed

    Majumder, S; Baranwal, V K

    2011-06-01

    Shallot virus X (ShVX), a type species in the genus Allexivirus of the family Alfaflexiviridae, has been associated with shallot plants in India and other shallot-growing countries such as Russia, Germany, the Netherlands, and New Zealand. The coat protein (CP) and nucleic acid binding protein (NB) regions of the virus were obtained by reverse transcriptase polymerase chain reaction from scale leaves of shallot bulbs. The partial cDNA contained two open reading frames encoding proteins of molecular weights of 28.66 and 14.18 kDa belonging to the Flexi_CP super-family and the viral NB super-family, respectively. The percent identity and phylogenetic analysis of amino acid sequences of the CP and NB regions of the virus associated with shallot indicated that it was a distinct isolate of ShVX.

  3. A novel regucalcin gene promoter region-related protein: comparison of nucleotide and amino acid sequences in vertebrate species.

    PubMed

    Sawada, Natsumi; Yamaguchi, Masayoshi

    2005-01-01

    The molecular cloning and sequencing of the cDNA coding for a novel regucalcin gene promoter region-related protein (RGPR-p117) from bovine, rabbit and chicken livers were investigated using the rapid amplification of cDNA ends (RACE) method. Their nucleotide and amino acid sequences were compared with human, rat and mouse sequences published previously. RGPR-p117 of bovine, rabbit and chicken livers consisted of 1052, 1045, and 929 amino acid residues with calculated molecular mass of 117, 114, and 103 kDa, and estimated pI of 5.64, 5.84, and 5.59, respectively. Comparison analysis revealed that the nucleotide sequences of RGPR-p117 from mammalian species were highly conserved in their coding region, and the homologies were at least 72.9%. The RGPR-p117 proteins in mammalian species consisted of 1045-1060 amino acids, and had 63.1-90.2% identity. Meanwhile, the nucleotide and amino acid sequences of chicken RGPR-p117 had at least 36.4 and 43.7% identities, respectively. Phylogenetic analysis showed that RGPR-p117 in six vertebrates appears to form a single cluster. Mammalian RGPR-p117 conserved a leucine zipper motif. Moreover, the analysis for subcellular localization of RGPR-p117 from six vertebrates showed the probability of nuclear localization >52.2%; the nuclear localization in rat and mouse was 78.3%. This study demonstrates a great conservation of RGPR-p117 genes throughout evolution.

  4. Iranian johnsongrass mosaic virus: the complete genome sequence, molecular and biological characterization, and comparison of coat protein gene sequences.

    PubMed

    Moradi, Zohreh; Mehrvar, Mohsen; Nazifi, Ehsan; Zakiaghl, Mohammad

    2017-02-01

    Iranian johnsongrass mosaic virus (IJMV) is one of the most prevalent viruses causing maize mosaic disease in Iran. An IJMV isolate, Maz-Bah, was obtained from the maize showing mosaic symptoms in Mazandaran, north of Iran. The complete genomic sequence of Maz-Bah is 9544 nucleotides, excluding the poly(A) tail. It contains one single open reading frame of 9165 nucleotides and encodes a large polyprotein of 3054 amino acids, flanked by a 5'-untranslated region (UTR) of 143 nucleotides and a 3'-UTR of 236 nucleotides. The entire genomic sequence of Maz-Bah isolate shares identities of 84.9 and 94.2 % with the IJMV (Shz) isolate, the lone complete genome sequence available in the GenBank at the nucleotide (nt) and deduced amino acid (aa) levels, respectively. The whole genome sequences share identities of 51.5-69.8 and 44.9-74.3 % with those of other Sugarcane mosaic virus (SCMV) subgroup potyviruses at nt and aa levels, respectively. In phylogenetic trees based on the multiple alignments of the entire nt and aa sequences, IJMV isolates formed a separate sublineage of the tree with potyviruses infecting monocotyledons of cereals, indicating that IJMV is a member of SCMV subgroup of potyviruses. IJMV is most closely related to Sorghum mosaic virus and Maize dwarf mosaic virus and less closely related to the Johnsongrass mosaic virus and Cocksfoot streak virus. To further investigate the genetic relationship of IJMV, 9 other isolates from different hosts were cloned and sequenced. The identity of IJMV CP nt and aa sequences of 11 Iranian isolates ranged from 86.4 to 99.8 % and 90.5 to 99.7 %, respectively, indicating a high nt variability in CP gene. Furthermore, in the CP-based phylogenetic tree, IJMV isolates were clustered together with a maize potyvirus described as Zea mosaic virus from Israel (with 86-89 % nt identity), indicating that both isolates probably are the strains of the same virus.
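
    The genome layout reported in this abstract can be checked with simple arithmetic. Reading the one extra codon as the stop codon is an assumption here; the abstract does not state it.

```latex
\[
\underbrace{143}_{5'\text{-UTR}} + \underbrace{9165}_{\text{ORF}} + \underbrace{236}_{3'\text{-UTR}} = 9544 \text{ nt},
\qquad
\frac{9165}{3} = 3055 \text{ codons} = 3054 \text{ amino acids} + 1 \text{ (assumed) stop codon}.
\]
```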

  5. Comparison of Exome and Genome Sequencing Technologies for the Complete Capture of Protein-Coding Regions.

    PubMed

    Lelieveld, Stefan H; Spielmann, Malte; Mundlos, Stefan; Veltman, Joris A; Gilissen, Christian

    2015-08-01

    For next-generation sequencing technologies, sufficient base-pair coverage is the foremost requirement for the reliable detection of genomic variants. We investigated whether whole-genome sequencing (WGS) platforms offer improved coverage of coding regions compared with whole-exome sequencing (WES) platforms, and compared single-base coverage for a large set of exome and genome samples. We find that WES platforms have improved considerably in the last years, but at comparable sequencing depth, WGS outperforms WES in terms of covered coding regions. At higher sequencing depth (95x-160x), WES successfully captures 95% of the coding regions with a minimal coverage of 20x, compared with 98% for WGS at 87-fold coverage. Three different assessments of sequence coverage bias showed consistent biases for WES but not for WGS. We found no clear differences for the technologies concerning their ability to achieve complete coverage of 2,759 clinically relevant genes. We show that WES performs comparable to WGS in terms of covered bases if sequenced at two to three times higher coverage. This does, however, go at the cost of substantially more sequencing biases in WES approaches. Our findings will guide laboratories to make an informed decision on which sequencing platform and coverage to choose.
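
    The coverage figures quoted in this abstract (for example, 95% of coding regions at a minimal coverage of 20x) are fractions of positions that reach a depth threshold. A minimal sketch of that calculation follows. The function name and the per-base depths are hypothetical and are not taken from the study.

```python
from typing import Sequence


def fraction_covered(depths: Sequence[int], min_depth: int = 20) -> float:
    """Return the fraction of positions whose read depth is at least min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


# Hypothetical per-base depths for ten coding positions (illustrative only).
example_depths = [35, 22, 19, 40, 20, 5, 61, 28, 0, 24]

# Seven of the ten positions reach 20x.
print(fraction_covered(example_depths))
```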

  6. Grouping and comparison of Indian citrus tristeza virus isolates based on coat protein gene sequences and restriction analysis patterns.

    PubMed

    Roy, A; Ramachandran, P; Brlansky, R H

    2003-04-01

    Citrus tristeza virus (CTV) is an aphid-transmitted closterovirus, which causes one of the most important citrus diseases worldwide. Isolates of CTV differ widely in their biological properties. CTV-infected samples were collected from four locations in India: Bangalore (CTV-B), Delhi (CTV-D), Nagpur (CTV-N), and Pune (CTV-P), and were maintained by grafting into Kagzi lime (Citrus aurantifolia (Christm.) Swing.). All isolates produced typical vein clearing and flecking symptoms 6-8 weeks after grafting. In addition, CTV-B and CTV-P isolates produced stem-pitting symptoms after 8-10 months. The CTV coat protein gene (CPG) was amplified by RT-PCR using CPG specific primers, yielding an amplicon of 672 bp for all the isolates. Sequence analysis of the CPG amplicon of all the four Indian isolates showed 93-94% nucleotide sequence homology to the Californian CTV severe stem pitting isolate SY568 and 92-93% homology to the Japanese seedling yellows isolate NUagA and Israeli VT p346 isolates. In phylogenetic tree analysis, Indian CTV isolates appeared far different from other isolates as they formed a separate branch. Comparison among the Indian isolates was carried out by restriction analysis and restriction fragment length polymorphism (RFLP). Specific primers to various genome segments of well-characterized CTV isolates were used to further classify the Indian CTV isolates.

  7. Protein Structure Comparison and Classification

    NASA Astrophysics Data System (ADS)

    Çamoǧlu, Orhan; Singh, Ambuj K.

    The success of genome projects has generated an enormous amount of sequence data. In order to realize the full value of the data, we need to understand its functional role and its evolutionary origin. Sequence comparison methods are incredibly valuable for this task. However, for sequences falling in the twilight zone (usually between 20 and 35% sequence similarity), we need to resort to structural alignment and comparison for a meaningful analysis. Such a structural approach can be used for classification of proteins, isolation of structural motifs, and discovery of drug targets.

  8. Comparison of the sequence of the gene encoding African swine fever virus attachment protein p12 from field virus isolates and viruses passaged in tissue culture.

    PubMed Central

    Angulo, A; Viñuela, E; Alcamí, A

    1992-01-01

    Comparison of the amino acid sequence of the African swine fever virus attachment protein p12 from different field virus isolates, deduced from the nucleotide sequence of the gene, revealed a high degree of conservation. No mutations were found after adaptation to Vero cells, and a polypeptide with similar characteristics was present in an IBRS2-adapted virus. The sequence of the 5' flanking region was conserved among the isolates, whereas sequences downstream of the gene were highly variable in length and contained direct repeats in tandem that may account for the deletions found in different isolates. Protein p12 was synthesized in swine macrophages infected with all of the viruses tested. PMID:1583733

  9. Sequence comparisons via algorithmic mutual information

    SciTech Connect

    Milosavijevic, A.

    1994-12-31

    One of the main problems in DNA and protein sequence comparisons is to decide whether observed similarity of two sequences should be explained by their relatedness or by mere presence of some shared internal structure, e.g., shared internal tandem repeats. The standard methods that are based on statistics or classical information theory can be used to discover either internal structure or mutual sequence similarity, but cannot take into account both. Consequently, currently used methods for sequence comparison employ "masking" techniques that simply eliminate sequences that exhibit internal repetitive structure prior to sequence comparisons. The "masking" approach precludes discovery of homologous sequences of moderate or low complexity, which abound at both DNA and protein levels. As a solution to this problem, we propose a general method that is based on algorithmic information theory and minimal length encoding. We show that algorithmic mutual information factors out the sequence similarity that is due to shared internal structure and thus enables discovery of truly related sequences. We extend the recently developed algorithmic significance method to show that significance depends exponentially on algorithmic mutual information.
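
    The idea of algorithmic mutual information can be illustrated with a rough, compression-based proxy: I(x:y) is approximated by C(x) + C(y) - C(xy), where C is a compressed length. The sketch below uses zlib as the compressor. This is a substitution made for illustration and not the minimal length encoding scheme of the record above. The function names and the test sequences are hypothetical.

```python
import random
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed text (a crude complexity proxy)."""
    return len(zlib.compress(text.encode("ascii"), 9))


def mutual_information_estimate(x: str, y: str) -> int:
    """Approximate I(x:y) as C(x) + C(y) - C(xy), with C = compressed size."""
    return compressed_size(x) + compressed_size(y) - compressed_size(x + y)


# Hypothetical data: random DNA, a lightly mutated copy, and an unrelated sequence.
rng = random.Random(0)
x = "".join(rng.choice("ACGT") for _ in range(2000))
related = "".join(c if rng.random() > 0.02 else rng.choice("ACGT") for c in x)
unrelated = "".join(rng.choice("ACGT") for _ in range(2000))

# The related pair shares far more compressible information than the unrelated pair.
print(mutual_information_estimate(x, related))
print(mutual_information_estimate(x, unrelated))
```

    A general-purpose compressor only bounds the true algorithmic quantity from above, so the numbers are useful for ranking pairs rather than as absolute values.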

  10. Comparison of sequence of cDNA clone with other genomic and cDNA sequences for human C-reactive protein

    SciTech Connect

    Tenchini, M.L.; Bossi, E.; Marchetti, L.; Malcovati, M.; Lorenzetti, R.

    1992-04-01

    A clone for C-reactive protein (CRP) has been isolated from a human liver cDNA library; this clone harbors a plasmid, pC81, which has an insert of 1631 bp. When compared to genomic and cDNA sequences published to date, pC81 has revealed homologies and differences that might help to clarify the structure of this gene and the presence of allelic variants in man.

  11. Molecular cloning and sequence analysis of the Sta58 major antigen gene of Rickettsia tsutsugamushi: sequence homology and antigenic comparison of Sta58 to the 60-kilodalton family of stress proteins.

    PubMed Central

    Stover, C K; Marana, D P; Dasch, G A; Oaks, E V

    1990-01-01

    The scrub typhus 58-kilodalton (kDa) antigen (Sta58) of Rickettsia tsutsugamushi is a major protein antigen often recognized by humans infected with scrub typhus rickettsiae. A 2.9-kilobase HindIII fragment containing a complete sta58 gene was cloned in Escherichia coli and found to express the entire Sta58 antigen and a smaller protein with an apparent molecular mass of 11 kDa (Stp11). DNA sequence analysis of the 2.9-kilobase HindIII fragment revealed two adjacent open reading frames encoding proteins of 11 (Stp11) and 60 (Sta58) kDa. Comparisons of deduced amino acid sequences disclosed a high degree of homology between the R. tsutsugamushi proteins Stp11 and Sta58 and the E. coli proteins GroES and GroEL, respectively, and the family of primordial heat shock proteins designated Hsp10 and Hsp60. Although the sequence homology between the Sta58 antigen and the Hsp60 protein family is striking, the Sta58 protein appeared to be antigenically distinct among a sample of other bacterial Hsp60 homologs, including the typhus group of rickettsiae. The antigenic uniqueness of the Sta58 antigen indicates that this protein may be a potentially protective antigen and a useful diagnostic reagent for scrub typhus fever. PMID:2108930

  12. Evaluation of global sequence comparison and one-to-one FASTA local alignment in regulatory allergenicity assessment of transgenic proteins in food crops.

    PubMed

    Song, Ping; Herman, Rod A; Kumpatla, Siva

    2014-09-01

    To address the high false positive rate using >35% identity over 80 amino acids in the regulatory assessment of transgenic proteins for potential allergenicity and the change of E-value with database size, the Needleman-Wunsch global sequence alignment and a one-to-one (1:1) local FASTA search (one protein in the target database at a time) were evaluated by comparing proteins randomly selected from Arabidopsis, rice, corn, and soybean with known allergens in a peer-reviewed allergen database (http://www.allergenonline.org/). Compared with the approach of searching >35%/80aa+, the false positive rate measured by specificity rate for identification of true allergens was reduced by a 1:1 global sequence alignment with a cut-off threshold of ≥30% identity and a 1:1 FASTA local alignment with a cut-off E-value of ≤1.0E-09 while maintaining the same sensitivity. Hence, a 1:1 sequence comparison, especially using the FASTA local alignment tool with a biologically relevant E-value of 1.0E-09 as a threshold, is recommended for the regulatory assessment of sequence identities between transgenic proteins in food crops and known allergens.
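
    The 1:1 global comparison described in this abstract can be sketched as a Needleman-Wunsch alignment followed by a percent-identity check against the 30% cut-off. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions made for illustration. The abstract does not give the scoring scheme used in the study, and real assessments would use a substitution matrix.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> Tuple[str, str]:
    """Globally align a and b; return the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical, non-gap columns as a percentage of alignment length."""
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# Hypothetical short peptides; 30.0 is the identity cut-off quoted in the abstract.
aln_a, aln_b = needleman_wunsch("MKTAYIAK", "MKTYIAK")
identity = percent_identity(aln_a, aln_b)
print(aln_a, aln_b, identity, identity >= 30.0)
```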

  13. Comparison of topological clustering within protein networks using edge metrics that evaluate full sequence, full structure, and active site microenvironment similarity.

    PubMed

    Leuthaeuser, Janelle B; Knutson, Stacy T; Kumar, Kiran; Babbitt, Patricia C; Fetrow, Jacquelyn S

    2015-09-01

    The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods. © 2015 The Authors Protein Science

  14. ProteinArchitect: protein evolution above the sequence level.

    PubMed

    Haimel, Matthias; Pröll, Karin; Rebhan, Michael

    2009-07-15

    While many authors have discussed models and tools for studying protein evolution at the sequence level, molecular function is usually mediated by complex, higher order features such as independently folding domains and linear motifs that are based on or embedded in a particular arrangement of features such as secondary structure elements, transmembrane domains and regions with intrinsic disorder. This 'protein architecture' can, in its most simplistic representation, be visualized as domain organization cartoons that can be used to compare proteins in terms of the order of their mostly globular domains. Here, we describe a visual approach and a webserver for protein comparison that extend the domain organization cartoon concept. By developing an information-rich, compact visualization of different protein features above the sequence level, potentially related proteins can be compared at the level of propensities for secondary structure, transmembrane domains and intrinsic disorder, in addition to PFAM domains. A public Web server is available at www.proteinarchitect.net, while the code is provided at protarchitect.sourceforge.net. Due to recent advances in sequencing technologies we are now flooded with millions of predicted proteins that await comparative analysis. In many cases, mature tools focused on revealing hits with considerable global or local similarity to well-characterized proteins will not be able to lead us to testable hypotheses about a protein's function, or the function of a particular region. The visual comparison of different types of protein features with ProteinArchitect will be useful when assessing the relevance of similarity search hits, to discover subgroups in protein families and superfamilies, and to understand protein regions with conserved features outside globular regions. Therefore, this approach is likely to help researchers to develop testable hypotheses about a protein's function even if it is somewhat distant from the more

  15. Comparisons of Ribosomal Protein Gene Promoters Indicate Superiority of Heterologous Regulatory Sequences for Expressing Transgenes in Phytophthora infestans

    PubMed Central

    Khachatoorian, Careen; Judelson, Howard S.

    2015-01-01

    Molecular genetics approaches in Phytophthora research can be hampered by the limited number of known constitutive promoters for expressing transgenes and the instability of transgene activity. We have therefore characterized genes encoding the cytoplasmic ribosomal proteins of Phytophthora and studied their suitability for expressing transgenes in P. infestans. Phytophthora spp. encode a standard complement of 79 cytoplasmic ribosomal proteins. Several genes are duplicated, and two appear to be pseudogenes. Half of the genes are expressed at similar levels during all stages of asexual development, and we discovered that the majority share a novel promoter motif named the PhRiboBox. This sequence is enriched in genes associated with transcription, translation, and DNA replication, including tRNA and rRNA biogenesis. Promoters from the three P. infestans genes encoding ribosomal proteins S9, L10, and L23 and their orthologs from P. capsici were tested for their ability to drive transgenes in stable transformants of P. infestans. Five of the six promoters yielded strong expression of a GUS reporter, but the stability of expression was higher using the P. capsici promoters. With the RPS9 and RPL10 promoters of P. infestans, about half of transformants stopped making GUS over two years of culture, while their P. capsici orthologs conferred stable expression. Since cross-talk between native and transgene loci may trigger gene silencing, we encourage the use of heterologous promoters in transformation studies. PMID:26716454

  16. Comparisons of Ribosomal Protein Gene Promoters Indicate Superiority of Heterologous Regulatory Sequences for Expressing Transgenes in Phytophthora infestans.

    PubMed

    Poidevin, Laetitia; Andreeva, Kalina; Khachatoorian, Careen; Judelson, Howard S

    2015-01-01

    Molecular genetics approaches in Phytophthora research can be hampered by the limited number of known constitutive promoters for expressing transgenes and the instability of transgene activity. We have therefore characterized genes encoding the cytoplasmic ribosomal proteins of Phytophthora and studied their suitability for expressing transgenes in P. infestans. Phytophthora spp. encode a standard complement of 79 cytoplasmic ribosomal proteins. Several genes are duplicated, and two appear to be pseudogenes. Half of the genes are expressed at similar levels during all stages of asexual development, and we discovered that the majority share a novel promoter motif named the PhRiboBox. This sequence is enriched in genes associated with transcription, translation, and DNA replication, including tRNA and rRNA biogenesis. Promoters from the three P. infestans genes encoding ribosomal proteins S9, L10, and L23 and their orthologs from P. capsici were tested for their ability to drive transgenes in stable transformants of P. infestans. Five of the six promoters yielded strong expression of a GUS reporter, but the stability of expression was higher using the P. capsici promoters. With the RPS9 and RPL10 promoters of P. infestans, about half of transformants stopped making GUS over two years of culture, while their P. capsici orthologs conferred stable expression. Since cross-talk between native and transgene loci may trigger gene silencing, we encourage the use of heterologous promoters in transformation studies.

  17. Supercomputers and biological sequence comparison algorithms.

    PubMed

    Core, N G; Edmiston, E W; Saltz, J H; Smith, R M

    1989-12-01

    Comparison of biological (DNA or protein) sequences provides insight into molecular structure, function, and homology and is increasingly important as the available databases become larger and more numerous. One method of increasing the speed of the calculations is to perform them in parallel. We present the results of initial investigations using two dynamic programming algorithms on the Intel iPSC hypercube and the Connection Machine as well as an inexpensive, heuristically-based algorithm on the Encore Multimax.
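
    Dynamic programming alignments parallelize well because every cell on one anti-diagonal of the score matrix depends only on the two previous anti-diagonals. The sketch below computes a Smith-Waterman local alignment score in that wavefront order. The abstract does not say which two algorithms were run, so Smith-Waterman and the scoring values are stand-ins chosen for illustration. The inner loop runs serially here and marks where parallel hardware would split the work.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best Smith-Waterman local score, filled one anti-diagonal at a time."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    # Cells with i + j == d form one anti-diagonal and are mutually independent.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


# Hypothetical sequences sharing the local segment "ACG".
print(local_alignment_score("TTACGTT", "GGACGGG"))
```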

  18. Molecular characterization of a novel pattern recognition protein from nonspecific cytotoxic cells: sequence analysis, phylogenetic comparisons and anti-microbial activity of a recombinant homologue.

    PubMed

    Evans, Donald L; Kaur, Harjeet; Leary, John; Praveen, Kesavannair; Jaso-Friedmann, Liliana

    2005-01-01

    Nonspecific cytotoxic cells (NCC) are the first identified and most extensively studied killer cell population in teleosts. NCC kill a wide variety of target cells including tumor cells, virally transformed cells and protozoan parasites. The present study identified a novel evolutionarily conserved oligodeoxynucleotide (ODN) binding membrane protein expressed by channel catfish (Ictalurus punctatus) NCC. Peptide fingerprinting analysis of the ODN binding protein (referred to as NCC cationic anti-microbial protein-1/ncamp-1) identified a peptide that was used to design degenerate primers. A catfish NCC cDNA library was used as template with these primers and the PCR-amplified product was sequenced. The translated sequence contained 203 amino acids (molecular mass of 22,064.63 Da) with characteristic lysine rich regions and a pI=pH 10.75. Sequence comparisons of this protein indicated similarity to zebrafish (51.2%) histone family member 1-X and (to a lesser extent) to trout H1. A search of EST databases confirmed that ncamp-1 is also expressed in various tissues of channel catfish as well as zebrafish. Inspection for signature repeats in ncamp-1 and comparisons with histone-like peptides from different species indicated the presence of multiple lysine based motifs composed of AKKA or PKK repeats. The novel protein was cloned, expressed in E. coli and the recombinant was used to generate rabbit anti-serum. The recombinant ncamp-1 bound GpC and CpG ODNs and was detected with homologous anti-ncamp-1 polyclonal antibodies. Western blots of NCC membranes using anti-ncamp-1 serum detected a 29 kDa protein. Binding competition experiments demonstrated that anti-ncamp-1 antibodies and GpC bound to the same protein on NCC. Two different truncated forms of ncamp-1 as well as the full-length recombinant protein exhibited anti-microbial activity. 
    The present study demonstrated the expression by NCC of a new membrane protein that may participate in the recognition of bacterial

  19. Comparison of topological clustering within protein networks using edge metrics that evaluate full sequence, full structure, and active site microenvironment similarity

    PubMed Central

    Leuthaeuser, Janelle B; Knutson, Stacy T; Kumar, Kiran; Babbitt, Patricia C; Fetrow, Jacquelyn S

    2015-01-01

    The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods. PMID:26073648

  20. Graphene Nanopores for Protein Sequencing

    PubMed Central

    Wilson, James; Sloman, Leila; He, Zhiren

    2016-01-01

    An inexpensive, reliable method for protein sequencing is essential to unraveling the biological mechanisms governing cellular behavior and disease. Current protein sequencing methods suffer from limitations associated with the size of proteins that can be sequenced, the time, and the cost of the sequencing procedures. Here, we report the results of all-atom molecular dynamics simulations that investigated the feasibility of using graphene nanopores for protein sequencing. We focus our study on the biologically significant phenylalanine-glycine repeat peptides (FG-nups)—parts of the nuclear pore transport machinery. Surprisingly, we found FG-nups to behave similarly to single stranded DNA: the peptides adhere to graphene and exhibit step-wise translocation when subject to a transmembrane bias or a hydrostatic pressure gradient. Reducing the peptide’s charge density or increasing the peptide’s hydrophobicity was found to decrease the translocation speed. Yet, unidirectional and stepwise translocation driven by a transmembrane bias was observed even when the ratio of charged to hydrophobic amino acids was as low as 1:8. The nanopore transport of the peptides was found to produce stepwise modulations of the nanopore ionic current correlated with the type of amino acids present in the nanopore, suggesting that protein sequencing by measuring ionic current blockades may be possible. PMID:27746710

  1. Purification and N-terminal amino acid sequence comparisons of structural proteins from retrovirus-D/Washington and Mason-Pfizer monkey virus.

    PubMed Central

    Henderson, L E; Sowder, R; Smythers, G; Benveniste, R E; Oroszlan, S

    1985-01-01

    A new D-type retrovirus originally designated SAIDS-D/Washington and here referred to as retrovirus-D/Washington (R-D/W) was recently isolated at the University of Washington Primate Center, Seattle, Wash., from a rhesus monkey with an acquired immunodeficiency syndrome and retroperitoneal fibromatosis. To better establish the relationship of this new D-type virus to the prototype D-type virus, Mason-Pfizer monkey virus (MPMV), we have purified and compared six structural proteins from each virus. The proteins purified from each D-type retrovirus include p4, p10, p12, p14, p27, and a phosphoprotein designated pp18 for MPMV and pp20 for R-D/W. Amino acid analysis and N-terminal amino acid sequence analysis show that the p4, p12, p14, and p27 proteins of R-D/W are distinct from the homologous proteins of MPMV but that these proteins from the two different viruses share a high degree of amino acid sequence homology. The p10 proteins from the two viruses have similar amino acid compositions, and both are blocked to N-terminal Edman degradation. The phosphoproteins from the two viruses each contain phosphoserine but are different from each other in amino acid composition, molecular weight, and N-terminal amino acid sequence. The data thus show that each of the R-D/W proteins examined is distinguishable from its MPMV homolog and that a major difference between these two D-type retroviruses is found in the viral phosphoproteins. The N-terminal amino acid sequences of D-type retroviral proteins were used to search for sequence homologies between D-type and other retroviral amino acid sequences. An unexpected amino acid sequence homology was found between R-D/W pp20 (a gag protein) and a 28-residue segment of the env precursor polyprotein of Rous sarcoma virus. The N-terminal amino acid sequences of the D-type major gag protein (p27) and the nucleic acid-binding protein (p14) show only limited amino acid sequence homology to functionally homologous proteins of C

  2. Rosetta stone method for detecting protein function and protein-protein interactions from genome sequences

    DOEpatents

    Eisenberg, David; Marcotte, Edward M.; Pellegrini, Matteo; Thompson, Michael J.; Yeates, Todd O.

    2002-10-15

    A computational method, system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B'. Another method compares the genomic sequence of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method a combination of the above two methods is used to predict functional links.
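
    The phylogenetic profile described in this abstract can be pictured as a presence/absence vector per protein, with proteins that share a profile proposed as functionally linked. A minimal sketch follows. The protein and genome names are hypothetical, and requiring an exact profile match is a simplifying assumption, not a detail given in the record.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def phylogenetic_profiles(presence: Dict[str, Set[str]],
                          genomes: List[str]) -> Dict[str, Tuple[int, ...]]:
    """Map each protein to a 0/1 vector: 1 if a homolog is present in that genome."""
    return {protein: tuple(int(g in found) for g in genomes)
            for protein, found in presence.items()}


def group_by_profile(profiles: Dict[str, Tuple[int, ...]]) -> List[List[str]]:
    """Return groups of two or more proteins that share an identical profile."""
    groups: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for protein, profile in profiles.items():
        groups[profile].append(protein)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Hypothetical homolog presence data across four genomes.
genomes = ["g1", "g2", "g3", "g4"]
presence = {
    "P1": {"g1", "g3"},
    "P2": {"g1", "g3"},
    "P3": {"g2"},
    "P4": {"g1", "g2", "g3", "g4"},
}
profiles = phylogenetic_profiles(presence, genomes)
linked = group_by_profile(profiles)
print(profiles["P1"], linked)
```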

  3. Distinguishing Proteins From Arbitrary Amino Acid Sequences

    PubMed Central

    Yau, Stephen S.-T.; Mao, Wei-Guang; Benson, Max; He, Rong Lucy

    2015-01-01

    What kinds of amino acid sequences could possibly be protein sequences? From all existing databases that we can find, known proteins are only a small fraction of all possible combinations of amino acids. Beginning with Sanger's first detailed determination of a protein sequence in 1952, previous studies have focused on describing the structure of existing protein sequences in order to construct the protein universe. No one, however, has developed a criterion for determining whether an arbitrary amino acid sequence can be a protein. Here we show that when the collection of arbitrary amino acid sequences is viewed in an appropriate geometric context, the protein sequences cluster together. This leads to a new computational test, described here, that has proved to be remarkably accurate at determining whether an arbitrary amino acid sequence can be a protein. Even more, if the results of this test indicate that the sequence can be a protein, and it is indeed a protein sequence, then its identity as a protein sequence is uniquely defined. We anticipate our computational test will be useful for those who are attempting to complete the job of discovering all proteins, or constructing the protein universe. PMID:25609314

  4. Screening, diversity and partial sequence comparison of vegetative insecticidal protein (vip3A) genes in the local isolates of Bacillus thuringiensis Berliner.

    PubMed

    Asokan, R; Swamy, H M Mahadeva; Arora, D K

    2012-04-01

    To improve the prospects for insect control, more Vip proteins should be sought out and studied to predict their insecticidal activity. Here, Bt isolates were characterized by direct sequencing of PCR amplicons obtained with primers specific to the vip3A gene, followed by phylogenetic analysis, in order to discover novel Vip protein genes. Twelve of the 18 isolates screened were positive with the vip gene-specific primers. A BLAST homology search with the partial sequences showed that 11 isolates had high similarity to the vip3Aa gene and only one fragment to the vip3Ae gene (25-100% at the nucleotide and amino acid levels). Phylogenetic analysis indicated that divergence within the vip genes was associated with geographic separation, consistent with the evaluation of distinct bacterial populations. Despite the geographical distances, strains harbouring vip genes appear to have originated from common ancestors and may contribute significantly to the control of resistant insect pests. Some strains have evolved to be quite distinct and others remain members of closely related groups. The reported method is a powerful tool for finding novel Vip3A proteins in large collections of Bt strains and is effective in terms of time and cost. Further, the Vip proteins produced by different strains of B. thuringiensis are unique in terms of sequence divergence and hence may also differ in their insecticidal activities.

  5. Exploration of sequence space for protein engineering.

    PubMed

    Gustafsson, C; Govindarajan, S; Emig, R

    2001-01-01

    The process of protein engineering is currently evolving towards a heuristic understanding of the sequence-function relationship. Improved DNA sequencing capacity, efficient protein function characterization and improved quality of data points in conjunction with well-established statistical tools from other industries are changing the protein engineering field. Algorithms capturing the heuristic sequence-function relationships will have a drastic impact on the field of protein engineering. In this review, several alternative approaches to quantitatively assess sequence space are discussed and the relatively few examples of wet-lab validation of statistical sequence-function characterization/correlation are described.

  6. Method and apparatus for biological sequence comparison

    DOEpatents

    Marr, Thomas G.; Chang, William I-Wei

    1997-01-01

    A method and apparatus for comparing biological sequences from a known source of sequences, with a subject (query) sequence. The apparatus takes as input a set of target similarity levels (such as evolutionary distances in units of PAM), and finds all fragments of known sequences that are similar to the subject sequence at each target similarity level, and are long enough to be statistically significant. The invention device filters out fragments from the known sequences that are too short, or have a lower average similarity to the subject sequence than is required by each target similarity level. The subject sequence is then compared only to the remaining known sequences to find the best matches. The filtering member divides the subject sequence into overlapping blocks, each block being sufficiently large to contain a minimum-length alignment from a known sequence. For each block, the filter member compares the block with every possible short fragment in the known sequences and determines a best match for each comparison. The determined set of short fragment best matches for the block provide an upper threshold on alignment values. Regions of a certain length from the known sequences that have a mean alignment value upper threshold greater than a target unit score are concatenated to form a union. The current block is compared to the union and provides an indication of best local alignment with the subject sequence.
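
    The block-division step of the filter can be pictured with a small sketch. The abstract does not give the block size or overlap, so the choice below (blocks of twice the minimum alignment length, starting every minimum-length residues) is an assumption made only so that every window of the minimum length falls entirely inside at least one block. The function name and parameters are hypothetical.

```python
def overlapping_blocks(seq, min_len):
    """Split seq into blocks of length 2*min_len that start every min_len residues.

    Assumed layout (not taken from the patent): with this spacing, any window
    of min_len consecutive residues lies wholly within some block.
    """
    if min_len <= 0:
        raise ValueError("min_len must be positive")
    # Last start position that still leaves room for a min_len window.
    last_start = max(len(seq) - min_len, 0)
    return [(start, seq[start:start + 2 * min_len])
            for start in range(0, last_start + 1, min_len)]


if __name__ == "__main__":
    for start, block in overlapping_blocks("ABCDEFGHIJ", 3):
        print(start, block)
```

    Each block would then be compared against short fragments of the known sequences; that scoring step is not sketched here.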

  7. Method and apparatus for biological sequence comparison

    DOEpatents

    Marr, T.G.; Chang, W.I.

    1997-12-23

    A method and apparatus are disclosed for comparing biological sequences from a known source of sequences, with a subject (query) sequence. The apparatus takes as input a set of target similarity levels (such as evolutionary distances in units of PAM), and finds all fragments of known sequences that are similar to the subject sequence at each target similarity level, and are long enough to be statistically significant. The invention device filters out fragments from the known sequences that are too short, or have a lower average similarity to the subject sequence than is required by each target similarity level. The subject sequence is then compared only to the remaining known sequences to find the best matches. The filtering member divides the subject sequence into overlapping blocks, each block being sufficiently large to contain a minimum-length alignment from a known sequence. For each block, the filter member compares the block with every possible short fragment in the known sequences and determines a best match for each comparison. The determined set of short fragment best matches for the block provide an upper threshold on alignment values. Regions of a certain length from the known sequences that have a mean alignment value upper threshold greater than a target unit score are concatenated to form a union. The current block is compared to the union and provides an indication of best local alignment with the subject sequence. 5 figs.

  8. A new graphical representation of protein sequences and its applications

    NASA Astrophysics Data System (ADS)

    Hou, Wenbing; Pan, Qiuhui; He, Mingfeng

    2016-02-01

    Sequence analysis is one of the foundations of bioinformatics because of the abundant information hidden in sequences. It helps scientists study the function of DNA, proteins and cells. In this paper, we outline a novel method for protein sequence similarity analysis based on the physicochemical properties of amino acids. We treat the protein sequence as a rigid body with mass. We then introduce the moment of inertia into the calculation of sequence similarity, and the sequences are transformed into vectors by the moment of inertia tensor. The Euclidean distance is employed as the measure of similarity. Finally, comparison with results from other references shows that our approach is reasonable and effective.

  9. PROMPT: a protein mapping and comparison tool

    PubMed Central

    Schmidt, Thorsten; Frishman, Dmitrij

    2006-01-01

    Background Comparison of large protein datasets has become a standard task in bioinformatics. Typically researchers wish to know whether one group of proteins is significantly enriched in certain annotation attributes or sequence properties compared to another group, and whether this enrichment is statistically significant. In order to conduct such comparisons it is often required to integrate molecular sequence data and experimental information from disparate incompatible sources. While many specialized programs exist for comparisons of this kind in individual problem domains, such as expression data analysis, no generic software solution capable of addressing a wide spectrum of routine tasks in comparative proteomics is currently available. Results PROMPT is a comprehensive bioinformatics software environment which enables the user to compare arbitrary protein sequence sets, revealing statistically significant differences in their annotation features. It allows automatic retrieval and integration of data from a multitude of molecular biological databases as well as from a custom XML format. Similarity-based mapping of sequence IDs makes it possible to link experimental information obtained from different sources despite discrepancies in gene identifiers and minor sequence variation. PROMPT provides a full set of statistical procedures to address the following four use cases: i) comparison of the frequencies of categorical annotations between two sets, ii) enrichment of nominal features in one set with respect to another one, iii) comparison of numeric distributions, and iv) correlation of numeric variables. Analysis results can be visualized in the form of plots and spreadsheets and exported in various formats, including Microsoft Excel. Conclusion PROMPT is a versatile, platform-independent, easily expandable, stand-alone application designed to be a practical workhorse in analysing and mining protein sequences and associated annotation. 
The availability of the

  10. Multiple alignment-free sequence comparison

    PubMed Central

    Ren, Jie; Song, Kai; Sun, Fengzhu; Deng, Minghua; Reinert, Gesine

    2013-01-01

    Motivation: Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, and , extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and second, , and , averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences. Results: Our investigation uses both simulated data as well as cis-regulatory module data where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that although for real data, all of our statistics show a similar performance, on simulated data the Shepp-type statistics are in some instances outperformed by star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics. Availability: Our implementation of the five statistics is available as an R package named ‘multiAlignFree’ at http://www-rcf.usc.edu/∼fsun/Programs/multiAlignFree/multiAlignFreemain.html. Contact: reinert@stats.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23990418
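
    As a rough illustration of what "k-tuple word content" means for a pair of sequences, the sketch below computes the classic unnormalized D2 statistic, the inner product of two k-tuple count vectors. It is not one of the five statistics in this record, which are normalized variants extended to more than two sequences, and the function names are hypothetical.

```python
from collections import Counter


def ktuple_counts(seq, k):
    """Count every overlapping k-tuple (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Unnormalized D2: sum over words of count_in_a * count_in_b."""
    counts_a = ktuple_counts(seq_a, k)
    counts_b = ktuple_counts(seq_b, k)
    # Counter returns 0 for words absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGTTTTT", 2))
```

    No alignment is computed; two sequences score highly simply because they share many words, which is what makes this family of statistics fast.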

  11. The partial sequencing of the genomic RNA of a UK isolate of Pepino mosaic virus and the comparison of the coat protein sequence with other isolates from Europe and Peru.

    PubMed

    Mumford, R A; Metcalfe, E J

    2001-12-01

    A 3599 nucleotide portion of the genomic RNA of a UK isolate of Pepino mosaic virus (PepMV), isolated from tomato, has been sequenced (Accession No. AF340024). The region sequenced includes the 3'-end of the RNA polymerase, the triple gene block (TGB), the coat protein (CP) and 3' untranslated region (UTR). In addition, the CP sequences of another 15 PepMV isolates, including 14 European tomato isolates and a Peruvian pepino isolate, have been determined and compared. This analysis shows that all the tomato isolates share over 99% identity, but only between 96-97% identity with the Peruvian pepino isolate.

  12. HIV protein sequence hotspots for crosstalk with host hub proteins.

    PubMed

    Sarmady, Mahdi; Dampier, William; Tozeren, Aydin

    2011-01-01

    HIV proteins target host hub proteins for transient binding interactions. The presence of viral proteins in the infected cell results in out-competition of host proteins in their interaction with hub proteins, drastically affecting cell physiology. Functional genomics and interactome datasets can be used to quantify the sequence hotspots on the HIV proteome mediating interactions with host hub proteins. In this study, we used the HIV and human interactome databases to identify HIV targeted host hub proteins and their host binding partners (H2). We developed a high throughput computational procedure utilizing motif discovery algorithms on sets of protein sequences, including sequences of HIV and H2 proteins. We identified as HIV sequence hotspots those linear motifs that are highly conserved on HIV sequences and at the same time have a statistically enriched presence on the sequences of H2 proteins. The HIV protein motifs discovered in this study are expressed by subsets of H2 host proteins potentially outcompeted by HIV proteins. A large subset of these motifs is involved in cleavage, nuclear localization, phosphorylation, and transcription factor binding events. Many such motifs are clustered on an HIV sequence in the form of hotspots. The sequential positions of these hotspots are consistent with the curated literature on phenotype altering residue mutations, as well as with existing binding site data. The hotspot map produced in this study is the first global portrayal of HIV motifs involved in altering the host protein network at highly connected hub nodes.

  13. A platform for biological sequence comparison on parallel computers.

    PubMed

    Deshpande, A S; Richards, D S; Pearson, W R

    1991-04-01

    We have written two programs for searching biological sequence databases that run on Intel hypercube computers. PSCANLIB compares a single sequence against a sequence library, and PCOMPLIB compares all the entries in one sequence library against a second library. The programs provide a general framework for similarity searching; they include functions for reading in query sequences, search parameters and library entries, and reporting the results of a search. We have isolated the code for the specific function that calculates the similarity score between the query and library sequence; alternative searching algorithms can be implemented by editing two files. We have implemented the rapid FASTA sequence comparison algorithm and the more rigorous Smith-Waterman algorithm within this framework. The PSCANLIB program on a 16 node iPSC/2 80386-based hypercube can compare a 229 amino acid protein sequence with a 3.4 million residue sequence library in approximately 16 s with the FASTA algorithm. Using the Smith-Waterman algorithm, the same search takes 35 min. The PCOMPLIB program can compare a 0.8 million amino acid protein sequence library with itself in 5.3 min with FASTA on a third-generation 32 node Intel iPSC/860 hypercube.
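
    The gap between the two timings reflects the cost of the rigorous method, which fills a full dynamic-programming table for every query/library pair. A minimal sketch of the Smith-Waterman local alignment score follows. The match, mismatch and linear gap values are illustrative assumptions; a protein search like the one described would normally use a substitution matrix, and nothing here is taken from the PSCANLIB code.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b under a linear gap penalty.

    Scoring parameters are illustrative defaults, not those of any
    particular program. Only two rows of the table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTXX", "YYACGTYY"))  # 8: the shared ACGT
```

    The running time grows with the product of the two sequence lengths, which is why heuristic methods such as FASTA restrict the region of the table they examine.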

  14. Recently published protein sequences. I.

    NASA Technical Reports Server (NTRS)

    Jukes, T. H.; Holmquist, R.

    1972-01-01

    Some polypeptide sequences that have been published in the 1972 scientific literature are listed. Only selected sequences are included. The compilation has two objectives. Current information between periods when more comprehensive compilations are published is to be assembled and the use of data that do not include arrangements of unsequenced peptides for 'maximum homology' is to be encouraged.

  15. [Sequence analysis of the coat protein gene of Chinese soybean mosaic virus strain SC7 and comparison with those of SMV strains from the USA].

    PubMed

    Cai, Chun-Mei; Jiang, Xiao; Zhao, Chun-Mei; Ma, Jian-Xin

    2014-09-01

    To unveil genetic variations between the predominant soybean mosaic virus (SMV) strains in China and in the USA, as well as to reveal the potential relevance between the similarity of gene sequences and the virulence of the viruses, we isolated and sequenced the coat protein (CP) gene of Chinese SMV strain SC7 by RT-PCR and compared the SC7 sequence with those of SMV strains from the USA. Analysis showed that the CP gene of SC7 was 795 nucleotides in length and encoded 265 amino acids. The CP gene of SC7 and those of the strains from the USA exhibited 4%-5% nucleotide diversity and 1%-2% amino acid diversity. The conserved amino-acid sequence associated with aphid spread in the USA strains was DAG, and corresponded to DAD in SC7. The virulence of SC7 was greater than that of the SMV strains from the USA. Nevertheless, no clear relationships between sequence similarity of the CP genes from different strains and their virulence on differential hosts were found.

  16. A mathematical framework for protein structure comparison.

    PubMed

    Liu, Wei; Srivastava, Anuj; Zhang, Jinfeng

    2011-02-03

    Comparison of protein structures is important for revealing the evolutionary relationship among proteins, predicting protein functions and predicting protein structures. Many methods have been developed in the past to align two or multiple protein structures. Despite the importance of this problem, rigorous mathematical or statistical frameworks have seldom been pursued for general protein structure comparison. One notable issue in this field is that with many different distances used to measure the similarity between protein structures, none of them are proper distances when protein structures of different sequences are compared. Statistical approaches based on those non-proper distances or similarity scores as random variables are thus not mathematically rigorous. In this work, we develop a mathematical framework for protein structure comparison by treating protein structures as three-dimensional curves. Using an elastic Riemannian metric on spaces of curves, geodesic distance, a proper distance on spaces of curves, can be computed for any two protein structures. In this framework, protein structures can be treated as random variables on the shape manifold, and means and covariance can be computed for populations of protein structures. Furthermore, these moments can be used to build Gaussian-type probability distributions of protein structures for use in hypothesis testing. The covariance of a population of protein structures can reveal the population-specific variations and be helpful in improving structure classification. With curves representing protein structures, the matching is performed using elastic shape analysis of curves, which can effectively model conformational changes and insertions/deletions. We show that our method performs comparably with commonly used methods in protein structure classification on a large manually annotated data set.

  17. The Shannon information entropy of protein sequences.

    PubMed Central

    Strait, B J; Dewey, T G

    1996-01-01

    A comprehensive data base is analyzed to determine the Shannon information content of a protein sequence. This information entropy is estimated by three methods: a k-tuplet analysis, a generalized Zipf analysis, and a "Chou-Fasman gambler." The k-tuplet analysis is a "letter" analysis, based on conditional sequence probabilities. The generalized Zipf analysis demonstrates the statistical linguistic qualities of protein sequences and uses the "word" frequency to determine the Shannon entropy. The Zipf analysis and k-tuplet analysis give Shannon entropies of approximately 2.5 bits/amino acid. This entropy is much smaller than the value of 4.18 bits/amino acid obtained from the nonuniform composition of amino acids in proteins. The "Chou-Fasman" gambler is an algorithm based on the Chou-Fasman rules for protein structure. It uses both sequence and secondary structure information to guess at the number of possible amino acids that could appropriately substitute into a sequence. As in the case for the English language, the gambler algorithm gives significantly lower entropies than the k-tuplet analysis. Using these entropies, the number of most probable protein sequences can be calculated. The number of most probable protein sequences is much less than the number of possible sequences but is still much larger than the number of sequences thought to have existed throughout evolution. Implications of these results for mutagenesis experiments are discussed. PMID:8804598
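
    The composition-based figure quoted above comes from the standard formula H = -sum(p_i * log2(p_i)) over single-residue frequencies. If all 20 amino acids were used equally, H would be log2(20), about 4.32 bits, so a value of 4.18 reflects a mildly nonuniform composition. The sketch below computes this compositional entropy for one sequence; it does not reproduce the k-tuplet, Zipf or gambler estimates, and the function name is hypothetical.

```python
import math
from collections import Counter


def composition_entropy(seq):
    """Shannon entropy in bits per residue from single-residue frequencies."""
    counts = Counter(seq)
    n = len(seq)
    # An empty sequence has no residues and contributes zero entropy.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    uniform = "ACDEFGHIKLMNPQRSTVWY"  # each of the 20 residues once
    print(round(composition_entropy(uniform), 2))  # 4.32
```

    Estimates that condition on neighbouring residues or on structure, as in the record above, can only lower this number, since context removes uncertainty.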

  18. Amyloidogenic sequences in native protein structures

    PubMed Central

    Tzotzos, Susan; Doig, Andrew J

    2010-01-01

    Numerous short peptides have been shown to form β-sheet amyloid aggregates in vitro. Proteins that contain such sequences are likely to be problematic for a cell, due to their potential to aggregate into toxic structures. We investigated the structures of 30 proteins containing 45 sequences known to form amyloid, to see how the proteins cope with the presence of these potentially toxic sequences, studying secondary structure, hydrogen-bonding, solvent accessible surface area and hydrophobicity. We identified two mechanisms by which proteins avoid aggregation: Firstly, amyloidogenic sequences are often found within helices, despite their inherent preference to form β structure. Helices may offer a selective advantage, since in order to form amyloid the sequence will presumably have to first unfold and then refold into a β structure. Secondly, amyloidogenic sequences that are found in β structure are usually buried within the protein. Surface exposed amyloidogenic sequences are not tolerated in strands, presumably because they lead to protein aggregation via assembly of the amyloidogenic regions. The use of α-helices, where amyloidogenic sequences are forced into helix, despite their intrinsic preference for β structure, is thus a widespread mechanism to avoid protein aggregation. PMID:20027621

  19. Construction of validated, non-redundant composite protein sequence databases.

    PubMed

    Bleasby, A J; Wootton, J C

    1990-01-01

    A strategy has been developed for the construction of a validated, comprehensive composite protein sequence database. Entries are amalgamated from primary source data bases by a largely automated set of processes in which redundant and trivially different entries are eliminated. A modular approach has been adopted to allow scientific judgement to be used at each stage of database processing and amalgamation. Source databases are assigned a priority depending on the quality of sequence validation and commenting. Rejection of entries from the lower priority database, in each pairwise comparison of databases, is carried out according to optionally defined redundancy criteria based on sequence segment mismatches. Efficient algorithms for this methodology are embodied in the COMPO software system. COMPO has been applied for over 2 years in construction and regular updating of the OWL composite protein sequence database from the source databases NBRF-PIR, SWISS-PROT, a GenBank translation retrieved from the feature tables, NBRF-NEW, NEWAT86, PSD-KYOTO and the sequences contained in the Brookhaven protein structure databank. OWL is part of the ISIS integrated data resource of protein sequence and structure [Akrigg et al. (1988) Nature, 335, 745-746]. The modular nature of the integration process greatly facilitates the frequent updating of OWL following releases of the source databases. The extent of redundancy in these sources is revealed by the comparison process. The advantages of a robust composite database for sequence similarity searching and information retrieval are discussed.
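
    The priority-based rejection step can be sketched as follows. This is a simplification under stated assumptions: redundancy is reduced to exact sequence identity, whereas the record describes optionally defined criteria based on sequence segment mismatches, and the function and data layout are hypothetical rather than those of COMPO.

```python
def merge_by_priority(databases):
    """Merge source databases, keeping duplicates only from the best source.

    databases: list of (name, {entry_id: sequence}) pairs, ordered from
    highest to lowest priority. Redundancy here means an identical sequence
    string, which is a simplifying assumption.
    """
    seen = set()
    merged = []
    for name, entries in databases:
        for entry_id, seq in entries.items():
            if seq in seen:
                continue  # reject the lower-priority copy
            seen.add(seq)
            merged.append((name, entry_id, seq))
    return merged


if __name__ == "__main__":
    sources = [
        ("high", {"h1": "MKV", "h2": "MLL"}),
        ("low", {"l1": "MKV", "l2": "MQQ"}),
    ]
    for row in merge_by_priority(sources):
        print(row)
```

    Processing sources in priority order means the better-validated entry is always the one retained, which mirrors the pairwise rejection rule described above.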

  20. Comparison of the complete sequences of three different isolates of Pepino mosaic virus: size variability of the TGBp3 protein between tomato and L. peruvianum isolates.

    PubMed

    López, C; Soler, S; Nuez, F

    2005-03-01

    The complete nucleotide sequences of the genomes of two Spanish isolates (LE-2000 and LE-2002) from tomato and one Peruvian isolate (LP-2001) from Lycopersicon peruvianum of Pepino mosaic virus (PepMV) were determined. The tomato isolates share identities higher than 99%, while the genome of LP-2001 had mean nucleotide identities of 95.6% to 96.0% with tomato isolates. The predicted amino acid sequences showed similarities ranging between 95.2% and 100% with TGBp3 and TGBp2 and CP proteins, respectively. In LP-2001 two main differences were found with respect to the tomato isolates: (i) the 5' untranslated region (UTR) was 2 nt shorter by deletion at position 12-13 and it had some polymorphisms at the putative promoter sequence reported for PepMV tomato isolates and other potexviruses, which could be functionally significant for RNA replication, and (ii) the TGBp3 protein had two extra amino acids in the C-terminal region.

  1. Fold homology detection using sequence fragment composition profiles of proteins.

    PubMed

    Solis, Armando D; Rackovsky, Shalom R

    2010-10-01

    The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so-called "twilight zone" problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment-free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20-letter amino acid alphabet) into a more tractable number of reduced tetramers (approximately 15-30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database that share low pairwise sequence similarity. Using the receiver-operating characteristic measure, we demonstrate potentially significant improvement in using information-optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the "twilight zone". 2010 Wiley-Liss, Inc.

  2. Fold Homology Detection Using Sequence Fragment Composition Profiles of Proteins

    PubMed Central

    Solis, Armando D.; Rackovsky, Shalom R.

    2010-01-01

    The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so-called “twilight zone” problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment-free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20-letter amino acid alphabet) into a more tractable number of reduced tetramers (around 15 to 30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database, that share low pairwise sequence similarity. Using the receiver operating characteristic (ROC) measure, we demonstrate potentially significant improvement in using information-optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the “twilight zone”. PMID:20635424
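
    The figure of 160,000 is 20 to the fourth power, the number of distinct four-residue words over the 20-letter alphabet. The sketch below shows how grouping amino acids shrinks that space: a two-class alphabet yields 2**4 = 16 reduced tetramers, within the 15 to 30 range mentioned above. The hydrophobic/other grouping is a hypothetical stand-in chosen for illustration; it is not the information-optimized clustering of the paper, and the function names are assumptions.

```python
from collections import Counter
from itertools import product

# Hypothetical two-class grouping, NOT the clustering derived in the paper.
HYDROPHOBIC = set("AVLIMFWC")


def reduce_residue(residue):
    """Map a residue to 'h' (hydrophobic) or 'p' (everything else)."""
    return "h" if residue in HYDROPHOBIC else "p"


def tetramer_profile(seq):
    """Frequencies of the 16 reduced tetramers, in a fixed key order."""
    reduced = "".join(reduce_residue(r) for r in seq)
    counts = Counter(reduced[i:i + 4] for i in range(len(reduced) - 3))
    total = sum(counts.values())
    keys = ["".join(t) for t in product("hp", repeat=4)]
    return [counts[k] / total if total else 0.0 for k in keys]


if __name__ == "__main__":
    print(20 ** 4)  # 160000 unreduced tetramers
    print(len(tetramer_profile("AVLIMKDE")))  # 16
```

    Two proteins can then be compared by any vector distance between their profiles, without aligning them.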

  3. Protein Sequencing with Tandem Mass Spectrometry

    NASA Astrophysics Data System (ADS)

    Ziady, Assem G.; Kinter, Michael

    The recent introduction of electrospray ionization techniques that are suitable for peptides and whole proteins has allowed for the design of mass spectrometric protocols that provide accurate sequence information for proteins. The advantages gained by these approaches over traditional Edman Degradation sequencing include faster analysis and femtomole, sometimes attomole, sensitivity. The ability to efficiently identify proteins has allowed investigators to conduct studies on their differential expression or modification in response to various treatments or disease states. In this chapter, we discuss the use of electrospray tandem mass spectrometry, a technique whereby protein-derived peptides are subjected to fragmentation in the gas phase, revealing sequence information for the protein. This powerful technique has been instrumental for the study of proteins and markers associated with various disorders, including heart disease, cancer, and cystic fibrosis. We use the study of protein expression in cystic fibrosis as an example.

  4. A Comparative Study of Protein Sequence Clustering Algorithms

    NASA Astrophysics Data System (ADS)

    Eldin, A. Sharaf; Abdelgaber, S.; Soliman, T.; Kassim, S.; Abdo, A.

    In this paper, we survey four clustering techniques and discuss their advantages and drawbacks. A review of eight different protein sequence clustering algorithms has been accomplished. Moreover, a comparison between the algorithms on the basis of some factors has been presented.

  5. Finding important sites in protein sequences

    PubMed Central

    Bickel, Peter J.; Kechris, Katherina J.; Spector, Philip C.; Wedemayer, Gary J.; Glazer, Alexander N.

    2002-01-01

    By using sequence information from an aligned protein family, a procedure is exhibited for finding sites that may be functionally or structurally critical to the protein. Features based on sequence conservation within subfamilies in the alignment and associations between sites are used to select the sites. The sites are subject to statistical evaluation correcting for phylogenetic bias in the collection of sequences. This method is applied to two families: the phycobiliproteins, light-harvesting proteins in cyanobacteria, red algae, and cryptomonads, and the globins that function in oxygen storage and transport. The sites identified by the procedure are located in key structural positions and merit further experimental study. PMID:12417758

  6. Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks.

    PubMed

    Cao, Renzhi; Cheng, Jianlin

    2016-01-15

    Protein function prediction is an important and challenging problem in bioinformatics and computational biology. Functionally relevant biological information such as protein sequences, gene expression, and protein-protein interactions has been used mostly separately for protein function prediction. One of the major challenges is how to effectively integrate multiple sources of both traditional and new information such as spatial gene-gene interaction networks generated from chromosomal conformation data together to improve protein function prediction. In this work, we developed three different probabilistic scores (MIS, SEQ, and NET score) to combine protein sequence, function associations, and protein-protein interaction and spatial gene-gene interaction networks for protein function prediction. The MIS score is mainly generated from homologous proteins found by PSI-BLAST search, and also association rules between Gene Ontology terms, which are learned by mining the Swiss-Prot database. The SEQ score is generated from protein sequences. The NET score is generated from protein-protein interaction and spatial gene-gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. We tested SMISS on the data set of 2011 Critical Assessment of Function Annotation (CAFA). The method performed substantially better than three base-line methods and an advanced method based on protein profile-sequence comparison, profile-profile comparison, and domain co-occurrence networks according to the maximum F-measure. Copyright © 2015 Elsevier Inc. All rights reserved.

  7. Protein structure determination using metagenome sequence data.

    PubMed

    Ovchinnikov, Sergey; Park, Hahnbeom; Varghese, Neha; Huang, Po-Ssu; Pavlopoulos, Georgios A; Kim, David E; Kamisetty, Hetunandan; Kyrpides, Nikos C; Baker, David

    2017-01-20

    Despite decades of work by structural biologists, there are still ~5200 protein families with unknown structure outside the range of comparative modeling. We show that Rosetta structure prediction guided by residue-residue contacts inferred from evolutionary information can accurately model proteins that belong to large families and that metagenome sequence data more than triple the number of protein families with sufficient sequences for accurate modeling. We then integrate metagenome data, contact-based structure matching, and Rosetta structure calculations to generate models for 614 protein families with currently unknown structures; 206 are membrane proteins and 137 have folds not represented in the Protein Data Bank. This approach provides the representative models for large protein families originally envisioned as the goal of the Protein Structure Initiative at a fraction of the cost. Copyright © 2017, American Association for the Advancement of Science.

  8. Sequencing proteins with transverse ionic transport

    NASA Astrophysics Data System (ADS)

    Boynton, Paul; di Ventra, Massimiliano

    2015-03-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms. By obtaining the order of the amino acids that compose a given protein one can determine both its secondary and tertiary structures through protein structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer's Disease. Mass spectrometry is the current technique of choice for de novo sequencing, but because some amino acids have the same mass the sequence cannot be completely determined in many cases. In this paper we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel, similar to that proposed for DNA sequencing. Indeed, we find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique's potential for de novo protein sequencing.

  9. Protein structure prediction from sequence variation

    PubMed Central

    Marks, Debora S; Hopf, Thomas A; Sander, Chris

    2015-01-01

    Genomic sequences contain rich evolutionary information about functional constraints on macromolecules such as proteins. This information can be efficiently mined to detect evolutionary couplings between residues in proteins and address the long-standing challenge to compute protein three-dimensional structures from amino acid sequences. Substantial progress has recently been made on this problem owing to the explosive growth in available sequences and the application of global statistical methods. In addition to three-dimensional structure, the improved understanding of covariation may help identify functional residues involved in ligand binding, protein-complex formation and conformational changes. We expect computation of covariation patterns to complement experimental structural biology in elucidating the full spectrum of protein structures, their functional interactions and evolutionary dynamics. PMID:23138306

  10. Protein Structure Determination using Metagenome sequence data

    PubMed Central

    Ovchinnikov, Sergey; Park, Hahnbeom; Varghese, Neha; Huang, Po-Ssu; Pavlopoulos, Georgios A.; Kim, David E.; Kamisetty, Hetunandan; Kyrpides, Nikos C.; Baker, David

    2017-01-01

    Despite decades of work by structural biologists, there are still ~5200 protein families with unknown structure outside the range of comparative modeling. We show that Rosetta structure prediction guided by residue-residue contacts inferred from evolutionary information can accurately model proteins that belong to large families, and that metagenome sequence data more than triples the number of protein families with sufficient sequences for accurate modeling. We then integrate metagenome data, contact based structure matching and Rosetta structure calculations to generate models for 614 protein families with currently unknown structures; 206 are membrane proteins and 137 have folds not represented in the PDB. This approach provides the representative models for large protein families originally envisioned as the goal of the protein structure initiative at a fraction of the cost. PMID:28104891

  11. Alignments of DNA and protein sequences containing frameshift errors.

    PubMed

    Guan, X; Uberbacher, E C

    1996-02-01

    Molecular sequences, like all experimental data, are subject to error. Many current DNA sequencing protocols have very significant error rates and often generate artefactual insertions and deletions of bases (indels) which corrupt the translation of sequences and compromise the detection of protein homologies. The impact of these errors on the utility of molecular sequence data is dependent on the analytic technique used to interpret the data. In the presence of frameshift errors, standard algorithms using six-frame translation can miss important homologies because only subfragments of the correct translation are available in any given frame. We present a new algorithm which can detect and correct frameshift errors in DNA sequences during comparison of translated sequences with protein sequences in the databases. This algorithm can recognize homologous proteins sharing 30% identity even in the presence of a 7% frameshift error rate. Our algorithm uses dynamic programming, producing a guaranteed optimal alignment in the presence of frameshifts, and has a sensitivity equivalent to Smith-Waterman. The computational complexity of the algorithm is O(nm), where n and m are the lengths of the two sequences being compared. The algorithm does not rely on prior knowledge or heuristic rules and performs significantly better than any previously reported method.
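
    The abstract above measures its sensitivity against Smith-Waterman and cites O(nm) dynamic programming. As a point of reference only, plain Smith-Waterman local alignment (not the frameshift-aware algorithm of the paper) can be sketched as below. The function name and the scoring values (match, mismatch, linear gap penalty) are illustrative assumptions; a real protein search would normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are assumptions
    chosen for illustration, not parameters taken from the abstract.
    """
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    # Fill the (n+1) x (m+1) table: this double loop is the O(nm) cost.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXACDEFXX", "YYACDEFYY"))
```

    A frameshift-aware variant would, in addition, have to let the alignment switch between reading frames of the translated DNA; that extension is not shown here.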

  12. Inferring interaction partners from protein sequences

    PubMed Central

    Bitbol, Anne-Florence; Dwyer, Robert S.; Colwell, Lucy J.; Wingreen, Ned S.

    2016-01-01

    Specific protein−protein interactions are crucial in the cell, both to ensure the formation and stability of multiprotein complexes and to enable signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interaction partners, causing their sequences to be correlated. Here we exploit these correlations to accurately identify, from sequence data alone, which proteins are specific interaction partners. Our general approach, which employs a pairwise maximum entropy model to infer couplings between residues, has been successfully used to predict the 3D structures of proteins from sequences. Thus inspired, we introduce an iterative algorithm to predict specific interaction partners from two protein families whose members are known to interact. We first assess the algorithm’s performance on histidine kinases and response regulators from bacterial two-component signaling systems. We obtain a striking 0.93 true positive fraction on our complete dataset without any a priori knowledge of interaction partners, and we uncover the origin of this success. We then apply the algorithm to proteins from ATP-binding cassette (ABC) transporter complexes, and obtain accurate predictions in these systems as well. Finally, we present two metrics that accurately distinguish interacting protein families from noninteracting ones, using only sequence data. PMID:27663738

  13. Simultaneous Alignment and Folding of Protein Sequences

    PubMed Central

    Waldispühl, Jérôme; O'Donnell, Charles W.; Will, Sebastian; Devadas, Srinivas; Backofen, Rolf

    2014-01-01

    Developing accurate comparative analysis tools for low-homology proteins remains a difficult challenge in computational biology, especially for sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane β-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise and multiple sequence alignment tools in the most difficult low-sequence homology case. It also improves secondary structure prediction where current approaches fail. Importantly, partiFold-Align requires no prior training. These general techniques are widely applicable to many more protein families (partiFold-Align is available at http://partifold.csail.mit.edu/). PMID:24766258

  14. "Sequence space soup" of proteins and copolymers

    NASA Astrophysics Data System (ADS)

    Chan, Hue Sun; Dill, Ken A.

    1991-09-01

    To study the protein folding problem, we use exhaustive computer enumeration to explore "sequence space soup," an imaginary solution containing the "native" conformations (i.e., of lowest free energy) under folding conditions, of every possible copolymer sequence. The model is of short self-avoiding chains of hydrophobic (H) and polar (P) monomers configured on the two-dimensional square lattice. By exhaustive enumeration, we identify all native structures for every possible sequence. We find that random sequences of H/P copolymers will bear striking resemblance to known proteins: Most sequences under folding conditions will be approximately as compact as known proteins, will have considerable amounts of secondary structure, and it is most probable that an arbitrary sequence will fold to a number of lowest free energy conformations that is of order one. In these respects, this simple model shows that proteinlike behavior should arise simply in copolymers in which one monomer type is highly solvent averse. It suggests that the structures and uniquenesses of native proteins are not consequences of having 20 different monomer types, or of unique properties of amino acid monomers with regard to special packing or interactions, and thus that simple copolymers might be designable to collapse to proteinlike structures and properties. A good strategy for designing a sequence to have a minimum possible number of native states is to strategically insert many P monomers. Thus known proteins may be marginally stable due to a balance: More H residues stabilize the desired native state, but more P residues prevent simultaneous stabilization of undesired native states.
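
    The exhaustive enumeration described above can be illustrated with a small sketch. It assumes the usual HP-model convention that the energy is minus the number of H-H contacts between monomers that are lattice neighbors but not chain neighbors; the abstract does not spell out the energy function, so that choice, like all function names here, is an assumption. Conformations are counted once per rotation and mirror image.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def conformations(n):
    """Yield self-avoiding walks of n >= 2 monomers on the square lattice.

    The first bond is fixed along +x and the first off-axis step must go
    toward +y, so each shape appears once up to rotation and reflection.
    """
    def extend(path, turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            if not turned and dy == -1:
                continue  # mirror image of a walk that turns toward +y
            nxt = (x + dx, y + dy)
            if nxt in path:
                continue  # site already occupied: not self-avoiding
            path.append(nxt)
            yield from extend(path, turned or dy != 0)
            path.pop()

    yield from extend([(0, 0), (1, 0)], False)


def hh_contacts(seq, conf):
    """Count H-H pairs that are lattice neighbors but not bonded."""
    index_at = {site: i for i, site in enumerate(conf)}
    count = 0
    for i, (x, y) in enumerate(conf):
        if seq[i] != "H":
            continue
        # Look only right and up so every neighboring pair is seen once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and seq[j] == "H" and abs(i - j) > 1:
                count += 1
    return count


def native_states(seq):
    """Return (max H-H contacts, number of conformations reaching it)."""
    best, degeneracy = -1, 0
    for conf in conformations(len(seq)):
        c = hh_contacts(seq, conf)
        if c > best:
            best, degeneracy = c, 1
        elif c == best:
            degeneracy += 1
    return best, degeneracy


if __name__ == "__main__":
    for seq in ("HPPH", "PPPP", "HPHPPHPH"):
        print(seq, native_states(seq))
```

    The second value returned by native_states is the degeneracy of the lowest-energy level, the quantity the abstract reports as being of order one for most sequences. The cost grows exponentially with chain length, so this approach is only practical for short chains.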

  15. Prediction of protein function from protein sequence and structure.

    PubMed

    Whisstock, James C; Lesk, Arthur M

    2003-08-01

    The sequence of a genome contains the plans of the possible life of an organism, but implementation of genetic information depends on the functions of the proteins and nucleic acids that it encodes. Many individual proteins of known sequence and structure present challenges to the understanding of their function. In particular, a number of genes responsible for diseases have been identified but their specific functions are unknown. Whole-genome sequencing projects are a major source of proteins of unknown function. Annotation of a genome involves assignment of functions to gene products, in most cases on the basis of amino-acid sequence alone. 3D structure can aid the assignment of function, motivating the challenge of structural genomics projects to make structural information available for novel uncharacterized proteins. Structure-based identification of homologues often succeeds where sequence-alone-based methods fail, because in many cases evolution retains the folding pattern long after sequence similarity becomes undetectable. Nevertheless, prediction of protein function from sequence and structure is a difficult problem, because homologous proteins often have different functions. Many methods of function prediction rely on identifying similarity in sequence and/or structure between a protein of unknown function and one or more well-understood proteins. Alternative methods include inferring conservation patterns in members of a functionally uncharacterized family for which many sequences and structures are known. However, these inferences are tenuous. Such methods provide reasonable guesses at function, but are far from foolproof. It is therefore fortunate that the development of whole-organism approaches and comparative genomics permits other approaches to function prediction when the data are available. These include the use of protein-protein interaction patterns, and correlations between occurrences of related proteins in different organisms, as

  16. Sequence information signal processor for local and global string comparisons

    DOEpatents

    Peterson, John C.; Chow, Edward T.; Waterman, Michael S.; Hunkapillar, Timothy J.

    1997-01-01

    A sequence information signal processing integrated circuit chip is designed to perform high speed calculation of a dynamic programming algorithm based upon the algorithm defined by Waterman and Smith. The signal processing chip of the present invention is designed to be a building block of a linear systolic array, the performance of which can be increased by connecting additional sequence information signal processing chips to the array. The chip provides a high speed, low cost linear array processor that can locate highly similar global sequences or segments thereof such as contiguous subsequences from two different DNA or protein sequences. The chip is implemented in a preferred embodiment using CMOS VLSI technology to provide the equivalent of about 400,000 transistors or 100,000 gates. Each chip provides 16 processing elements, and is designed to provide 16 bit, two's complement operation for maximum score precision of between -32,768 and +32,767. It is designed to provide a comparison between sequences as long as 4,194,304 elements without external software and between sequences of unlimited numbers of elements with the aid of external software. Each sequence can be assigned different deletion and insertion weight functions. Each processor is provided with a similarity measure device which is independently variable. Thus, each processor can contribute to maximum value score calculation using a different similarity measure.

  17. Using Dali for structural comparison of proteins.

    PubMed

    Holm, Liisa; Kääriäinen, Sakari; Wilton, Chris; Plewczynski, Dariusz

    2006-07-01

    The Dali program is widely used for carrying out automatic comparisons of protein structures determined by X-ray crystallography or NMR. The most familiar version is the Dali server, which performs a database search comparing a query structure supplied by the user against the database of known structures (PDB) and returns the list of structural neighbors by e-mail. The more recently introduced DaliLite server compares two structures against each other and visualizes the result interactively. The Dali database is a structural classification based on precomputed all-against-all structural similarities within the PDB. The resulting hierarchical classification can be browsed on the Web and is linked to protein sequence classification resources. All Dali resources use an identical algorithm for structure comparison. Users may run Dali using the Web, or the program may be downloaded to be run locally on Linux computers.

  18. A Bioinformatic Approach to Inter Functional Interactions within Protein Sequences

    DTIC Science & Technology

    2009-02-23

    22] and Mycobacterium leprae [23], and secondly more closely related pathogenic genomes of Leptospira interrogans serovars Lai [24] and Leptospira...evident from Table 1b. The M. tuberculosis H37Rv genome contains 4,411,532 nucleotides coding for 3989 proteins sequences, and M. leprae contains...genomes using the PHOGs reduces the dimensionality of the alignment task. In the case of the M. tuberculosis H37Rv vs. M. leprae comparison, the

  19. HPMV: human protein mutation viewer - relating sequence mutations to protein sequence architecture and function changes.

    PubMed

    Sherman, Westley Arthur; Kuchibhatla, Durga Bhavani; Limviphuvadh, Vachiranee; Maurer-Stroh, Sebastian; Eisenhaber, Birgit; Eisenhaber, Frank

    2015-10-01

    Next-generation sequencing advances are rapidly expanding the number of human mutations to be analyzed for causative roles in genetic disorders. Our Human Protein Mutation Viewer (HPMV) is intended to explore the biomolecular mechanistic significance of non-synonymous human mutations in protein-coding genomic regions. The tool helps to assess whether protein mutations affect the occurrence of sequence-architectural features (globular domains, targeting signals, post-translational modification sites, etc.). As input, HPMV accepts protein mutations - as UniProt accessions with mutations (e.g. HGVS nomenclature), genome coordinates, or FASTA sequences. As output, HPMV provides an interactive cartoon showing the mutations in relation to elements of the sequence architecture. A large variety of protein sequence architectural features were selected for their particular relevance to mutation interpretation. Clicking a sequence feature in the cartoon expands a tree view of additional information including multiple sequence alignments of conserved domains and a simple 3D viewer mapping the mutation to known PDB structures, if available. The cartoon is also correlated with a multiple sequence alignment of similar sequences from other organisms. In cases where a mutation is likely to have a straightforward interpretation (e.g. a point mutation disrupting a well-understood targeting signal), this interpretation is suggested. The interactive cartoon can be downloaded as standalone viewer in Java jar format to be saved and viewed later with only a standard Java runtime environment. The HPMV website is: http://hpmv.bii.a-star.edu.sg/ .

  20. Sequence Motifs in MADS Transcription Factors Responsible for Specificity and Diversification of Protein-Protein Interaction

    PubMed Central

    van Dijk, Aalt D. J.; Morabito, Giuseppa; Fiers, Martijn; van Ham, Roeland C. H. J.; Angenent, Gerco C.; Immink, Richard G. H.

    2010-01-01

    Protein sequences encompass tertiary structures and contain information about specific molecular interactions, which in turn determine biological functions of proteins. Knowledge about how protein sequences define interaction specificity is largely missing, in particular for paralogous protein families with high sequence similarity, such as the plant MADS domain transcription factor family. In comparison to the situation in mammalian species, this important family of transcription regulators has expanded enormously in plant species and contains over 100 members in the model plant species Arabidopsis thaliana. Here, we provide insight into the mechanisms that determine protein-protein interaction specificity for the Arabidopsis MADS domain transcription factor family, using an integrated computational and experimental approach. Plant MADS proteins have highly similar amino acid sequences, but their dimerization patterns vary substantially. Our computational analysis uncovered small sequence regions that explain observed differences in dimerization patterns with reasonable accuracy. Furthermore, we show the usefulness of the method for prediction of MADS domain transcription factor interaction networks in other plant species. Introduction of mutations in the predicted interaction motifs demonstrated that single amino acid mutations can have a large effect and lead to loss or gain of specific interactions. In addition, various performed bioinformatics analyses shed light on the way evolution has shaped MADS domain transcription factor interaction specificity. Identified protein-protein interaction motifs appeared to be strongly conserved among orthologs, indicating their evolutionary importance. We also provide evidence that mutations in these motifs can be a source for sub- or neo-functionalization. The analyses presented here take us a step forward in understanding protein-protein interactions and the interplay between protein sequences and network evolution. 

  1. Structural Alphabets for Protein Structure Classification: a Comparison Study

    PubMed Central

    Le, Quan; Pollastri, Gianluca; Koehl, Patrice

    2009-01-01

    Finding structural similarities between proteins often helps reveal shared functionality which otherwise might not be detected by native sequence information alone. Such similarity is usually detected and quantified by protein structure alignment. Determining the optimal alignment between two protein structures remains, however, a hard problem. An alternative approach is to approximate each protein 3D structure using a sequence of motifs derived from a structural alphabet. Using this approach, structure comparison is performed by comparing the corresponding motif sequences, or structural sequences. In this paper, we measure the performance of such alphabets in the context of the protein structure classification problem. We consider both local and global structural sequences. Each letter of a local structural sequence corresponds to the best matching fragment to the corresponding local segment of the protein structure. The global structural sequence is designed to generate the best possible complete chain that matches the full protein structure. We use an alphabet of 20 letters, corresponding to a library of 20 motifs or protein fragments of size 4 residues. We show that the global structural sequences approximate well the native structures of proteins, with an average cRMS of 0.69 Å over 2225 test proteins. The approximation is best for all α-proteins, while relatively poorer for all β-proteins. We then test the performance of four different sequence representations of proteins (their native sequence, the sequence of their secondary structure elements, and the local and global structural sequences based on our fragment library) with different classifiers in their ability to classify proteins that belong to five distinct folds of CATH. Unsurprisingly, the primary sequence alone performs poorly as a structure classifier. We show that addition of either secondary structure information or local information from the structural sequence considerably improves the

  2. Sequence analysis of the AAA protein family.

    PubMed Central

    Beyer, A.

    1997-01-01

    The AAA protein family, a recently recognized group of Walker-type ATPases, has been subjected to an extensive sequence analysis. Multiple sequence alignments revealed the existence of a region of sequence similarity, the so-called AAA cassette. The borders of this cassette were localized and within it, three boxes of a high degree of conservation were identified. Two of these boxes could be assigned to substantial parts of the ATP binding site (namely, to Walker motifs A and B); the third may be a portion of the catalytic center. Phylogenetic trees were calculated to obtain insights into the evolutionary history of the family. Subfamilies with varying degrees of intra-relatedness could be discriminated; these relationships are also supported by analysis of sequences outside the canonical AAA boxes: within the cassette are regions that are strongly conserved within each subfamily, whereas little or even no similarity between different subfamilies can be observed. These regions are well suited to define fingerprints for subfamilies. A secondary structure prediction utilizing all available sequence information was performed and the result was fitted to the general 3D structure of a Walker A/GTPase. The agreement was unexpectedly high and strongly supports the conclusion that the AAA family belongs to the Walker superfamily of A/GTPases. PMID:9336829

  3. Sequence determinants of protein aggregation: tools to increase protein solubility

    PubMed Central

    Ventura, Salvador

    2005-01-01

    Escherichia coli is one of the most widely used hosts for the production of recombinant proteins. However, very often the target protein accumulates into insoluble aggregates in a misfolded and biologically inactive form. Bacterial inclusion bodies are major bottlenecks in protein production and are hampering the development of top priority research areas such as structural genomics. Inclusion body formation was formerly considered to occur via non-specific association of hydrophobic surfaces in folding intermediates. Increasing evidence, however, indicates that protein aggregation in bacteria resembles the well-studied process of amyloid fibril formation. Both processes appear to rely on the formation of specific, sequence-dependent, intermolecular interactions driving the formation of structured protein aggregates. This similarity in the mechanisms of aggregation will probably allow applying anti-aggregational strategies already tested in the amyloid context to the less explored area of protein aggregation inside bacteria. Specifically, new sequence-based approaches appear as promising tools to tune protein aggregation in biotechnological processes. PMID:15847694

  4. Detailed protein sequence alignment based on Spectral Similarity Score (SSS)

    PubMed Central

    Gupta, Kshitiz; Thomas, Dina; Vidya, SV; Venkatesh, KV; Ramakumar, S

    2005-01-01

    Background The chemical properties and biological function of a protein are a direct consequence of its primary structure. Several algorithms have been developed which determine alignment and similarity of primary protein sequences. However, character based similarity cannot provide insight into the structural aspects of a protein. We present a method based on spectral similarity to compare subsequences of amino acids that behave similarly but are not aligned well by considering amino acids as mere characters. This approach finds a similarity score between sequences based on any given attribute, like hydrophobicity of amino acids, on the basis of spectral information after partial conversion to the frequency domain. Results Distance matrices of various branches of the human kinome, that is the full complement of human kinases, were developed that matched the phylogenetic tree of the human kinome establishing the efficacy of the global alignment of the algorithm. PKCd and PKCe kinases share close biological properties and structural similarities but do not give high scores with character based alignments. Detailed comparison established close similarities between subsequences that do not have any significant character identity. We compared their known 3D structures to establish that the algorithm is able to pick subsequences that are not considered similar by character based matching algorithms but share structural similarities. Similarly many subsequences with low character identity were picked between xyna-theau and xyna-clotm F/10 xylanases. Comparison of 3D structures of the subsequences confirmed the claim of similarity in structure. Conclusion An algorithm is developed which is inspired by the successful application of spectral similarity to music sequences. The method captures subsequences that do not align by traditional character based alignment tools but give rise to similar secondary and tertiary structures. The Spectral Similarity Score (SSS) is an
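
    The general idea of scoring two subsequences by a residue attribute in the frequency domain can be illustrated with a toy sketch. This is not the SSS algorithm of the paper, whose details are not given in the abstract. The sketch assumes the Kyte-Doolittle hydropathy scale as the attribute, a plain discrete Fourier transform, and cosine similarity between magnitude spectra; all three choices and the function names are assumptions.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values (assumed attribute; the abstract only
# says "any given attribute, like hydrophobicity").
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def magnitude_spectrum(seq):
    """DFT magnitudes of the mean-centered hydropathy profile of seq."""
    x = [KYTE_DOOLITTLE[aa] for aa in seq]
    n = len(x)
    mean = sum(x) / n
    x = [v - mean for v in x]
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectral_similarity(a, b):
    """Cosine similarity (0..1) between the magnitude spectra of a and b."""
    if len(a) != len(b):
        raise ValueError("this sketch compares equal-length windows only")
    sa, sb = magnitude_spectrum(a), magnitude_spectrum(b)
    norm_a = math.sqrt(sum(p * p for p in sa))
    norm_b = math.sqrt(sum(q * q for q in sb))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # flat profile: no spectral information to compare
    return sum(p * q for p, q in zip(sa, sb)) / (norm_a * norm_b)


if __name__ == "__main__":
    # A circularly shifted window has no position-wise identity pattern in
    # common with the original, yet its magnitude spectrum is unchanged.
    print(spectral_similarity("LKELLKKA", "KELLKKAL"))
    print(spectral_similarity("LKELLKKA", "GSPNDGTW"))
```

    Because the magnitude spectrum ignores phase, two windows with the same periodic hydropathy pattern score highly even when a character-by-character comparison finds little identity, which is the kind of case the abstract highlights.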

  5. Benchmarking NMR experiments: A relational database of protein pulse sequences

    NASA Astrophysics Data System (ADS)

    Senthamarai, Russell R. P.; Kuprov, Ilya; Pervushin, Konstantin

    2010-03-01

    Systematic benchmarking of multi-dimensional protein NMR experiments is a critical prerequisite for optimal allocation of NMR resources for structural analysis of challenging proteins, e.g. large proteins with limited solubility or proteins prone to aggregation. We propose a set of benchmarking parameters for essential protein NMR experiments organized into a lightweight (single XML file) relational database (RDB), which includes all the necessary auxiliaries (waveforms, decoupling sequences, calibration tables, setup algorithms and an RDB management system). The database is interfaced to the Spinach library (http://spindynamics.org), which enables accurate simulation and benchmarking of NMR experiments on large spin systems. A key feature is the ability to use a single user-specified spin system to simulate the majority of deposited solution state NMR experiments, thus providing the (hitherto unavailable) unified framework for pulse sequence evaluation. This development enables predicting relative sensitivity of deposited implementations of NMR experiments, thus providing a basis for comparison, optimization and, eventually, automation of NMR analysis. The benchmarking is demonstrated with two proteins: the 170-amino-acid I domain of αXβ2 integrin and the 440-amino-acid NS3 helicase.

  6. Sequence Analysis of Scaffolding Protein CipC and ORFXp, a New Cohesin-Containing Protein in Clostridium cellulolyticum: Comparison of Various Cohesin Domains and Subcellular Localization of ORFXp

    PubMed Central

    Pagès, Sandrine; Bélaïch, Anne; Fierobe, Henri-Pierre; Tardif, Chantal; Gaudin, Christian; Bélaïch, Jean-Pierre

    1999-01-01

    The gene encoding the scaffolding protein of the cellulosome from Clostridium cellulolyticum, whose partial sequence was published earlier (S. Pagès, A. Bélaïch, C. Tardif, C. Reverbel-Leroy, C. Gaudin, and J.-P. Bélaïch, J. Bacteriol. 178:2279–2286, 1996; C. Reverbel-Leroy, A. Bélaïch, A. Bernadac, C. Gaudin, J. P. Bélaïch, and C. Tardif, Microbiology 142:1013–1023, 1996), was completely sequenced. The corresponding protein, CipC, is composed of a cellulose binding domain at the N terminus followed by one hydrophilic domain (HD1), seven highly homologous cohesin domains (cohesin domains 1 to 7), a second hydrophilic domain, and a final cohesin domain (cohesin domain 8) which is only 57 to 60% identical to the seven other cohesin domains. In addition, a second gene located 8.89 kb downstream of cipC was found to encode a three-domain protein, called ORFXp, which includes a cohesin domain. By using antiserum raised against the latter, it was observed that ORFXp is associated with the membrane of C. cellulolyticum and is not detected in the cellulosome fraction. Western blot and BIAcore experiments indicate that cohesin domains 1 and 8 from CipC recognize the same dockerins and have similar affinity for CelA (Ka = 4.8 × 10⁹ M⁻¹) whereas the cohesin from ORFXp, although it is also able to bind all cellulosome components containing a dockerin, has a 19-fold lower Ka for CelA (2.6 × 10⁸ M⁻¹). Taken together, these data suggest that ORFXp may play a role in cellulosome assembly. PMID:10074072

  7. Diverse nucleotide compositions and sequence fluctuation in Rubisco protein genes

    NASA Astrophysics Data System (ADS)

    Holden, Todd; Dehipawala, S.; Cheung, E.; Bienaime, R.; Ye, J.; Tremberger, G., Jr.; Schneider, P.; Lieberman, D.; Cheung, T.

    2011-10-01

    The Rubisco protein-enzyme is arguably the most abundant protein on Earth. The biology dogma of transcription and translation necessitates the study of the Rubisco genes and Rubisco-like genes in various species. Stronger correlation of fractal dimension of the atomic number fluctuation along a DNA sequence with Shannon entropy has been observed in the studied Rubisco-like gene sequences, suggesting a more diverse evolutionary pressure and constraints in the Rubisco sequences. The strategy of using metal for structural stabilization appears to be an ancient mechanism, with data from the porphobilinogen deaminase gene in Capsaspora owczarzaki and Monosiga brevicollis. Using the chi-square distance probability, our analysis supports the conjecture that the more ancient Rubisco-like sequence in Microcystis aeruginosa would have experienced very different evolutionary pressure and bio-chemical constraint as compared to Bordetella bronchiseptica, the two microbes occupying either end of the correlation graph. Our exploratory study would indicate that high fractal dimension Rubisco sequence would support high carbon dioxide rate via the Michaelis-Menten coefficient; with implication for the control of the whooping cough pathogen Bordetella bronchiseptica, a microbe containing a high fractal dimension Rubisco-like sequence (2.07). Using the internal comparison of chi-square distance probability for 16S rRNA (~ E-22) versus radiation repair Rec-A gene (~ E-05) in high GC content Deinococcus radiodurans, our analysis supports the conjecture that high GC content microbes containing Rubisco-like sequence are likely to include an extra-terrestrial origin, relative to Deinococcus radiodurans. Similar photosynthesis process that could utilize host star radiation would not compete with radiation resistant process from the biology dogma perspective in environments such as Mars and exoplanets.

  8. Giant panda ribosomal protein S14: cDNA, genomic sequence cloning, sequence analysis, and overexpression.

    PubMed

    Wu, G-F; Hou, Y-L; Hou, W-R; Song, Y; Zhang, T

    2010-10-13

    RPS14 is a component of the 40S ribosomal subunit encoded by the RPS14 gene and is required for its maturation. The cDNA and the genomic sequence of RPS14 were cloned successfully from the giant panda (Ailuropoda melanoleuca) using RT-PCR technology and touchdown-PCR, respectively; they were both sequenced and analyzed. The length of the cloned cDNA fragment was 492 bp; it contained an open-reading frame of 456 bp, encoding 151 amino acids. The length of the genomic sequence is 3421 bp; it contains four exons and three introns. Alignment analysis indicates that the nucleotide sequence shares a high degree of homology with those of Homo sapiens, Bos taurus, Mus musculus, Rattus norvegicus, Gallus gallus, Xenopus laevis, and Danio rerio (93.64, 83.37, 92.54, 91.89, 87.28, 84.21, and 84.87%, respectively). Comparison of the deduced amino acid sequences of the giant panda with those of these other species revealed that the RPS14 of giant panda is highly homologous with those of B. taurus, R. norvegicus and D. rerio (85.99, 99.34 and 99.34%, respectively), and is 100% identical with the others. This degree of conservation of RPS14 suggests evolutionary selection. Topology prediction shows that there are two N-glycosylation sites, three protein kinase C phosphorylation sites, two casein kinase II phosphorylation sites, four N-myristoylation sites, two amidation sites, and one ribosomal protein S11 signature in the RPS14 protein of the giant panda. The RPS14 gene can be readily expressed in Escherichia coli. When it was fused with the N-terminally His-tagged protein, it gave rise to accumulation of an expected 22-kDa polypeptide, in good agreement with the predicted molecular weight. The expression product obtained can be purified for studies of its function.
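    The figures in this record can be checked with a little arithmetic. The sketch below is illustrative only: the function names and the toy aligned strings are assumptions, not taken from the paper. It shows how a 456 bp open reading frame yields 151 residues once the stop codon is set aside, and how a percent identity is commonly computed over the gap-free columns of two aligned sequences.

```python
def orf_residue_count(orf_bp: int) -> int:
    """Residues encoded by an ORF whose length includes one stop codon."""
    if orf_bp % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    return orf_bp // 3 - 1


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of aligned, gap-free columns in which the two sequences match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


print(orf_residue_count(456))                     # 151
print(percent_identity("MAPRKGKE", "MAPRKGRE"))   # 87.5 (toy strings, 7 of 8 match)
```

    Identity values depend on how gapped columns are treated; this sketch ignores them, which is only one of several conventions and may differ from the one used by the authors.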

  9. Integrated visual analysis of protein structures, sequences, and feature data

    PubMed Central

    2015-01-01

    Background To understand the molecular mechanisms that give rise to a protein's function, biologists often need to (i) find and access all related atomic-resolution 3D structures, and (ii) map sequence-based features (e.g., domains, single-nucleotide polymorphisms, post-translational modifications) onto these structures. Results To streamline these processes we recently developed Aquaria, a resource offering unprecedented access to protein structure information based on an all-against-all comparison of SwissProt and PDB sequences. In this work, we provide a requirements analysis for several frequently occurring tasks in molecular biology and describe how design choices in Aquaria meet these requirements. Finally, we show how the interface can be used to explore features of a protein and gain biologically meaningful insights in two case studies conducted by domain experts. Conclusions The user interface design of Aquaria enables biologists to gain unprecedented access to molecular structures and simplifies the generation of insight. The tasks involved in mapping sequence features onto structures can be conducted more easily and quickly using Aquaria. PMID:26329268

  10. DNA Sequencing Using an Engineered Protein Nanopore

    NASA Astrophysics Data System (ADS)

    Gundlach, Jens H.

    2010-03-01

    Inexpensive and fast sequencing of DNA is of paramount importance to medicine, the life sciences and to many other applications. Because of the nanometer diameter of DNA a nanometer-scale reader directly interfaced to macroscopic observables seems particularly attractive. We are working on a new single molecule technique based on a biological pore embedded in a lipid bilayer. When a voltage is applied across the bilayer an ion current is measured that flows through the nanometer opening of the pore. Poly-negatively charged single stranded DNA passes through the pore and reduces the ion current with the remaining ion current being indicative of the nucleotide type in the constriction of the pore. The protein pore that we introduced to the field, MspA, has a shape ideally suited to nanopore sequencing, has robustness comparable to solid state devices, is easily reproduced with sub-nanometer level precision and is engineerable using genetic mutations. I will present proof-of-principle data showing that this technique can lead to a direct very inexpensive and fast sequencing technology. The experimental electronic signatures of the DNA translocation process provide an ideal test bed for molecular dynamics simulations, which in turn allows developing intuition and prediction of nanoscale dynamics.

  11. Comparison of Next-Generation Sequencing Systems

    PubMed Central

    Liu, Lin; Li, Yinhu; Li, Siliang; Hu, Ni; He, Yimin; Pong, Ray; Lin, Danni; Lu, Lihua; Law, Maggie

    2012-01-01

    With the fast development and wide application of next-generation sequencing (NGS) technologies, genomic sequence information is within reach to aid the achievement of goals to decode life mysteries, make better crops, detect pathogens, and improve quality of life. NGS systems are typically represented by SOLiD/Ion Torrent PGM from Life Sciences, Genome Analyzer/HiSeq 2000/MiSeq from Illumina, and GS FLX Titanium/GS Junior from Roche. Beijing Genomics Institute (BGI), which possesses the world's biggest sequencing capacity, has multiple NGS systems including 137 HiSeq 2000, 27 SOLiD, one Ion Torrent PGM, one MiSeq, and one 454 sequencer. We have accumulated extensive experience in sample handling, sequencing, and bioinformatics analysis. In this paper, technologies of these systems are reviewed, and first-hand data from extensive experience is summarized and analyzed to discuss the advantages and specifics associated with each sequencing system. Finally, applications of NGS are summarized. PMID:22829749

  12. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing

    PubMed Central

    Song, Kai; Ren, Jie; Reinert, Gesine; Deng, Minghua

    2014-01-01

    With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data. PMID:24064230
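    As a concrete illustration of a word-frequency dissimilarity, the sketch below computes the classical D2 statistic (the inner product of two k-word count vectors) and a cosine-style normalised form of it. It is a minimal example under stated assumptions: the function names are hypothetical, the inputs are plain strings, and no background model or read-level handling from the review is included.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """D2 statistic: inner product of the two k-word count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Counter returns 0 for words absent from y, so only shared words contribute.
    return sum(n * cy[w] for w, n in cx.items())


def d2_dissimilarity(x: str, y: str, k: int) -> float:
    """Normalised form 0.5 * (1 - D2 / (|X| * |Y|)); ranges from 0 to 0.5."""
    norm = sqrt(d2(x, x, k)) * sqrt(d2(y, y, k))
    if norm == 0:
        return 0.5  # at least one sequence is shorter than k
    return 0.5 * (1.0 - d2(x, y, k) / norm)


print(d2("ACGTACGT", "ACGT", 2))             # 6
print(d2_dissimilarity("AAAA", "CCCC", 2))   # 0.5 (no shared words)
```

    Raw counts are sensitive to sequence composition, which is why variants that centre the counts on a background model are usually preferred for real data; the uncentred form is shown here only because it is the simplest to follow.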

  13. Molecular Cloning and Sequence Analysis of the Sta58 Major Antigen Gene of Rickettsia tsutsugamushi: Sequence homology and Antigenic Comparison of Sta58 to the 60-Kilodalton Family of Stress Proteins

    DTIC Science & Technology

    1990-05-01

    ...chemistry (2). Sequencing-gel electrophoresis was performed on 6% polyacrylamide-7 M urea ... Bacterial strains, media, and passage and preparation of R. ...

  14. Sequence-Based Prediction of Type III Secreted Proteins

    PubMed Central

    Arnold, Roland; Brandmaier, Stefan; Kleine, Frederick; Tischler, Patrick; Heinz, Eva; Behrens, Sebastian; Niinikoski, Antti; Mewes, Hans-Werner; Horn, Matthias; Rattei, Thomas

    2009-01-01

    The type III secretion system (TTSS) is a key mechanism for host cell interaction used by a variety of bacterial pathogens and symbionts of plants and animals including humans. The TTSS represents a molecular syringe with which the bacteria deliver effector proteins directly into the host cell cytosol. Despite the importance of the TTSS for bacterial pathogenesis, recognition and targeting of type III secreted proteins has up until now been poorly understood. Several hypotheses are discussed, including an mRNA-based signal, a chaperon-mediated process, or an N-terminal signal peptide. In this study, we systematically analyzed the amino acid composition and secondary structure of N-termini of 100 experimentally verified effector proteins. Based on this, we developed a machine-learning approach for the prediction of TTSS effector proteins, taking into account N-terminal sequence features such as frequencies of amino acids, short peptides, or residues with certain physico-chemical properties. The resulting computational model revealed a strong type III secretion signal in the N-terminus that can be used to detect effectors with sensitivity of ∼71% and selectivity of ∼85%. This signal seems to be taxonomically universal and conserved among animal pathogens and plant symbionts, since we could successfully detect effector proteins if the respective group was excluded from training. The application of our prediction approach to 739 complete bacterial and archaeal genome sequences resulted in the identification of between 0% and 12% putative TTSS effector proteins. Comparison of effector proteins with orthologs that are not secreted by the TTSS showed no clear pattern of signal acquisition by fusion, suggesting convergent evolutionary processes shaping the type III secretion signal. The newly developed program EffectiveT3 (http://www.chlamydiaedb.org) is the first universal in silico prediction program for the identification of novel TTSS effectors. Our findings will

  15. Parallel Computation of Multiple Biological Sequence Comparisons

    DTIC Science & Technology

    1989-07-01

    Stearothermophilus 408 Bacillus Megaterium 411 Bacillus Brevis 354 Pseudomonas Fluorescens 375 Salmonella Typhi 377 Escherichia Coli 282 Saccharomyces Octosporus...This included implied secondary structure and conservation of pairs of nucleotides that are complementary. The first four sequences are all Bacillus ...need to obtain sequences of ribonuclease P RNA from additional species to provide a more 13 Length Name 401 Bacillus Subtilis 417 Bacillus

  16. Algorithm, applications and evaluation for protein comparison by Ramanujan Fourier transform.

    PubMed

    Zhao, Jian; Wang, Jiasong; Hua, Wei; Ouyang, Pingkai

    2015-12-01

    The amino acid sequence of a protein determines its chemical properties, chain conformation and biological functions. Protein sequence comparison is of great importance to identify similarities of protein structures and infer their functions. Many properties of a protein correspond to the low-frequency signals within the sequence. Low frequency modes in protein sequences are linked to the secondary structures, membrane protein types, and sub-cellular localizations of the proteins. In this paper, we present Ramanujan Fourier transform (RFT) with a fast algorithm to analyze the low-frequency signals of protein sequences. The RFT method is applied to similarity analysis of protein sequences with the Resonant Recognition Model (RRM). The results show that the proposed fast RFT method on protein comparison is more efficient than the commonly used discrete Fourier transform (DFT). RFT can detect common frequencies as significant features for specific protein families, and the RFT spectrum heat-map of protein sequences demonstrates the information conservation in the sequence comparison. The proposed method offers a new tool for pattern recognition, feature extraction and structural analysis on protein sequences.
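    To make the transform concrete, the sketch below evaluates Ramanujan sums and a finite RFT directly from their definitions. It is not the fast algorithm the paper proposes: the function names are hypothetical, the input is an arbitrary numeric signal (the mapping from residues to numbers used by the RRM is not reproduced), and the coefficient for each q is taken as the sum of x(n) c_q(n) over n = 1..N divided by φ(q) N.

```python
from math import cos, gcd, pi


def ramanujan_sum(q: int, n: int) -> int:
    """c_q(n): sum of cos(2*pi*k*n/q) over 1 <= k <= q with gcd(k, q) == 1."""
    total = sum(cos(2 * pi * k * n / q) for k in range(1, q + 1) if gcd(k, q) == 1)
    return round(total)  # the exact value is always an integer


def totient(q: int) -> int:
    """Euler's phi: how many k in 1..q are coprime to q."""
    return sum(1 for k in range(1, q + 1) if gcd(k, q) == 1)


def rft(signal: list[float], max_q: int) -> list[float]:
    """Finite RFT coefficients for q = 1..max_q, with samples indexed n = 1..N."""
    n_samples = len(signal)
    if n_samples == 0:
        raise ValueError("signal must not be empty")
    return [
        sum(x * ramanujan_sum(q, n) for n, x in enumerate(signal, start=1))
        / (totient(q) * n_samples)
        for q in range(1, max_q + 1)
    ]


# A signal with period 3 puts all of its weight on q = 3.
print(rft([-1, -1, 2, -1, -1, 2], max_q=3))   # [0.0, 0.0, 1.0]
```

    Because Ramanujan sums are integer-valued, the transform needs no complex arithmetic; this direct evaluation is quadratic in cost and is meant only to show what the coefficients measure.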

  17. Database Independent Protein Sequencing (DiPS) enables full-length de-novo protein and antibody sequence determination.

    PubMed

    Savidor, Alon; Barzilay, Rotem; Elinger, Dalia; Yarden, Yosef; Lindzen, Moshit; Gabashvili, Alexandra; Adiv Tal, Ophir; Levin, Yishai

    2017-03-27

    Traditional 'bottom-up' proteomics approaches use proteolytic digestion, LC-MS/MS and database searching to elucidate peptide identities and their parent proteins. Protein sequences absent from the database cannot be identified, and even if present in the database, complete sequence coverage is rarely achieved even for the most abundant proteins in the sample. Thus, sequencing of unknown proteins such as antibodies or constituents of metaproteomes remains a challenging problem. To date, there is no available method for full-length protein sequencing, independent of a reference database, in high throughput. Here we present Database Independent Protein Sequencing (DiPS), a method for unambiguous, rapid, database independent, full-length protein sequencing. The method is a novel combination of non-enzymatic, semi-random cleavage of the protein, LC-MS/MS analysis, peptide de novo sequencing, extraction of peptide tags, and their assembly into a consensus sequence using an algorithm named "Peptide Tag Assembler" (pTA). As proof-of-concept, the method was applied to samples of three known proteins representing three size classes and to a previously un-sequenced, clinically relevant, monoclonal antibody. Excluding leucine/isoleucine and glutamic-acid/deamidated glutamine ambiguities, end-to-end, full-length de novo sequencing was achieved with 99-100% accuracy for all benchmarking proteins and the antibody light chain. Accuracy of the sequenced antibody heavy chain, including the entire variable region, was also 100% but there was a 23 residue gap in the constant region sequence.

  18. Alignment-free protein interaction network comparison

    PubMed Central

    Ali, Waqar; Rito, Tiago; Reinert, Gesine; Sun, Fengzhu; Deane, Charlotte M.

    2014-01-01

    Motivation: Biological network comparison software largely relies on the concept of alignment where close matches between the nodes of two or more networks are sought. These node matches are based on sequence similarity and/or interaction patterns. However, because of the incomplete and error-prone datasets currently available, such methods have had limited success. Moreover, the results of network alignment are in general not amenable for distance-based evolutionary analysis of sets of networks. In this article, we describe Netdis, a topology-based distance measure between networks, which offers the possibility of network phylogeny reconstruction. Results: We first demonstrate that Netdis is able to correctly separate different random graph model types independent of network size and density. The biological applicability of the method is then shown by its ability to build the correct phylogenetic tree of species based solely on the topology of current protein interaction networks. Our results provide new evidence that the topology of protein interaction networks contains information about evolutionary processes, despite the lack of conservation of individual interactions. As Netdis is applicable to all networks because of its speed and simplicity, we apply it to a large collection of biological and non-biological networks where it clusters diverse networks by type. Availability and implementation: The source code of the program is freely available at http://www.stats.ox.ac.uk/research/proteins/resources. Contact: w.ali@stats.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25161230

  19. Comparison of protein active site structures for functional annotation of proteins and drug design.

    PubMed

    Powers, Robert; Copeland, Jennifer C; Germer, Katherine; Mercier, Kelly A; Ramanathan, Viswanathan; Revesz, Peter

    2006-10-01

    Rapid and accurate functional assignment of novel proteins is increasing in importance, given the completion of numerous genome sequencing projects and the vastly expanding list of unannotated proteins. Traditionally, global primary-sequence and structure comparisons have been used to determine putative function. These approaches, however, do not emphasize similarities in active site configurations that are fundamental to a protein's activity and highly conserved relative to the global and more variable structural features. The Comparison of Protein Active Site Structures (CPASS) database and software enable the comparison of experimentally identified ligand-binding sites to infer biological function and aid in drug discovery. The CPASS database comprises the ligand-defined active sites identified in the Protein Data Bank, and the CPASS program compares these ligand-defined active sites to determine sequence and structural similarity without maintaining sequence connectivity. CPASS will compare any set of ligand-defined protein active sites, irrespective of the identity of the bound ligand. Proteins 2006. (c) 2006 Wiley-Liss, Inc.

  20. PDNAsite: Identification of DNA-binding Site from Protein Sequence by Incorporating Spatial and Sequence Context

    PubMed Central

    Zhou, Jiyun; Xu, Ruifeng; He, Yulan; Lu, Qin; Wang, Hongpeng; Kong, Bing

    2016-01-01

    Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most of the existing computational approaches employed only the sequence context of the target residue for its prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove the redundancies in the feature space. Finally, a predictor (PDNAsite) was developed through the integration of the support vector machines (SVM) classifier and ensemble learning. Results on the PDNA-62 and the PDNA-224 datasets demonstrate that features extracted from spatial context provide more information than those from sequence context and that combining the two gives a further performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that the interactions between binding sites next to each other are important for protein-DNA recognition and their binding ability. The comparison between our proposed PDNAsite method and the existing methods indicates that PDNAsite outperforms most of the existing methods and is a useful tool for DNA-binding site identification. A web-server of our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is freely available to the biological research community. PMID:27282833

  1. Sequence specific binding of chlamydial histone H1-like protein.

    PubMed Central

    Kaul, R; Allen, M; Bradbury, E M; Wenman, W M

    1996-01-01

    Chlamydia trachomatis is one of the few prokaryotic organisms known to contain proteins that bear homology to eukaryotic histone H1. Changes in macromolecular conformation of DNA mediated by the histone H1-like protein (Hc1) appear to regulate stage specific differentiation. We have developed a cross-linking immunoprecipitation protocol to examine in vivo protein-DNA interaction by immune precipitating chlamydial Hc1 cross linked to DNA. Our results strongly support the presence of sequence specific binding sites on the chlamydial plasmid and hc1 gene upstream of its open reading frame. The preferential binding sites were mapped to 520 bp BamHI-XhoI and 547 bp BamHI-DraI DNA fragments on the plasmid and hc1 respectively. Comparison of these two DNA sequences using Bestfit program has identified a 24 bp region with >75% identity that is unique to the chlamydial genome. Double-stranded DNA prepared by annealing complementary oligonucleotides corresponding to the conserved 24 bp region bind Hc1, in contrast to control sequences with similar A+T ratios. Further, Hc1 binds to DNA in a strand specific fashion, with preferential binding for only one strand. The site specific affinity to plasmid DNA was also demonstrated by atomic force microscopy data images. Binding was always followed by coiling, shrinking and aggregation of the affected DNA. Very low protein-DNA ratio was required if incubations were carried out in solution. However, if DNA was partially immobilized on mica substrate individual strands with dark foci were still visible even after the addition of excess Hc1. PMID:8760883

  2. Miraculous catch of iron-sulfur protein sequences in the Sargasso Sea.

    PubMed

    Meyer, Jacques

    2004-07-16

    Recent shotgun sequencing of filtered Sargasso Sea water samples has yielded data in astounding amount and diversity. Iron-sulfur proteins, which are ancient, diverse and ubiquitous, have been implemented here to further probe the sequence diversity of the Sargasso Sea database (SSDB). Sequence searches and comparisons confirm that the SSDB by and large equals in diversity the combined currently available databases. The data thus suggest that microbial diversity has so far been underestimated by orders of magnitude.

  3. Proteins: sequence to structure and function--current status.

    PubMed

    Shenoy, Sandhya R; Jayaram, B

    2010-11-01

    In an era that has been dominated by Structural Biology for the last 30-40 years, a dramatic change of focus towards sequence analysis has spurred the advent of the genome projects and the resultant diverging sequence/structure deficit. The central challenge of Computational Structural Biology is therefore to rationalize the mass of sequence information into biochemical and biophysical knowledge and to decipher the structural, functional and evolutionary clues encoded in the language of biological sequences. In investigating the meaning of sequences, two distinct analytical themes have emerged: in the first approach, pattern recognition techniques are used to detect similarity between sequences and hence to infer related structures and functions; in the second ab initio prediction methods are used to deduce 3D structure, and ultimately to infer function, directly from the linear sequence. In this article, we attempt to provide a critical assessment of what one may and may not expect from the biological sequences and to identify major issues yet to be resolved. The presentation is organized under several subtitles like protein sequences, pattern recognition techniques, protein tertiary structure prediction, membrane protein bioinformatics, human proteome, protein-protein interactions, metabolic networks, potential drug targets based on simple sequence properties, disordered proteins, the sequence-structure relationship and chemical logic of protein sequences.

  4. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    SciTech Connect

    Leung, Elo; Huang, Amy; Cadag, Eithon; Montana, Aldrin; Soliman, Jan Lorenz; Zhou, Carol L. Ecale

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein FASTA data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta-server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.

  5. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE PAGES

    Leung, Elo; Huang, Amy; Cadag, Eithon; ...

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein FASTA data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta-server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.

  6. The bioinformatics of nucleotide sequence coding for proteins requiring metal coenzymes and proteins embedded with metals

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Cheung, E.; Holden, T.; Sullivan, R.; Nguyen, A.; Lieberman, D.; Cheung, T.

    2015-09-01

    All metallo-proteins need post-translation metal incorporation. In fact, the isotope ratios of Fe, Cu, and Zn in physiology and oncology have emerged as an important tool. The nickel containing F430 is the prosthetic group of the enzyme methyl coenzyme M reductase which catalyzes the release of methane in the final step of methanogenesis, a prime energy metabolism candidate for life exploration space mission in the solar system. The 3.5 Gyr early life sulfite reductase as a life switch energy metabolism had Fe-Mo clusters. The nitrogenase for nitrogen fixation 3 billion years ago had Mo. The early life arsenite oxidase needed for anoxygenic photosynthesis energy metabolism 2.8 billion years ago had Mo and Fe. The selection pressure in metal incorporation inside a protein would be quantifiable in terms of the related nucleotide sequence complexity with fractal dimension and entropy values. A simulation model showed that the studied metal-required energy metabolism sequences had at least ten times more selection pressure relatively in comparison to the horizontal transferred sequences in Mealybug, guided by the outcome histogram of the correlation R-sq values. The metal energy metabolism sequence group was compared to the circadian clock KaiC sequence group using magnesium atomic level bond shifting mechanism in the protein, and the simulation model would suggest a much higher selection pressure for the energy life switch sequence group. The possibility of using Kepler 444 as an example of ancient life in the Galaxy with the associated exoplanets has been proposed and is further discussed in this report. Examples of arsenic metal bonding shift probed by Synchrotron-based X-ray spectroscopy data and Zn controlled FOXP2 regulated pathways in human and chimp brain studied tissue samples are studied in relationship to the sequence bioinformatics. The analysis results suggest that relatively large metal bonding shift amount is associated with low probability correlation R

  7. Sequence space and the ongoing expansion of the protein universe.

    PubMed

    Povolotskaya, Inna S; Kondrashov, Fyodor A

    2010-06-17

    The need to maintain the structural and functional integrity of an evolving protein severely restricts the repertoire of acceptable amino-acid substitutions. However, it is not known whether these restrictions impose a global limit on how far homologous protein sequences can diverge from each other. Here we explore the limits of protein evolution using sequence divergence data. We formulate a computational approach to study the rate of divergence of distant protein sequences and measure this rate for ancient proteins, those that were present in the last universal common ancestor. We show that ancient proteins are still diverging from each other, indicating an ongoing expansion of the protein sequence universe. The slow rate of this divergence is imposed by the sparseness of functional protein sequences in sequence space and the ruggedness of the protein fitness landscape: approximately 98 per cent of sites cannot accept an amino-acid substitution at any given moment but a vast majority of all sites may eventually be permitted to evolve when other, compensatory, changes occur. Thus, approximately 3.5 × 10^9 yr has not been enough to reach the limit of divergent evolution of proteins, and for most proteins the limit of sequence similarity imposed by common function may not exceed that of random sequences.

  8. Sequence diversity of the Trypanosoma cruzi complement regulatory protein family.

    PubMed

    Beucher, M; Norris, K A

    2008-02-01

    As a central component of innate immunity, complement activation is a critical mechanism of containment and clearance of microbial pathogens in advance of the development of acquired immunity. Several pathogens restrict complement activation through the acquisition of host proteins that regulate complement activation or through the production of their own complement regulatory molecules (M. K. Liszewski, M. K. Leung, R. Hauhart, R. M. Buller, P. Bertram, X. Wang, A. M. Rosengard, G. J. Kotwal, and J. P. Atkinson, J. Immunol. 176:3725-3734, 2006; J. Lubinski, L. Wang, D. Mastellos, A. Sahu, J. D. Lambris, and H. M. Friedman, J. Exp. Med. 190:1637-1646, 1999). The infectious stage of the protozoan parasite Trypanosoma cruzi produces a surface-anchored complement regulatory protein (CRP) that functions to inhibit alternative and classical pathway complement activation (K. A. Norris, B. Bradt, N. R. Cooper, and M. So, J. Immunol. 147:2240-2247, 1991). This study addresses the genomic complexity of the T. cruzi CRP and its relationship to the T. cruzi supergene family comprising active trans-sialidase (TS) and TS-like proteins. The TS superfamily consists of several functionally distinct subfamilies that share a characteristic sialidase domain at their amino termini. These TS families include active TS, adhesions, CRPs, and proteins of unknown functions (G. A. Cross and G. B. Takle, Annu. Rev. Microbiol. 47:385-411, 1993). A sequence comparison search of GenBank using BLASTP revealed several full-length paralogs of CRP. These proteins share significant homology at their amino termini and a strong spatial conservation of cysteine residues. Alternative pathway complement regulation was confirmed for CRP paralogs with 58% (low) and 83% (high) identity to AAB49414. CRPs are functionally similar to the microbial and mammalian proteins that regulate complement activation. Sequence alignment of mammalian complement control proteins to CRP showed that these sequences are

  9. UFO: a web server for ultra-fast functional profiling of whole genome protein sequences.

    PubMed

    Meinicke, Peter

    2009-09-02

    Functional profiling is a key technique to characterize and compare the functional potential of entire genomes. The estimation of profiles according to an assignment of sequences to functional categories is a computationally expensive task because it requires the comparison of all protein sequences from a genome with a usually large database of annotated sequences or sequence families. Based on machine learning techniques for Pfam domain detection, the UFO web server for ultra-fast functional profiling allows researchers to process large protein sequence collections instantaneously. Besides the frequencies of Pfam and GO categories, the user also obtains the sequence specific assignments to Pfam domain families. In addition, a comparison with existing genomes provides dissimilarity scores with respect to 821 reference proteomes. Considering the underlying UFO domain detection, the results on 206 test genomes indicate a high sensitivity of the approach. In comparison with current state-of-the-art HMMs, the runtime measurements show a considerable speed up in the range of four orders of magnitude. For an average size prokaryotic genome, the computation of a functional profile together with its comparison typically requires about 10 seconds of processing time. For the first time the UFO web server makes it possible to get a quick overview on the functional inventory of newly sequenced organisms. The genome scale comparison with a large number of precomputed profiles allows a first guess about functionally related organisms. The service is freely available and does not require user registration or specification of a valid email address.

  10. UFO: a web server for ultra-fast functional profiling of whole genome protein sequences

    PubMed Central

    Meinicke, Peter

    2009-01-01

    Background Functional profiling is a key technique to characterize and compare the functional potential of entire genomes. The estimation of profiles according to an assignment of sequences to functional categories is a computationally expensive task because it requires the comparison of all protein sequences from a genome with a usually large database of annotated sequences or sequence families. Description Based on machine learning techniques for Pfam domain detection, the UFO web server for ultra-fast functional profiling allows researchers to process large protein sequence collections instantaneously. Besides the frequencies of Pfam and GO categories, the user also obtains the sequence specific assignments to Pfam domain families. In addition, a comparison with existing genomes provides dissimilarity scores with respect to 821 reference proteomes. Considering the underlying UFO domain detection, the results on 206 test genomes indicate a high sensitivity of the approach. In comparison with current state-of-the-art HMMs, the runtime measurements show a considerable speed up in the range of four orders of magnitude. For an average size prokaryotic genome, the computation of a functional profile together with its comparison typically requires about 10 seconds of processing time. Conclusion For the first time the UFO web server makes it possible to get a quick overview on the functional inventory of newly sequenced organisms. The genome scale comparison with a large number of precomputed profiles allows a first guess about functionally related organisms. The service is freely available and does not require user registration or specification of a valid email address. PMID:19725959

  11. Orpinomyces cellulase celf protein and coding sequences

    DOEpatents

    Li, Xin-Liang; Chen, Huizhong; Ljungdahl, Lars G.

    2000-09-05

    A cDNA (1,520 bp), designated celF, consisting of an open reading frame (ORF) encoding a polypeptide (CelF) of 432 amino acids was isolated from a cDNA library of the anaerobic rumen fungus Orpinomyces PC-2 constructed in Escherichia coli. Analysis of the deduced amino acid sequence showed that, starting from the N-terminus, CelF consists of a signal peptide, a cellulose binding domain (CBD) followed by an extremely Asn-rich linker region which separates the CBD and the catalytic domain. The latter is located at the C-terminus. The catalytic domain of CelF is highly homologous to CelA and CelC of Orpinomyces PC-2, to CelA of Neocallimastix patriciarum and also to cellobiohydrolase IIs (CBHIIs) from aerobic fungi. However, like CelA of Neocallimastix patriciarum, CelF does not have the noncatalytic repeated peptide domain (NCRPD) found in CelA and CelC from the same organism. The recombinant protein CelF hydrolyzes cellooligosaccharides in the pattern of CBHII, yielding only cellobiose as product with cellotetraose as the substrate. The genomic celF is interrupted by a 111 bp intron, located within the region coding for the CBD. The intron of celF has features in common with genes from aerobic filamentous fungi.

  12. Cascaded walks in protein sequence space: use of artificial sequences in remote homology detection between natural proteins.

    PubMed

    Sandhya, S; Mudgal, R; Jayadev, C; Abhinandan, K R; Sowdhamini, R; Srinivasan, N

    2012-08-01

    Over the past two decades, many ingenious efforts have been made in protein remote homology detection. Because homologous proteins often diversify extensively in sequence, it is challenging to demonstrate such relatedness through entirely sequence-driven searches. Here, we describe a computational method for the generation of 'protein-like' sequences that serves to bridge gaps in protein sequence space. Sequence profile information, as embodied in a position-specific scoring matrix of multiply aligned sequences of bona fide family members, serves as the starting point in this algorithm. The observed amino acid propensity and the selection of a random number dictate the selection of a residue for each position in the sequence. In a systematic manner, and by applying a 'roulette-wheel' selection approach at each position, we generate parent family-like sequences and thus facilitate an enlargement of sequence space around the family. When generated for a large number of families, we demonstrate that they expand the utility of natural intermediately related sequences in linking distant proteins. In 91% of the assessed examples, inclusion of designed sequences improved fold coverage by 5-10% over searches made in their absence. Furthermore, with several examples from proteins adopting folds such as TIM, globin, lipocalin and others, we demonstrate that the success of including designed sequences in a database positively sensitized methods such as PSI-BLAST and Cascade PSI-BLAST and is a promising opportunity for enormously improved remote homology recognition using sequence information alone.
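
The 'roulette-wheel' step described in this abstract can be illustrated with a minimal sketch. It assumes a toy profile in which each position is a dictionary of residue propensities; the names `roulette_pick`, `generate_sequence` and `profile` are hypothetical and are not taken from the paper, whose actual scoring-matrix handling may differ.

```python
import random


def roulette_pick(propensities, rng):
    """Pick one residue with probability proportional to its propensity."""
    total = sum(propensities.values())
    threshold = rng.random() * total  # the "random number" spins the wheel
    running = 0.0
    for residue, weight in propensities.items():
        running += weight
        if threshold < running:
            return residue
    return residue  # guard against floating-point round-off at the top end


def generate_sequence(profile, rng):
    """Build one family-like sequence, one roulette-wheel draw per position."""
    return "".join(roulette_pick(column, rng) for column in profile)


# Hypothetical three-position profile (propensities per position).
profile = [
    {"A": 0.7, "G": 0.3},
    {"L": 1.0},
    {"K": 0.5, "R": 0.5},
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
designed = [generate_sequence(profile, rng) for _ in range(5)]
```

Every generated sequence respects the profile: a position with a single allowed residue is always that residue, while variable positions are sampled in proportion to their propensities.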

  13. Cloning, sequencing and expression of the transferrin-binding protein 1 gene from Actinobacillus pleuropneumoniae.

    PubMed Central

    Daban, M; Medrano, A; Querol, E

    1996-01-01

    Two outer-membrane proteins are involved in the uptake of iron from transferrin by certain Gram-negative bacteria, transferrin-binding proteins 1 and 2. The gene encoding transferrin-binding protein 1 from a serotype 1 isolate of the Gram-negative pathogen Actinobacillus pleuropneumoniae was cloned, and a fragment encoding 700 amino acids of Tbp1 was expressed in Escherichia coli. We also report here sequencing of the tbp1 gene and a comparison of the deduced amino acid sequence with Tbp1s from related species. The predicted polypeptide product of tbp1 is a 106 kDa protein with a 22-residue signal peptide. PMID:8670116

  14. Folding and Stabilization of Native-Sequence-Reversed Proteins

    PubMed Central

    Zhang, Yuanzhao; Weber, Jeffrey K; Zhou, Ruhong

    2016-01-01

    Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols. PMID:27113844

  15. Folding and Stabilization of Native-Sequence-Reversed Proteins

    NASA Astrophysics Data System (ADS)

    Zhang, Yuanzhao; Weber, Jeffrey K.; Zhou, Ruhong

    2016-04-01

    Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols.

  16. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins

    SciTech Connect

    Shen, Yufeng; Tolic, Nikola; Hixson, Kim K.; Purvine, Samuel O.; Anderson, Gordon A.; Smith, Richard D.

    2008-10-15

    De novo sequencing holds promise for discovering protein post-translational modifications; however, the approach is still in its infancy and is not widely applied in proteomics practice because of its limited reliability. In this work, we describe a de novo sequencing approach for discovery of protein modifications through identification of the UStags (Anal. Chem. 2008, 80, 1871-1882). The de novo information was obtained from Fourier-transform tandem mass spectrometry for peptides and polypeptides in a yeast lysate, and the de novo sequences obtained were filtered to define a more limited set of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags’ prefix and suffix sequences and the UStags themselves) were used to infer the possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances of yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. Random matching of the de novo sequences to the predicted sequences was examined with the use of two random (false) databases, and ~3% false discovery rates were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity are described. The de novo-UStag approach complements the UStag method previously reported by enabling discovery of new protein modifications.

  17. Genomic Sequence Comparisons, 1987-2003 Final Report

    SciTech Connect

    George M. Church

    2004-07-29

    This project was to develop new DNA sequencing and RNA and protein quantitation methods and related genome annotation tools. The project began in 1987 with the development of multiplex sequencing (published in Science in 1988), and one of the first automated sequencing methods. This led to the first commercial genome sequence in 1994 and to the establishment of the main commercial participants (GTC then Agencourt) in the public DOE/NIH genome project. In collaboration with GTC we contributed to one of the first complete DOE genome sequences, in 1997, that of Methanobacterium thermoautotrophicum, a species of great relevance to energy-rich gas production.

  18. MSACompro: improving multiple protein sequence alignment by predicted structural features.

    PubMed

    Deng, Xin; Cheng, Jianlin

    2014-01-01

    Multiple Sequence Alignment (MSA) is an essential tool in protein structure modeling, gene and protein function prediction, DNA motif recognition, phylogenetic analysis, and many other bioinformatics tasks. Therefore, improving the accuracy of multiple sequence alignment is an important long-term objective in bioinformatics. We designed and developed a new method MSACompro to incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. Different from the multiple sequence alignment methods that use the tertiary structure information of some sequences, our method uses the structural information purely predicted from sequences. In this chapter, we first introduce some background and related techniques in the field of multiple sequence alignment. Then, we describe the detailed algorithm of MSACompro. Finally, we show that integrating predicted protein structural information improved the multiple sequence alignment accuracy.

  19. MESSA: MEta-Server for protein Sequence Analysis

    PubMed Central

    2012-01-01

    Background Computational sequence analysis, that is, prediction of local sequence properties, homologs, spatial structure and function from the sequence of a protein, offers an efficient way to obtain needed information about proteins under study. Since reliable prediction is usually based on the consensus of many computer programs, meta-servers have been developed to fit such needs. Most meta-servers focus on one aspect of sequence analysis, while others incorporate more information, such as PredictProtein for local sequence feature predictions, SMART for domain architecture and sequence motif annotation, and GeneSilico for secondary and spatial structure prediction. However, as predictions of local sequence properties, three-dimensional structure and function are usually intertwined, it is beneficial to address them together. Results We developed a MEta-Server for protein Sequence Analysis (MESSA) to facilitate comprehensive protein sequence analysis and gather structural and functional predictions for a protein of interest. For an input sequence, the server exploits a number of select tools to predict local sequence properties, such as secondary structure, structurally disordered regions, coiled coils, signal peptides and transmembrane helices; detect homologous proteins and assign the query to a protein family; identify three-dimensional structure templates and generate structure models; and provide predictive statements about the protein's function, including functional annotations, Gene Ontology terms, enzyme classification and possible functionally associated proteins. We tested MESSA on the proteome of Candidatus Liberibacter asiaticus. Manual curation shows that three-dimensional structure models generated by MESSA covered around 75% of all the residues in this proteome and the function of 80% of all proteins could be predicted. Availability MESSA is free for non-commercial use at http://prodata.swmed.edu/MESSA/ PMID:23031578

  20. Comparison of 61 Sequenced Escherichia coli Genomes

    PubMed Central

    Lukjancenko, Oksana; Wassenaar, Trudy M.

    2010-01-01

    Escherichia coli is an important component of the biosphere and is an ideal model for studies of processes involved in bacterial genome evolution. Sixty-one publicly available E. coli and Shigella spp. sequenced genomes are compared, using basic methods to produce phylogenetic and proteomics trees, and to identify the pan- and core genomes of this set of sequenced strains. A hierarchical clustering of variable genes allowed clear separation of the strains into clusters, including known pathotypes; clinically relevant serotypes can also be resolved in this way. In contrast, when in silico MLST was performed, many of the various strains appeared jumbled and less well resolved. The predicted pan-genome comprises 15,741 gene families, and only 993 (6%) of the families are represented in every genome, comprising the core genome. The variable or ‘accessory’ genes thus make up more than 90% of the pan-genome and about 80% of a typical genome; some of these variable genes tend to be co-localized on genomic islands. The diversity within the species E. coli, and the overlap in gene content between this and related species, suggests a continuum rather than sharp species borders in this group of Enterobacteriaceae. PMID:20623278
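
The pan- and core-genome definitions used in this abstract reduce to a set union and a set intersection over per-genome gene-family sets. The sketch below assumes that representation; the strain and family names are made up for illustration, and only the two counts (993 and 15,741) come from the abstract.

```python
def pan_and_core(genomes):
    """Return (pan, core): families seen in any genome / in every genome."""
    family_sets = [set(families) for families in genomes.values()]
    pan = set().union(*family_sets)
    core = set.intersection(*family_sets)
    return pan, core


# Hypothetical toy input: strain name -> gene families present.
genomes = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam5"},
}
pan, core = pan_and_core(genomes)
core_fraction = len(core) / len(pan)

# Arithmetic check on the figures reported in the abstract.
reported_fraction = 993 / 15741
```

With the reported counts, `reported_fraction` rounds to the stated 6%, which is why the accessory families account for more than 90% of the pan-genome.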

  1. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM.

    PubMed

    Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

    2015-01-01

    Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on the 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves favorable and competitive performance. This will offer an important complement to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences.

  2. Evolutionary bridges to new protein folds: design of C-terminal Cro protein chameleon sequences

    PubMed Central

    Anderson, William J.; Van Dorn, Laura O.; Ingram, Wendy M.; Cordes, Matthew H. J.

    2011-01-01

    Regions of amino-acid sequence that are compatible with multiple folds may facilitate evolutionary transitions in protein structure. In a previous study, we described a heuristically designed chameleon sequence (SASF1, structurally ambivalent sequence fragment 1) that could adopt either of two naturally occurring conformations (α-helical or β-sheet) when incorporated as part of the C-terminal dimerization subdomain of two structurally divergent transcription factors, P22 Cro and λ Cro. Here we describe longer chameleon designs (SASF2 and SASF3) that in the case of SASF3 correspond to the full C-terminal half of the ordered region of a P22 Cro/λ Cro sequence alignment (residues 34–57). P22-SASF2 and λWDD-SASF2 show moderate thermal stability in denaturation curves monitored by circular dichroism (Tm values of 46 and 55°C, respectively), while P22-SASF3 and λWDD-SASF3 have somewhat reduced stability (Tm values of 33 and 49°C, respectively). 13C and 1H NMR secondary chemical shift analysis confirms two C-terminal α-helices for P22-SASF2 (residues 36–45 and 54–57) and two C-terminal β-strands for λWDD-SASF2 (residues 40–45 and 50–52), corresponding to secondary structure locations in the two parent sequences. Backbone relaxation data show that both chameleon sequences have a relatively well-ordered structure. Comparisons of 15N-1H correlation spectra for SASF2 and SASF3-containing proteins strongly suggest that SASF3 retains the chameleonism of SASF2. Both Cro C-terminal conformations can be encoded in a single sequence, showing the plausibility of linking different Cro folds by smooth evolutionary transitions. The N-terminal subdomain, though largely conserved in structure, also exerts an important contextual influence on the structure of the C-terminal region. PMID:21676898

  3. Searching gene and protein sequence databases.

    PubMed

    Barsalou, T; Brutlag, D L

    1991-01-01

    A large-scale effort to map and sequence the human genome is now under way. Crucial to the success of this research is a group of computer programs that analyze and compare data on molecular sequences. This article describes the classic algorithms for similarity searching and sequence alignment. Because good performance of these algorithms is critical to searching very large and growing databases, we analyze the running times of the algorithms and discuss recent improvements in this area.
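
As one illustration of the kind of classic alignment algorithm and running-time analysis this abstract refers to, here is a minimal sketch of Smith-Waterman local alignment scoring. Whether the article presents the algorithm in this form is an assumption, and the scoring values (match 2, mismatch -1, linear gap -2) are arbitrary choices for the example, not parameters from the article.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b under a linear gap penalty.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACDE", "ACDE")
embedded = smith_waterman_score("XXACDEYY", "ZZACDEZZ")
unrelated = smith_waterman_score("AAAA", "CCCC")
```

The quadratic cost of filling the table is what makes exhaustive comparison against a large, growing database expensive and motivates faster heuristic search methods.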

  4. Nucleotide sequence of the coat protein gene of canine parvovirus.

    PubMed Central

    Rhode, S L

    1985-01-01

    The nucleotide sequence of the canine parvovirus (CPV2) from map units 33 to 95 has been determined. This includes the entire coat protein gene and noncoding sequences at the 3' end of the gene, exclusive of the terminal inverted repeat. The predicted capsid protein structures are discussed and compared with those of the rodent parvoviruses H-1 and MVM. PMID:3989914

  5. Detecting remotely related proteins by their interactions and sequence similarity

    PubMed Central

    Espadaler, Jordi; Aragüés, Ramón; Eswar, Narayanan; Marti-Renom, Marc A.; Querol, Enrique; Avilés, Francesc X.; Sali, Andrej; Oliva, Baldomero

    2005-01-01

    The function of an uncharacterized protein is usually inferred either from its homology to, or its interactions with, characterized proteins. Here, we use both sequence similarity and protein interactions to identify relationships between remotely related protein sequences. We rely on the fact that homologous sequences share similar interactions, and, therefore, the set of interacting partners of the partners of a given protein is enriched by its homologs. The approach was benchmarked by assigning the fold and functional family to test sequences of known structure. Specifically, we relied on 1,434 proteins with known folds, as defined in the Structural Classification of Proteins (SCOP) database, and with known interacting partners, as defined in the Database of Interacting Proteins (DIP). For this subset, the specificity of fold assignment was increased from 54% for position-specific iterative BLAST (PSI-BLAST) to 75% for our approach, with a concomitant increase in sensitivity of a few percentage points. Similarly, the specificity of family assignment at the e-value threshold of 10^-8 was increased from 70% to 87%. The proposed method would be a useful tool for large-scale automated discovery of remote relationships between protein sequences, given its unique reliance on sequence similarity and protein-protein interactions. PMID:15883372
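
The "partners of the partners" idea in this abstract can be sketched as a two-step neighbourhood lookup in an interaction network. The sketch assumes the network is stored as a dictionary mapping each protein to the set of its partners; the protein names and the function `partners_of_partners` are hypothetical, and the paper's subsequent sequence-similarity scoring is not reproduced here.

```python
def partners_of_partners(network, protein):
    """Collect every protein that shares at least one partner with `protein`."""
    candidates = set()
    for partner in network.get(protein, set()):
        candidates |= network.get(partner, set())
    candidates.discard(protein)  # the query is trivially its own second neighbour
    return candidates


# Hypothetical toy network: P1 and P2 both bind X; P1 and P3 both bind Y.
network = {
    "P1": {"X", "Y"},
    "P2": {"X"},
    "P3": {"Y", "Z"},
    "X": {"P1", "P2"},
    "Y": {"P1", "P3"},
    "Z": {"P3"},
}
candidates = partners_of_partners(network, "P1")
```

The resulting candidate set is the pool in which, according to the abstract, homologs of the query are enriched; a sequence-similarity search would then be run against that smaller pool.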

  6. Cloning and sequence of DNA encoding structural proteins of the autonomous parvovirus feline panleukopenia virus.

    PubMed Central

    Carlson, J; Rushlow, K; Maxwell, I; Maxwell, F; Winston, S; Hahn, W

    1985-01-01

    Approximately 80% of the genome of feline panleukopenia virus was cloned into pBR322. This DNA included the transcription unit for the major viral mRNA species. The nucleotide sequence of the cloned portion of the genome was determined. Comparison of the feline panleukopenia virus sequence with the sequences of the parvoviruses minute virus of mice and H-1 revealed considerable homology between the three viruses on both the nucleic acid and protein levels. Based on this homology, a model for the generation of the two size classes of viral structural proteins (VP1 and VP2') is proposed. PMID:2991581

  7. MIPS: a database for protein sequences and complete genomes.

    PubMed Central

    Mewes, H W; Hani, J; Pfeiffer, F; Frishman, D

    1998-01-01

    The MIPS group [Munich Information Center for Protein Sequences of the German National Center for Environment and Health (GSF)] at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, is involved in a number of data collection activities, including a comprehensive database of the yeast genome, a database reflecting the progress in sequencing the Arabidopsis thaliana genome, the systematic analysis of other small genomes and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). Through its WWW server (http://www.mips.biochem.mpg.de) MIPS provides access to a variety of generic databases, including a database of protein families as well as automatically generated data by the systematic application of sequence analysis algorithms. The yeast genome sequence and its related information was also compiled on CD-ROM to provide dynamic interactive access to the 16 chromosomes of the first eukaryotic genome unraveled. PMID:9399795

  8. Fold Recognition Using Sequence Fingerprints of Protein Local Substructures

    SciTech Connect

    Kryshtafovych, A A; Hvidsten, T; Komorowski, J; Fidelis, K

    2003-06-04

    A protein local substructure (descriptor) is a set of several short non-overlapping fragments of the polypeptide chain. Each descriptor describes the local environment of a particular residue and includes only those segments that are located in the proximity of this residue. Similar descriptors from the representative set of proteins were analyzed to reveal links between the substructures and sequences of their segments. Using the detected sequence-based fingerprints, specific geometrical conformations are assigned to new sequences. The ability of the approach to recognize correct SCOP folds was tested on 273 sequences from the 49 most popular folds. Good predictions were obtained in 85% of cases. No performance drop was observed with decreasing sequence similarity between target sequences and sequences from the training set of proteins.

  9. EVEREST: automatic identification and classification of protein domains in all protein sequences

    PubMed Central

    Portugaly, Elon; Harel, Amir; Linial, Nathan; Linial, Michal

    2006-01-01

    Background Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. Results Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. Conclusion The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. 
The

  10. EVEREST: automatic identification and classification of protein domains in all protein sequences.

    PubMed

    Portugaly, Elon; Harel, Amir; Linial, Nathan; Linial, Michal

    2006-06-02

    Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain

  11. Dynamics of domain coverage of the protein sequence universe.

    PubMed

    Rekapalli, Bhanu; Wuichet, Kristin; Peterson, Gregory D; Zhulin, Igor B

    2012-11-16

    The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its "dark matter". Here we suggest that the true size of "dark matter" is much larger than stated by current definitions. We propose an approach to reducing the size of "dark matter" by identifying and subtracting regions in protein sequences that are not likely to contain any domain. Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of "dark matter"; however, its absolute size increases substantially with the growth of sequence data.

  12. Dynamics of domain coverage of the protein sequence universe

    PubMed Central

    2012-01-01

    Background The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”. Results Here we suggest that the true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain. Conclusions Recent improvements in computational domain modeling result in a decrease, albeit a slow one, in the relative size of “dark matter”; however, its absolute size increases substantially with the growth of sequence data. PMID:23157439

  13. Intra-species sequence comparisons for annotating genomes

    SciTech Connect

    Boffelli, Dario; Weer, Claire V.; Weng, Li; Lewis, Keith D.; Shoukry, Malak I.; Pachter, Lior; Keys, David N.; Rubin, Edward M.

    2004-07-15

    Analysis of sequence variation among members of a single species offers a potential approach to identify functional DNA elements responsible for biological features unique to that species. Due to its high rate of allelic polymorphism and ease of genetic manipulability, we chose the sea squirt, Ciona intestinalis, to explore intra-species sequence comparisons for genome annotation. A large number of C. intestinalis specimens were collected from four continents and a set of genomic intervals amplified, resequenced and analyzed to determine the mutation rates at each nucleotide in the sequence. We found that regions with low mutation rates efficiently demarcated functionally constrained sequences: these include a set of noncoding elements, which we showed in C. intestinalis transgenic assays to act as tissue-specific enhancers, as well as the location of coding sequences. This illustrates that comparisons of multiple members of a species can be used for genome annotation, suggesting a path for the annotation of the sequenced genomes of organisms occupying uncharacterized phylogenetic branches of the animal kingdom, and raises the possibility that the resequencing of a large number of Homo sapiens individuals might be used to annotate the human genome and identify sequences defining traits unique to our species. The sequence data from this study have been submitted to GenBank under accession nos. AY667278-AY667407.

  14. Improved K-means clustering algorithm for exploring local protein sequence motifs representing common structural property.

    PubMed

    Zhong, Wei; Altun, Gulsah; Harrison, Robert; Tai, Phang C; Pan, Yi

    2005-09-01

    Information about local protein sequence motifs is very important to the analysis of biologically significant conserved regions of protein sequences. These conserved regions can potentially determine the diverse conformation and activities of proteins. In this work, recurring sequence motifs of proteins are explored with an improved K-means clustering algorithm on a new dataset. The structural similarity of these recurring sequence clusters to produce sequence motifs is studied in order to evaluate the relationship between sequence motifs and their structures. To the best of our knowledge, the dataset used by our research is the most updated dataset among similar studies for sequence motifs. A new greedy initialization method for the K-means algorithm is proposed to improve traditional K-means clustering techniques. The new initialization method tries to choose suitable initial points, which are well separated and have the potential to form high-quality clusters. Our experiments indicate that the improved K-means algorithm satisfactorily increases the percentage of sequence segments belonging to clusters with high structural similarity. Careful comparison of sequence motifs obtained by the improved and traditional algorithms also suggests that the improved K-means clustering algorithm may discover some relatively weak and subtle sequence motifs, which are undetectable by the traditional K-means algorithms. Many biochemical tests reported in the literature show that these sequence motifs are biologically meaningful. Experimental results also indicate that the improved K-means algorithm generates more detailed sequence motifs representing common structures than previous research. Furthermore, these motifs are universally conserved sequence patterns across protein families, overcoming some weak points of other popular sequence motifs. The satisfactory result of the experiment suggests that this new K-means algorithm may be applied to other areas of bioinformatics
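
    The initialization idea in the abstract above (choose initial points that are well separated before running K-means) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the function names, the `min_dist` threshold, and the random candidate selection are all hypothetical, and points are plain numeric tuples rather than sequence-segment profiles.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_init(points, k, min_dist, seed=0, max_tries=1000):
    """Greedily pick k initial centers that are pairwise at least min_dist apart.

    Hypothetical helper: a candidate is accepted only if it is far enough
    from every center accepted so far.
    """
    rng = random.Random(seed)
    centers = []
    tries = 0
    while len(centers) < k and tries < max_tries:
        candidate = rng.choice(points)
        tries += 1
        if all(euclidean(candidate, c) >= min_dist for c in centers):
            centers.append(candidate)
    if len(centers) < k:
        raise ValueError("could not find k well-separated points; lower min_dist")
    return centers


def kmeans(points, centers, iterations=20):
    """Standard Lloyd iterations (iterations >= 1) from the given centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters
```

    A call such as `kmeans(points, greedy_init(points, k, min_dist))` then runs ordinary K-means from the separated starting points; choosing `min_dist` is left to the user in this sketch.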

  15. Does protein relatedness require sequence matching? Alignment via networks in sequence space.

    PubMed

    Frenkel, Zakharia M

    2008-10-01

    To establish the possible function of a newly discovered protein, alignment of its sequence with other known sequences is required. When the similarity is marginal, the function remains uncertain. A fundamentally new approach is suggested: to use networks in the protein sequence space. The functionality of the protein is firmly established via networks forming chains of consecutive pair-wise matching fragments. Distant relatives are thus recognized as relatives even though, in some cases, there is no sequence match at all between the ends of the chain, while the entire chain belongs to the same functional and structural network.
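
    The notion of a chain of consecutive pair-wise matching fragments can be illustrated with a small graph search. The sketch below is an assumption-laden toy, not the method of the paper: fragments are equal-length strings, two fragments "match" when their fraction of identical positions reaches a hypothetical `threshold`, and a breadth-first search looks for a chain linking two fragments that need not match each other directly.

```python
from collections import deque


def identity(a, b):
    """Fraction of identical positions between two equal-length fragments."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def find_chain(fragments, start, goal, threshold=0.5):
    """Breadth-first search for a chain of consecutively matching fragments.

    Returns the list of fragments from start to goal, or None if the two
    fragments are not connected in the network.
    """
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the predecessor links back to the start.
            chain = []
            while current is not None:
                chain.append(current)
                current = previous[current]
            return chain[::-1]
        for other in fragments:
            if other not in previous and identity(current, other) >= threshold:
                previous[other] = current
                queue.append(other)
    return None
```

    In a toy network such as `["AAAAA", "AAACC", "ACCCC", "CCCCC"]`, the first and last fragments share no identical position, yet a chain through the two intermediate fragments connects them.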

  16. Sequencing proteins with transverse ionic transport in nanochannels

    PubMed Central

    Boynton, Paul; Di Ventra, Massimiliano

    2016-01-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer’s Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique’s potential for de novo protein sequencing. PMID:27140520

  17. Sequencing proteins with transverse ionic transport in nanochannels

    NASA Astrophysics Data System (ADS)

    Boynton, Paul; di Ventra, Massimiliano

    2016-05-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer’s Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique’s potential for de novo protein sequencing.

  18. Protein sequence classification with improved extreme learning machine algorithms.

    PubMed

    Cao, Jiuwen; Xiong, Lianglin

    2014-01-01

    Precisely classifying a protein sequence against a large biological protein sequence database plays an important role in developing competitive pharmacological products. Conventional methods, which compare the unseen sequence with all identified protein sequences and return the category index of the protein with the highest similarity score, are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single hidden layer feedforward neural networks (SLFNs). The recent efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimal pruned ELM (OP-ELM) is first employed for protein sequence classification in this paper. To further enhance the performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensemble members. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms.
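
    The majority-voting step described above can be sketched in a few lines. The example assumes each trained network is available as a callable that maps a sequence to a category index; these callables, the function names, and the tie-breaking rule (smallest label wins) are illustrative assumptions, since the abstract does not specify them.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of ensemble members."""
    counts = Counter(predictions)
    best = max(counts.values())
    # Assumed tie-break: choose the smallest label among the most frequent ones.
    return min(label for label, n in counts.items() if n == best)


def ensemble_predict(classifiers, sequence):
    """Collect one prediction per ensemble member and combine them by voting."""
    return majority_vote([clf(sequence) for clf in classifiers])
```

    Any set of classifiers exposing the same call signature could be combined this way; training the individual ELM networks is outside the scope of this sketch.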

  19. Alignment-free sequence comparison based on next-generation sequencing reads.

    PubMed

    Song, Kai; Ren, Jie; Zhai, Zhiyuan; Liu, Xuemei; Deng, Minghua; Sun, Fengzhu

    2013-02-01

    Next-generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, D2, D2* and D2S, both theoretically and by simulations. Theoretical formulas for the power of detecting the relationship between two sequences related through a common motif model are derived. It is shown that both D2* and D2S outperform D2 for detecting the relationship between two sequences based on NGS data. We then study the effects of length of the tuple, read length, coverage, and sequencing error on the power of D2* and D2S. Finally, variations of these statistics, d2, d2* and d2S, respectively, are used to first cluster five mammalian species with known phylogenetic relationships, and then cluster 13 tree species whose complete genome sequences are not available using NGS shotgun reads. The clustering results using d2S are consistent with biological knowledge for the 5 mammalian and 13 tree species, respectively. Thus, the statistic d2S provides a powerful alignment-free comparison tool to study the relationships among different organisms based on NGS read data without assembly.
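
    As a point of reference for the simplest of the three statistics, the sketch below computes a plain D2 value, the sum over all k-tuples of the product of their counts in two read sets. It is a minimal illustration under assumptions: the function names are hypothetical, reads are plain strings, and strand (reverse-complement) handling is ignored. The centered and normalized variants additionally subtract expected counts under a background model and are not shown.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-tuple occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def d2(reads_a, reads_b, k):
    """Plain D2: inner product of the k-tuple count vectors of two read sets."""
    x = kmer_counts(reads_a, k)
    y = kmer_counts(reads_b, k)
    # Counter returns 0 for tuples absent from y, so they contribute nothing.
    return sum(n * y[w] for w, n in x.items())
```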

  20. Amino acid sequences of proteins from Leptospira serovar pomona.

    PubMed

    Alves, S F; Lefebvre, R B; Probert, W

    2000-01-01

    This report describes partial amino acid sequences from three putative outer envelope proteins from Leptospira serovar pomona. In order to obtain internal fragments for protein sequencing, enzymatic and chemical digestion was performed. The enzyme clostripain was used to digest the 32 and 45 kDa proteins. In situ digestion of the 40 kDa protein was accomplished using cyanogen bromide. The 32 kDa protein generated two fragments, one of 21 kDa and another of 10 kDa that yielded five residues. A fragment of 24 kDa that yielded nineteen amino acid residues was obtained from the 45 kDa protein. A fragment with a molecular weight of 20 kDa, yielding a sequence of twenty amino acids, was obtained from the 40 kDa protein.

  1. Analysis of protein sequence/structure similarity relationships.

    PubMed Central

    Gan, Hin Hark; Perlow, Rebecca A; Roy, Sharmili; Ko, Joy; Wu, Min; Huang, Jing; Yan, Shixiang; Nicoletta, Angelo; Vafai, Jonathan; Sun, Ding; Wang, Lihua; Noah, Joyce E; Pasquali, Samuela; Schlick, Tamar

    2002-01-01

    Current analyses of protein sequence/structure relationships have focused on expected similarity relationships for structurally similar proteins. To survey and explore the basis of these relationships, we present a general sequence/structure map that covers all combinations of similarity/dissimilarity relationships and provide novel energetic analyses of these relationships. To aid our analysis, we divide protein relationships into four categories: expected/unexpected similarity (S and S(?)) and expected/unexpected dissimilarity (D and D(?)) relationships. In the expected similarity region S, we show that trends in the sequence/structure relation can be derived based on the requirement of protein stability and the energetics of sequence and structural changes. Specifically, we derive a formula relating sequence and structural deviations to a parameter characterizing protein stiffness; the formula fits the data reasonably well. We suggest that the absence of data in region S(?) (high structural but low sequence similarity) is due to unfavorable energetics. In contrast to region S, region D(?) (high sequence but low structural similarity) is well-represented by proteins that can accommodate large structural changes. Our analyses indicate that there are several categories of similarity relationships and that protein energetics provide a basis for understanding these relationships. PMID:12414710

  2. A new method to analyze protein sequence similarity using Dynamic Time Warping.

    PubMed

    Hou, Wenbing; Pan, Qiuhui; Peng, Qianying; He, Mingfeng

    2017-03-01

    Sequence similarity analysis is one of the major topics in bioinformatics. It helps researchers to reveal the evolutionary relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins by Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physico-chemical properties. We obtain the power spectra of sequences from DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested on different datasets and the results are compared with those of other software algorithms. In the comparison we find that our scheme can correct some misclassifications produced by other software. The comparison shows that our approach is reasonable and effective.
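
    The two building blocks named in the abstract, a DFT power spectrum and a DTW distance, can be sketched with the standard library alone. This is a generic textbook version under assumptions, not the authors' implementation: inputs are already numeric sequences (the mapping from residues to physico-chemical values is left to the caller), the local cost is the absolute difference, and the step that extends spectra to a common length is omitted.

```python
import cmath


def power_spectrum(x):
    """Power spectrum |X_k|^2 of a numeric sequence via a direct O(n^2) DFT."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n)
    ]


def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

    Under these assumptions, a distance between two proteins would be obtained as `dtw_distance(power_spectrum(x), power_spectrum(y))` for their numeric encodings `x` and `y`.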

  3. A 3D sequence-independent representation of the protein data bank.

    PubMed

    Fischer, D; Tsai, C J; Nussinov, R; Wolfson, H

    1995-10-01

    Here we address the following questions. How many structurally different entries are there in the Protein Data Bank (PDB)? How do the proteins populate the structural universe? To investigate these questions a structurally non-redundant set of representative entries was selected from the PDB. Construction of such a dataset is not trivial: (i) the considerable size of the PDB requires a large number of comparisons (there were more than 3250 structures of protein chains available in May 1994); (ii) the PDB is highly redundant, containing many structurally similar entries, not necessarily with significant sequence homology, and (iii) there is no clear-cut definition of structural similarity. The latter depends on the criteria and methods used. Here, we analyze structural similarity ignoring protein topology. To date, representative sets have been selected either by hand, by sequence comparison techniques which ignore the three-dimensional (3D) structures of the proteins or by using sequence comparisons followed by linear structural comparison (i.e. the topology, or the sequential order of the chains, is enforced in the structural comparison). Here we describe a 3D sequence-independent automated and efficient method to obtain a representative set of protein molecules from the PDB which contains all unique structures and which is structurally non-redundant. The method has two novel features. The first is the use of strictly structural criteria in the selection process without taking into account the sequence information. To this end we employ a fast structural comparison algorithm which requires on average approximately 2 s per pairwise comparison on a workstation. The second novel feature is the iterative application of a heuristic clustering algorithm that greatly reduces the number of comparisons required. We obtain a representative set of 220 chains with resolution better than 3.0 Å, or 268 chains including lower resolution entries, NMR entries and models. The

  4. Comparison of mitochondrial genome sequences of pangolins (Mammalia, Pholidota).

    PubMed

    Hassanin, Alexandre; Hugot, Jean-Pierre; van Vuuren, Bettine Jansen

    2015-04-01

    The complete mitochondrial genome was sequenced for three species of pangolins, Manis javanica, Phataginus tricuspis, and Smutsia temminckii, and comparisons were made with two other species, Manis pentadactyla and Phataginus tetradactyla. The genome of Manidae contains the 37 genes found in a typical mammalian genome, and the structure of the control region is highly conserved among species. In Manis, the overall base composition differs from that found in African genera. Phylogenetic analyses support the monophyly of the genera Manis, Phataginus, and Smutsia, as well as the basal division between Maninae and Smutsiinae. Comparisons with GenBank sequences reveal that the reference genomes of M. pentadactyla and P. tetradactyla (accession numbers NC_016008 and NC_004027) were sequenced from misidentified taxa, and that a new species of tree pangolin should be described in Gabon.

  5. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment.

    PubMed

    Wright, Erik S

    2015-10-06

    Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets. Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. 
The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the

  6. What Makes a Protein Sequence a Prion?

    PubMed Central

    Sabate, Raimon; Rousseau, Frederic; Schymkowitz, Joost; Ventura, Salvador

    2015-01-01

    Typical amyloid diseases such as Alzheimer's and Parkinson's were thought to exclusively result from de novo aggregation, but recently it was shown that amyloids formed in one cell can cross-seed aggregation in other cells, following a prion-like mechanism. Despite the large experimental effort devoted to understanding the phenomenon of prion transmissibility, it is still poorly understood how this property is encoded in the primary sequence. In many cases, prion structural conversion is driven by the presence of relatively large glutamine/asparagine (Q/N) enriched segments. Several studies suggest that it is the amino acid composition of these regions rather than their specific sequence that accounts for their priogenicity. However, our analysis indicates that it is instead the presence and potency of specific short amyloid-prone sequences that occur within intrinsically disordered Q/N-rich regions that determine their prion behaviour, modulated by the structural and compositional context. This provides a basis for the accurate identification and evaluation of prion candidate sequences in proteomes in the context of a unified framework for amyloid formation and prion propagation. PMID:25569335

  7. A convenient and adaptable microcomputer environment for DNA and protein sequence manipulation and analysis.

    PubMed Central

    Pustell, J; Kafatos, F C

    1986-01-01

    We describe the further development of a widely used package of DNA and protein sequence analysis programs for microcomputers (1,2,3). The package now provides a screen oriented user interface, and an enhanced working environment with powerful formatting, disk access, and memory management tools. The new GenBank floppy disk database is supported transparently to the user and a similar version of the NBRF protein database is provided. The programs can use sequence file annotation to automatically annotate printouts and translate or extract specified regions from sequences by name. The sequence comparison programs can now perform a 5000 X 5000 bp analysis in 12 minutes on an IBM PC. A program to locate potential protein coding regions in nucleic acids, a digitizer interface, and other additions are also described. PMID:3753784

  8. Predicted networks of protein-protein interactions in Stegodyphus mimosarum by cross-species comparisons.

    PubMed

    Wang, Xiu; Jin, Yongfeng

    2017-09-11

    Stegodyphus mimosarum is a candidate model organism belonging to the class Arachnida in the phylum Arthropoda. Studies on the biology of S. mimosarum over the past several decades have consisted of behavioral research and comparison of gene sequences based on the assembled genome sequence. Given the lack of systematic protein analyses and the rich source of information in the genome, we predicted the relationships of proteins in S. mimosarum by bioinformatics comparison with genome-wide proteins from select model organisms using gene mapping. The protein-protein interactions (PPIs) of 11 organisms were integrated from four databases (BioGrid, IntAct, MINT, and DIP). Here, we present comprehensive prediction and analysis of 3810 proteins in S. mimosarum with regard to interactions between proteins using PPI data of organisms. Interestingly, a portion of the protein interactions conserved among Saccharomyces cerevisiae, Homo sapiens, Arabidopsis thaliana, and Drosophila melanogaster were found to be associated with RNA splicing. In addition, overlap of predicted PPIs in reference organisms, Gene Ontology, and topology models in S. mimosarum are also reported. Addition of Stegodyphus, a spider representative of interactomic research, provides the possibility of obtaining deeper insights into the evolution of PPI networks among different animal species. This work largely supports the utility of the "stratus clouds" model for predicted PPIs, providing a roadmap for integrative systems biology in S. mimosarum.

  9. Dynein light chain association sequences can facilitate nuclear protein import.

    PubMed

    Moseley, Gregory W; Roth, Daniela Martino; DeJesus, Michelle A; Leyton, Denisse L; Filmer, Richard P; Pouton, Colin W; Jans, David A

    2007-08-01

    Nuclear localization sequence (NLS)-dependent nuclear protein import is not conventionally held to require interaction with microtubules (MTs) or components of the MT motor, dynein. Here we report for the first time the role of sequences conferring association with dynein light chains (DLCs) in NLS-dependent nuclear accumulation of the rabies virus P-protein. We find that P-protein nuclear accumulation is significantly enhanced by its dynein light chain association sequence (DLC-AS), dependent on MT integrity and association with DLCs, and that P-protein-DLC complexes can associate with MT cytoskeletal structures. We also find that P-protein DLC-AS, as well as analogous sequences from other proteins, acts as an independent module that can confer enhancement of nuclear accumulation to proteins carrying the P-protein NLS, as well as several heterologous NLSs. Photobleaching experiments in live cells demonstrate that the MT-dependent enhancement of NLS-mediated nuclear accumulation by the P-protein DLC-AS involves an increased rate of nuclear import. This is the first report of DLC-AS enhancement of NLS function, identifying a novel mechanism regulating nuclear transport with relevance to viral and cellular protein biology. Importantly, these data indicate that DLC-ASs represent versatile modules to enhance nuclear delivery with potential therapeutic application.

  10. Metabolic pathways variability and sequence/networks comparisons

    PubMed Central

    Tun, Kyaw; Dhar, Pawan K; Palumbo, Maria Concetta; Giuliani, Alessandro

    2006-01-01

    Background In this work a simple method for the computation of relative similarities between homologous metabolic network modules is presented. The method is similar to classical sequence alignment and allows for the generation of phenotypic trees that can be compared with the corresponding sequence-based trees. The procedure can be applied to both single metabolic modules and whole metabolic network data without the need for any specific assumption. Results We demonstrate both the ability of the proposed method to build a reliable biological classification of a set of microorganisms and the strong correlation between the metabolic network wiring and the sequence space of the enzymes involved. Conclusion The method represents a valuable tool for the investigation of genotype/phenotype correlations, allowing for a direct comparison of different species with respect to their metabolic machinery. In addition, the detection of enzymes whose sequence space is maximally correlated with the metabolic network space gives an indication of the most crucial (from an evolutionary viewpoint) steps of the metabolic process. PMID:16420696

  11. Base-sequence-dependent sliding of proteins on DNA.

    PubMed

    Barbi, M; Place, C; Popkov, V; Salerno, M

    2004-10-01

    The possibility that the sliding motion of proteins on DNA is influenced by the base sequence through a base pair reading interaction is considered. Referring to the case of the T7 RNA-polymerase, we show that the protein should follow a noise-influenced sequence-dependent motion which deviates from the standard random walk usually assumed. The general validity and the implications of the results are discussed.

  12. Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences.

    PubMed

    Naylor, G J; Brown, W M

    1998-03-01

    Analyses of both the nucleotide and amino acid sequences derived from all 13 mitochondrial protein-encoding genes (12,234 bp) of 19 metazoan species, including that of the lancelet Branchiostoma floridae ("amphioxus"), fail to yield the widely accepted phylogeny for chordates and, within chordates, for vertebrates. Given the breadth and the compelling nature of the data supporting that phylogeny, relationships supported by the mitochondrial sequence comparisons are almost certainly incorrect, despite their being supported by equally weighted parsimony, distance, and maximum-likelihood analyses. The incorrect groupings probably result in part from convergent base-compositional similarities among some of the taxa, similarities that are strong enough to overwhelm the historical signal. Comparisons among very distantly related taxa are likely to be particularly susceptible to such artifacts, because the historical signal is already greatly attenuated. Empirical results underscore the need for approaches to phylogenetic inference that go beyond simple site-by-site comparison of aligned sequences. This study and others indicate that, once a sequence sample of reasonable size has been obtained, accurate phylogenetic estimation may be better served by incorporating knowledge of molecular structures and processes into inference models and by seeking additional higher order characters embedded in those sequences, than by gathering ever larger sequence samples from the same organisms in the hope that the historical signal will eventually prevail.

  13. PairsDB atlas of protein sequence space.

    PubMed

    Heger, Andreas; Korpelainen, Eija; Hupponen, Taavi; Mattila, Kimmo; Ollikainen, Vesa; Holm, Liisa

    2008-01-01

    Sequence similarity/database searching is a cornerstone of molecular biology. PairsDB is a database intended to make exploring protein sequences and their similarity relationships quick and easy. Behind PairsDB is a comprehensive collection of protein sequences and BLAST and PSI-BLAST alignments between them. Instead of running BLAST or PSI-BLAST individually on each request, results are retrieved instantaneously from a database of pre-computed alignments. Filtering options allow you to find a set of sequences satisfying a set of criteria, for example, all human proteins with solved structure and without transmembrane segments. PairsDB is continually updated and covers all sequences in UniProt. The data is stored in a MySQL relational database. Data files will be made available for download at ftp://nic.funet.fi/pub/sci/molbio. PairsDB can also be accessed interactively at http://pairsdb.csc.fi. PairsDB data is a valuable platform to build various downstream automated analysis pipelines. For example, the graph of all-against-all similarity relationships is the starting point for clustering protein families, delineating domains, improving alignment accuracy by consistency measures, and defining orthologous genes. Moreover, query-anchored stacked sequence alignments, profiles and consensus sequences are useful in studies of sequence conservation patterns for clues about possible functional sites.

  14. Can computationally designed protein sequences improve secondary structure prediction?

    PubMed

    Bondugula, Rajkumar; Wallqvist, Anders; Lee, Michael S

    2011-05-01

    Computational sequence design methods are used to engineer proteins with desired properties such as increased thermal stability and novel function. In addition, these algorithms can be used to identify an envelope of sequences that may be compatible with a particular protein fold topology. In this regard, we hypothesized that sequence-property prediction, specifically secondary structure, could be significantly enhanced by using a large database of computationally designed sequences. We performed a large-scale test of this hypothesis with 6511 diverse protein domains and 50 designed sequences per domain. After analysis of the inherent accuracy of the designed sequences database, we realized that it was necessary to put constraints on what fraction of the native sequence should be allowed to change. With mutational constraints, accuracy was improved vs. no constraints, but the diversity of designed sequences, and hence effective size of the database, was moderately reduced. Overall, the best three-state prediction accuracy (Q3) that we achieved was nearly a percentage point improved over using a natural sequence database alone, well below the theoretical possibility for improvement of 8-10 percentage points. Furthermore, our nascent method was used to augment the state-of-the-art PSIPRED program by a percentage point.
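
    For readers unfamiliar with the three-state accuracy quoted above, it is commonly computed as the percentage of residues whose predicted state (helix, strand, or coil) equals the observed state. The short sketch below assumes both assignments are given as equal-length strings over the letters H, E and C; the function name is illustrative.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty assignment")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

    On this scale, an improvement of "a percentage point" corresponds to one additional correctly predicted residue per hundred.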

  15. Using homology relations within a database markedly boosts protein sequence similarity search.

    PubMed

    Tong, Jing; Sadreyev, Ruslan I; Pei, Jimin; Kinch, Lisa N; Grishin, Nick V

    2015-06-02

    Inference of homology from protein sequences provides an essential tool for analyzing protein structure, function, and evolution. Current sequence-based homology search methods are still unable to detect many similarities evident from protein spatial structures. In computer science a search engine can be improved by considering networks of known relationships within the search database. Here, we apply this idea to protein-sequence-based homology search and show that it dramatically enhances the search accuracy. Our new method, COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query and hit's known homologs. This approach increases detection quality, boosting the precision rate from 18% to 83% at half-coverage of all database homologs. The increased precision rate allows detection of a large fraction of protein structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities. The web server is available at prodata.swmed.edu/compadre.

  16. A probabilistic measure for alignment-free sequence comparison.

    PubMed

    Pham, Tuan D; Zuegg, Johannes

    2004-12-12

    Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models. The method was tested against six DNA sequences, which are the thrA, thrB and thrC genes of the threonine operons from Escherichia coli K-12 and from Shigella flexneri; and one random sequence having the same base composition as thrA from E.coli. These results were compared with those obtained from CLUSTAL W algorithm (alignment-based) and the chaos game representation (alignment-free). The method was further tested against a more complex set of 40 DNA sequences and compared with other existing sequence similarity measures (alignment-free). All datasets and computer codes written in MATLAB are available upon request from the first author.
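    The idea of comparing two sequences through Markov models built from them can be sketched as follows. This is not the authors' measure, which the abstract does not specify (their code was written in MATLAB). It is an assumed stand-in: a first-order Markov chain is fitted to each sequence with additive smoothing, and the dissimilarity is a symmetrised difference in per-transition log-likelihood. All function names and the pseudocount are hypothetical.

```python
import math

ALPHABET = "ACGT"


def train_markov(seq, alphabet=ALPHABET, pseudocount=1.0):
    """First-order transition probabilities P(next | current), smoothed."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for cur, nxt in zip(seq, seq[1:]):
        counts[cur][nxt] += 1.0
    model = {}
    for a in alphabet:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in alphabet}
    return model


def avg_log_likelihood(seq, model):
    """Mean log-probability per transition of seq under the given model."""
    n_transitions = len(seq) - 1
    if n_transitions < 1:
        raise ValueError("sequence must contain at least two symbols")
    total = sum(math.log(model[cur][nxt]) for cur, nxt in zip(seq, seq[1:]))
    return total / n_transitions


def markov_dissimilarity(seq_a, seq_b):
    """Symmetrised log-likelihood difference; 0.0 for identical sequences."""
    model_a = train_markov(seq_a)
    model_b = train_markov(seq_b)
    d_ab = avg_log_likelihood(seq_a, model_a) - avg_log_likelihood(seq_a, model_b)
    d_ba = avg_log_likelihood(seq_b, model_b) - avg_log_likelihood(seq_b, model_a)
    return 0.5 * (d_ab + d_ba)


print(markov_dissimilarity("ACGTACGTACGTACGT", "AAAACCCCGGGGTTTT"))
```

    Because no alignment is computed, the cost grows linearly with sequence length, which is the practical appeal of alignment-free measures of this kind.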

  17. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm.

    PubMed

    Kumar, Manish

    2015-01-01

    One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic goal of multiple sequence alignment is to determine the most biologically plausible alignments of protein or DNA sequences. In this paper, an alignment method using a genetic algorithm for multiple sequence alignment is proposed. Two genetic operators, crossover and mutation, were defined and implemented with the proposed method in order to follow the population evolution and the quality of the aligned sequences. The proposed method is assessed with a protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8 etc. Experiments on a wide range of data have shown that the proposed algorithm is much better (in terms of score) than previously proposed algorithms in its ability to achieve high alignment quality.

  18. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm

    PubMed Central

    Kumar, Manish

    2015-01-01

    One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic goal of multiple sequence alignment is to determine the most biologically plausible alignments of protein or DNA sequences. In this paper, an alignment method using a genetic algorithm for multiple sequence alignment is proposed. Two genetic operators, crossover and mutation, were defined and implemented with the proposed method in order to follow the population evolution and the quality of the aligned sequences. The proposed method is assessed with a protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8 etc. Experiments on a wide range of data have shown that the proposed algorithm is much better (in terms of score) than previously proposed algorithms in its ability to achieve high alignment quality. PMID:27065770

  19. Nucleotide sequences of the coat protein genes of two Japanese zucchini yellow mosaic virus isolates.

    PubMed

    Kundu, A K; Ohshima, K; Sako, N

    1997-10-01

    The nucleotide (nt) sequences of the coat protein (CP) genes of two Japanese zucchini yellow mosaic virus (ZYMV) isolates (ZYMV-169 and ZYMV-M) were determined. The CP genes of both isolates were 837 nt long and encoded 279 amino acids (aa). The nt and deduced aa sequence similarities between the two isolates were 92% and 94.6%, respectively. The deduced aa sequences of CPs of the Japanese isolates were compared with those of previously reported ZYMV isolates by phylogenetic analysis. This comparison led us to divide all ZYMV isolates into 3 groups in which ZYMV-169 formed its own distinct group.

  20. Protein 3D Structure Computed from Evolutionary Sequence Variation

    PubMed Central

    Sheridan, Robert; Hopf, Thomas A.; Pagnani, Andrea; Zecchina, Riccardo; Sander, Chris

    2011-01-01

    The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein

  1. Protein 3D structure computed from evolutionary sequence variation.

    PubMed

    Marks, Debora S; Colwell, Lucy J; Sheridan, Robert; Hopf, Thomas A; Pagnani, Andrea; Zecchina, Riccardo; Sander, Chris

    2011-01-01

    The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7-4.8 Å C(α)-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures

  2. Evolutionary bridges to new protein folds: design of C-terminal Cro protein chameleon sequences.

    PubMed

    Anderson, William J; Van Dorn, Laura O; Ingram, Wendy M; Cordes, Matthew H J

    2011-09-01

    Regions of amino-acid sequence that are compatible with multiple folds may facilitate evolutionary transitions in protein structure. In a previous study, we described a heuristically designed chameleon sequence (SASF1, structurally ambivalent sequence fragment 1) that could adopt either of two naturally occurring conformations (α-helical or β-sheet) when incorporated as part of the C-terminal dimerization subdomain of two structurally divergent transcription factors, P22 Cro and λ Cro. Here we describe longer chameleon designs (SASF2 and SASF3) that in the case of SASF3 correspond to the full C-terminal half of the ordered region of a P22 Cro/λ Cro sequence alignment (residues 34-57). P22-SASF2 and λ(WDD)-SASF2 show moderate thermal stability in denaturation curves monitored by circular dichroism (T(m) values of 46 and 55°C, respectively), while P22-SASF3 and λ(WDD)-SASF3 have somewhat reduced stability (T(m) values of 33 and 49°C, respectively). (13)C and (1)H NMR secondary chemical shift analysis confirms two C-terminal α-helices for P22-SASF2 (residues 36-45 and 54-57) and two C-terminal β-strands for λ(WDD)-SASF2 (residues 40-45 and 50-52), corresponding to secondary structure locations in the two parent sequences. Backbone relaxation data show that both chameleon sequences have a relatively well-ordered structure. Comparisons of (15)N-(1)H correlation spectra for SASF2 and SASF3-containing proteins strongly suggest that SASF3 retains the chameleonism of SASF2. Both Cro C-terminal conformations can be encoded in a single sequence, showing the plausibility of linking different Cro folds by smooth evolutionary transitions. The N-terminal subdomain, though largely conserved in structure, also exerts an important contextual influence on the structure of the C-terminal region.

  3. 3D structures of membrane proteins from genomic sequencing

    PubMed Central

    Hopf, Thomas A.; Colwell, Lucy J.; Sheridan, Robert; Rost, Burkhard; Sander, Chris; Marks, Debora S.

    2012-01-01

    Summary We show that amino acid co-variation in proteins, extracted from the evolutionary sequence record, can be used to fold transmembrane proteins. We use this technique to predict previously unknown, 3D structures for 11 transmembrane proteins (with up to 14 helices) from their sequences alone. The prediction method (EVfold_membrane), applies a maximum entropy approach to infer evolutionary co-variation in pairs of sequence positions within a protein family and then generates all-atom models with the derived pairwise distance constraints. We benchmark the approach with blinded, de novo computation of known transmembrane protein structures from 23 families, demonstrating unprecedented accuracy of the method for large transmembrane proteins. We show how the method can predict oligomerization, functional sites, and conformational changes in transmembrane proteins. With the rapid rise in large-scale sequencing, more accurate and more comprehensive information on evolutionary constraints can be decoded from genetic variation, greatly expanding the repertoire of transmembrane proteins amenable to modelling by this method. PMID:22579045

  4. Relation between sequence and structure in membrane proteins.

    PubMed

    Olivella, Mireia; Gonzalez, Angel; Pardo, Leonardo; Deupi, Xavier

    2013-07-01

    Integral polytopic membrane proteins contain only two types of folds in their transmembrane domains: α-helix bundles and β-barrels. The increasing number of available crystal structures of these proteins permits an initial estimation of how sequence variability affects the structure conservation in their transmembrane domains. We, thus, aim to determine the pairwise sequence identity necessary to maintain the transmembrane molecular architectures compatible with the hydrophobic nature of the lipid bilayer. Root-mean-square deviation (rmsd) and sequence identity were calculated from the structural alignments of pairs of homologous polytopic membrane proteins sharing the same fold. Analysis of these data reveals that transmembrane segment pairs with sequence identity in the so-called 'twilight zone' (20-35%) display high-structural similarity (rmsd < 1.5 Å). Moreover, a large group of β-barrel pairs with low-sequence identity (<20%) still maintain a close structural similarity (rmsd < 2.5 Å). Thus, we conclude that fold preservation in transmembrane regions requires less sequence conservation than for globular proteins. These findings have direct implications in homology modeling of evolutionary-related membrane proteins. Supplementary data are available at Bioinformatics online.
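    The two quantities compared in this study, pairwise sequence identity and rmsd, can be computed from a structural alignment roughly as sketched below. The sketch assumes the alignment is given as two gapped strings and that the paired coordinates have already been superposed by the structural alignment program; the function names are hypothetical, and identity is normalised by the number of gap-free columns, which is only one of several conventions in use.

```python
import math


def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no aligned residue pairs")
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (same unit as the input, e.g. angstroms).

    coords_a and coords_b are equal-length lists of (x, y, z) tuples for
    paired atoms that are assumed to be superposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


print(sequence_identity("AC-GT", "ACTGA"))
print(rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]))
```

    A transmembrane segment pair would fall in the 'twilight zone' described above if the first function returned a value between 20 and 35.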

  5. Molecular sled sequences are common in mammalian proteins

    PubMed Central

    Xiong, Kan; Blainey, Paul C.

    2016-01-01

    Recent work revealed a new class of molecular machines called molecular sleds, which are small basic molecules that bind and slide along DNA with the ability to carry cargo along DNA. Here, we performed biochemical and single-molecule flow stretching assays to investigate the basis of sliding activity in molecular sleds. In particular, we identified the functional core of pVIc, the first molecular sled characterized; peptide functional groups that control sliding activity; and propose a model for the sliding activity of molecular sleds. We also observed widespread DNA binding and sliding activity among basic polypeptide sequences that implicate mammalian nuclear localization sequences and many cell penetrating peptides as molecular sleds. These basic protein motifs exhibit weak but physiologically relevant sequence-nonspecific DNA affinity. Our findings indicate that many mammalian proteins contain molecular sled sequences and suggest the possibility that substantial undiscovered sliding activity exists among nuclear mammalian proteins. PMID:26857546

  6. Quantiprot - a Python package for quantitative analysis of protein sequences.

    PubMed

    Konopka, Bogumił M; Marciniak, Marta; Dyrka, Witold

    2017-07-17

    The field of protein sequence analysis is dominated by tools rooted in substitution matrices and alignments. A complementary approach is provided by methods of quantitative characterization. A major advantage of the approach is that quantitative properties define a multidimensional solution space, where sequences can be related to each other and differences can be meaningfully interpreted. Quantiprot is a software package in Python, which provides a simple and consistent interface to multiple methods for quantitative characterization of protein sequences. The package can be used to calculate dozens of characteristics directly from sequences or using physico-chemical properties of amino acids. Besides basic measures, Quantiprot performs quantitative analysis of recurrence and determinism in the sequence, calculates distribution of n-grams and computes the Zipf's law coefficient. We propose three main fields of application of the Quantiprot package. First, quantitative characteristics can be used in alignment-free similarity searches, and in clustering of large and/or divergent sequence sets. Second, a feature space defined by quantitative properties can be used in comparative studies of protein families and organisms. Third, the feature space can be used for evaluating generative models, where a large number of sequences generated by the model can be compared to actually observed sequences.
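    Two of the characteristics named above, the n-gram distribution and a Zipf's law coefficient, can be illustrated in plain Python. The sketch deliberately does not use Quantiprot's own interface, which the abstract does not show; the function names are hypothetical, and the coefficient is taken here to be the least-squares slope of log frequency against log rank, which may differ from the package's definition.

```python
import math
from collections import Counter


def ngram_counts(seq, n=2):
    """Count overlapping n-grams in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 corresponds to the classic Zipf relationship.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct n-grams")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy peptide, chosen only for illustration.
counts = ngram_counts("MKVLAAGIVALLLAAGCSSA", n=2)
print(counts.most_common(3))
print(zipf_slope(counts))
```

    Values such as these can serve as coordinates in the feature space the abstract describes, so that sequences are compared without alignment.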

  7. Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences.

    PubMed

    Doğan, Tunca; Karaçalı, Bilge

    2013-01-01

    Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members to existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and assessing the clustering performance in comparison with previous methods from the literature. We have then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations on the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.

  8. Sequence walkers: a graphical method to display how binding proteins interact with DNA or RNA sequences.

    PubMed Central

    Schneider, T D

    1997-01-01

    A graphical method is presented for displaying how binding proteins and other macromolecules interact with individual bases of nucleotide sequences. Characters representing the sequence are either oriented normally and placed above a line indicating favorable contact, or upside-down and placed below the line indicating unfavorable contact. The positive or negative height of each letter shows the contribution of that base to the average sequence conservation of the binding site, as represented by a sequence logo. These sequence 'walkers' can be stepped along raw sequence data to visually search for binding sites. Many walkers, for the same or different proteins, can be simultaneously placed next to a sequence to create a quantitative map of a complex genetic region. One can alter the sequence to quantitatively engineer binding sites. Database anomalies can be visualized by placing a walker at the recorded positions of a binding molecule and by comparing this to locations found by scanning the nearby sequences. The sequence can also be altered to predict whether a change is a polymorphism or a mutation for the recognizer being modeled. PMID:9336476
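    The letter heights of a walker can be sketched numerically. The code below assumes the individual-information weight for base b at position l is 2 + log2 f(b, l) bits, where f is the frequency of b at that position in a set of aligned binding sites. It omits the small-sample correction and adds a pseudocount so unseen bases get a finite negative height; both simplifications, and all names, are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def weight_matrix(sites, pseudocount=0.5):
    """Per-position weights 2 + log2 f(b, l), in bits, from aligned sites.

    With pseudocount=0 a base never seen at a position has frequency 0
    and math.log2 raises ValueError, hence the smoothing by default.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append(
            {b: 2.0 + math.log2((column.count(b) + pseudocount) / total) for b in BASES}
        )
    return matrix


def walker(matrix, sequence, start):
    """Letter heights for the window of `sequence` beginning at `start`.

    Positive heights would be drawn above the line (favorable contact),
    negative heights upside-down below it (unfavorable contact).
    """
    window = sequence[start:start + len(matrix)]
    if len(window) < len(matrix):
        raise ValueError("window runs past the end of the sequence")
    return [matrix[pos][base] for pos, base in enumerate(window)]


sites = ["TATA", "TATA", "TATT", "TACA"]  # toy aligned sites, not real data
matrix = weight_matrix(sites)
for start in range(5):
    heights = walker(matrix, "GGTATAGG", start)
    print(start, round(sum(heights), 2))
```

    Stepping `start` along the sequence mirrors how a walker is moved along raw sequence data: the window with the largest summed height is the best candidate site under this toy model.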

  9. Successful Recovery of Nuclear Protein-Coding Genes from Small Insects in Museums Using Illumina Sequencing.

    PubMed

    Kanda, Kojun; Pflug, James M; Sproul, John S; Dasenko, Mark A; Maddison, David R

    2015-01-01

    In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles

  10. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies.

    PubMed

    Atkinson, Holly J; Morris, John H; Ferrin, Thomas E; Babbitt, Patricia C

    2009-01-01

    The dramatic increase in heterogeneous types of biological data--in particular, the abundance of new protein sequences--requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity--GPCRs and kinases from humans, and the crotonase superfamily of enzymes--we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.
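    A sequence similarity network of the kind discussed here treats each sequence as a node and draws an edge between two nodes when their pairwise similarity passes a chosen cutoff. The sketch below assumes the pairwise scores are E-values already computed elsewhere (for example by BLAST) and supplied as a dictionary; the identifiers, values and cutoff are invented for illustration.

```python
def build_network(scores, max_evalue):
    """Adjacency sets linking pairs whose E-value is at or below the cutoff."""
    graph = {}
    for (a, b), evalue in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if evalue <= max_evalue:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Groups of nodes joined by edges, found by depth-first traversal."""
    seen = set()
    components = []
    for node in sorted(graph):
        if node in seen:
            continue
        stack = [node]
        component = set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical pairwise E-values between five proteins.
scores = {
    ("p1", "p2"): 1e-40,
    ("p2", "p3"): 1e-25,
    ("p3", "p4"): 1e-3,
    ("p4", "p5"): 1e-30,
}
graph = build_network(scores, max_evalue=1e-20)
print(connected_components(graph))
```

    Tightening or relaxing `max_evalue` splits or merges the groups, which is one reason the choice of cutoff belongs among the caveats the authors mention.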

  11. Taxonomic colouring of phylogenetic trees of protein sequences.

    PubMed

    Palidwor, Gareth; Reynaud, Emmanuel G; Andrade-Navarro, Miguel A

    2006-02-17

    Phylogenetic analyses of protein families are used to define the evolutionary relationships between homologous proteins. The interpretation of protein-sequence phylogenetic trees requires the examination of the taxonomic properties of the species associated to those sequences. However, there is no online tool to facilitate this interpretation, for example, by automatically attaching taxonomic information to the nodes of a tree, or by interactively colouring the branches of a tree according to any combination of taxonomic divisions. This is especially problematic if the tree contains on the order of hundreds of sequences, which, given the accelerated increase in the size of the protein sequence databases, is a situation that is becoming common. We have developed PhyloView, a web based tool for colouring phylogenetic trees upon arbitrary taxonomic properties of the species represented in a protein sequence phylogenetic tree. Provided that the tree contains SwissProt, SpTrembl, or GenBank protein identifiers, the tool retrieves the taxonomic information from the corresponding database. A colour picker displays a summary of the findings and allows the user to associate colours to the leaves of the tree according to any number of taxonomic partitions. Then, the colours are propagated to the branches of the tree. PhyloView can be used at http://www.ogic.ca/projects/phyloview/. A tutorial, the software with documentation, and GPL licensed source code, can be accessed at the same web address.

  12. Full Protein Sequence Redesign with an MMGBSA Energy Function.

    PubMed

    Gaillard, Thomas; Simonson, Thomas

    2017-10-10

    Computational protein design aims to create proteins with novel properties. A key element is the energy or scoring function used to select the sequences and conformations. We study the performance of an "MMGBSA" energy function, which combines molecular mechanics terms, a generalized Born and surface area (GBSA) solvent model, with approximations that make the model pairwise additive. Our approach is implemented in the Proteus software. The use of a physics-based energy function ensures a certain model transferability and explanatory power. As a first test, we redesign the sequence of nine proteins, one position at a time, with the rest of the protein having its native sequence and crystallographic conformation. As a second test, all positions are designed together. The contributions of individual energy terms are evaluated, and various parametrizations are compared. We find that the GB term significantly improves the results compared to simple Coulomb electrostatics but is affected by pairwise decomposition errors when all positions are designed together. The SA term, with distinct energy coefficients for nonpolar and polar atoms, makes a decisive contribution to obtain realistic protein sequences and can partially compensate for the absence of a GB term. With the best GBSA protocol, we obtain nativelike protein cores and Superfamily recognition of almost all of our sequences.

  13. Delineation of modular proteins: domain boundary prediction from sequence information.

    PubMed

    Kong, Lesheng; Ranganathan, Shoba

    2004-06-01

    The delineation of domain boundaries of a given sequence in the absence of known 3D structures or detectable sequence homology to known domains benefits many areas in protein science, such as protein engineering, protein 3D structure determination and protein structure prediction. With the exponential growth of newly determined sequences, our ability to predict domain boundaries rapidly and accurately from sequence information alone is both essential and critical from the viewpoint of gene function annotation. Anyone attempting to predict domain boundaries for a single protein sequence is invariably confronted with a plethora of databases that contain boundary information available from the internet and a variety of methods for domain boundary prediction. How are these derived and how well do they work? What definition of 'domain' do they use? We will first clarify the different definitions of protein domains, and then describe the available public databases with domain boundary information. Finally, we will review existing domain boundary prediction methods and discuss their strengths and weaknesses.

  14. MIPS: a database for genomes and protein sequences.

    PubMed

    Mewes, H W; Frishman, D; Güldener, U; Mannhaupt, G; Mayer, K; Mokrejs, M; Morgenstern, B; Münsterkötter, M; Rudd, S; Weil, B

    2002-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz-Netzwerk Bioinformatik) networks. The Arabidopsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91-93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155-158; Barker et al. (2001) Nucleic Acids Res., 29, 29-32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de).

  15. Classification of Myoviridae bacteriophages using protein sequence similarity

    PubMed Central

    2009-01-01

    Background We advocate unifying classical and genomic classification of bacteriophages by integration of proteomic data and physicochemical parameters. Our previous application of this approach to the entirely sequenced members of the Podoviridae fully supported the current phage classification of the International Committee on Taxonomy of Viruses (ICTV). It appears that horizontal gene transfer generally does not totally obliterate evolutionary relationships between phages. Results CoreGenes/CoreExtractor proteome comparison techniques applied to 102 Myoviridae suggest the establishment of three subfamilies (Peduovirinae, Teequatrovirinae, the Spounavirinae) and eight new independent genera (Bcep781, BcepMu, FelixO1, HAP1, Bzx1, PB1, phiCD119, and phiKZ-like viruses). The Peduovirinae subfamily, derived from the P2-related phages, is composed of two distinct genera: the "P2-like viruses", and the "HP1-like viruses". At present, the more complex Teequatrovirinae subfamily has two genera, the "T4-like" and "KVP40-like viruses". In the genus "T4-like viruses" proper, four groups sharing >70% proteins are distinguished: T4-type, 44RR-type, RB43-type, and RB49-type viruses. The Spounavirinae contain the "SPO1-" and "Twort-like viruses." Conclusion The hierarchical clustering of these groupings provides biologically significant subdivisions, which are consistent with our previous analysis of the Podoviridae. PMID:19857251

  16. Nucleotide sequence of the gene encoding the nitrogenase iron protein of Thiobacillus ferrooxidans

    SciTech Connect

    Pretorius, I.M.; Rawlings, D.E.; O'Neill, E.G.; Jones, W.A.; Kirby, R.; Woods, D.R.

    1987-01-01

    The DNA sequence was determined for the cloned Thiobacillus ferrooxidans nifH and part of the nifD genes. The DNA chains were radiolabeled with [α-³²P]dCTP (3000 Ci/mmol) or [α-³⁵S]dCTP (400 Ci/mmol). A putative T. ferrooxidans nifH promoter was identified whose sequences showed perfect consensus with those of the Klebsiella pneumoniae nif promoter. Two putative consensus upstream activator sequences were also identified. The amino acid sequence was deduced from the DNA sequence. In a comparison of nifH DNA sequences from T. ferrooxidans and eight other nitrogen-fixing microbes, a Rhizobium sp. isolated from Parasponia andersonii showed the greatest homology (74%) and Clostridium pasteurianum (nifH1) showed the least homology (54%). In the comparison of the amino acid sequences of the Fe proteins, the Rhizobium sp. and Rhizobium japonicum showed the greatest homology (both 86%) and C. pasteurianum (nifH1 gene product) demonstrated the least homology (56%) to the T. ferrooxidans Fe protein.

  17. In silico comparative analysis of DNA and amino acid sequences for prion protein gene.

    PubMed

    Kim, Y; Lee, J; Lee, C

    2008-01-01

    Genetic variability might contribute to species specificity of prion diseases in various organisms. In this study, structures of the prion protein gene (PRNP) and its amino acids were compared among species for which sequence data were available. Comparisons of PRNP DNA sequences among 12 species including human, chimpanzee, monkey, bovine, ovine, dog, mouse, rat, wallaby, opossum, chicken and zebrafish allowed us to identify candidate regulatory regions in intron 1 and the 3'-untranslated region (UTR) in addition to the coding region. Highly conserved putative binding sites for transcription factors, such as heat shock factor 2 (HSF2) and myocyte enhancer factor 2 (MEF2), were discovered in intron 1. In the 3'-UTR, the functional sequence (ATTAAA) for nucleus-specific polyadenylation was found in all the analysed species. The functional sequence (TTTTTAT) for maturation-specific polyadenylation was found intact only in ovine, with one or two nucleotide mismatches in the other species. A comparison of the amino acid sequences in 53 species revealed a large sequence identity. In particular, the octapeptide repeat region was observed in all the species except frog and zebrafish. Functional changes and susceptibility to prion diseases with various isoforms of prion protein could be caused by numeric variability and conformational changes discovered in the repeat sequences.

  18. Protein sequences bound to mineral surfaces persist into deep time

    PubMed Central

    Demarchi, Beatrice; Hall, Shaun; Roncal-Herrero, Teresa; Freeman, Colin L; Woolley, Jos; Crisp, Molly K; Wilson, Julie; Fotakis, Anna; Fischer, Roman; Kessler, Benedikt M; Rakownikow Jersie-Christensen, Rosa; Olsen, Jesper V; Haile, James; Thomas, Jessica; Marean, Curtis W; Parkington, John; Presslee, Samantha; Lee-Thorp, Julia; Ditchfield, Peter; Hamilton, Jacqueline F; Ward, Martyn W; Wang, Chunting Michelle; Shaw, Marvin D; Harrison, Terry; Domínguez-Rodrigo, Manuel; MacPhee, Ross DE; Kwekason, Amandus; Ecker, Michaela; Kolska Horwitz, Liora; Chazan, Michael; Kröger, Roland; Thomas-Oates, Jane; Harding, John H; Cappellini, Enrico; Penkman, Kirsty; Collins, Matthew J

    2016-01-01

    Proteins persist longer in the fossil record than DNA, but the longevity, survival mechanisms and substrates remain contested. Here, we demonstrate the role of mineral binding in preserving the protein sequence in ostrich (Struthionidae) eggshell, including from the palaeontological sites of Laetoli (3.8 Ma) and Olduvai Gorge (1.3 Ma) in Tanzania. By tracking protein diagenesis back in time we find consistent patterns of preservation, demonstrating authenticity of the surviving sequences. Molecular dynamics simulations of struthiocalcin-1 and -2, the dominant proteins within the eggshell, reveal that distinct domains bind to the mineral surface. It is the domain with the strongest calculated binding energy to the calcite surface that is selectively preserved. Thermal age calculations demonstrate that the Laetoli and Olduvai peptides are 50 times older than any previously authenticated sequence (equivalent to ~16 Ma at a constant 10°C). DOI: http://dx.doi.org/10.7554/eLife.17092.001 PMID:27668515

  19. Efficient combination of multiple word models for improved sequence comparison.

    PubMed

    Huang, Xiaoqiu; Ye, Liang; Chou, Hui-Hsien; Yang, I-Hsuan; Chao, Kun-Mao

    2004-11-01

    Studies of efficient and sensitive sequence comparison methods are driven by a need to find homologous regions of weak similarity between large genomes. We describe an improved method for finding similar regions between two sets of DNA sequences. The new method generalizes existing methods by locating word matches between sequences under two or more word models and extending word matches into high-scoring segment pairs (HSPs). The method is implemented as a computer program named DDS2. Experimental results show that DDS2 can find more HSPs by using several word models than by using one word model. The DDS2 program is freely available for academic use in binary code form at http://bioinformatics.iastate.edu/aat/align/align.html and in source code form from the corresponding author.

  20. Nucleotide sequence of a cloned woodchuck hepatitis virus genome: comparison with the hepatitis B virus sequence.

    PubMed Central

    Galibert, F; Chen, T N; Mandart, E

    1982-01-01

    The complete nucleotide sequence of a woodchuck hepatitis virus genome cloned in Escherichia coli was determined by the method of Maxam and Gilbert. This sequence was found to be 3,308 nucleotides long. Potential ATG initiator triplets and nonsense codons were identified and used to locate regions with a substantial coding capacity. A striking similarity was observed between the organization of human hepatitis B virus and woodchuck hepatitis virus. Nucleotide sequences of these open regions in the woodchuck virus were compared with corresponding regions present in hepatitis B virus. This allowed the location of four viral genes on the L strand and indicated the absence of protein coded by the S strand. Evolution rates of the various parts of the genome as well as of the four different proteins coded by hepatitis B virus and woodchuck hepatitis virus were compared. These results indicated that: (i) the core protein has evolved slightly less rapidly than the other proteins; and (ii) when a region of DNA codes for two different proteins, there is less freedom for the DNA to evolve and, moreover, one of the proteins can evolve more rapidly than the other. A hairpin structure, very well conserved in the two genomes, was located in the only region devoid of coding function, suggesting the location of the origin of replication of the viral DNA. PMID:7086958

  1. A logical sequence search for S100B target proteins.

    PubMed Central

    McClintock, K. A.; Shaw, G. S.

    2000-01-01

    The EF-hand calcium-binding protein S100B has been shown to interact in vitro in a calcium-sensitive manner with many substrates. These potential S100B target proteins have been screened for the preservation of a previously identified consensus sequence across species. The results were compared to known structural and in vitro properties of the proteins to rationalize choices for potential binding partners. Our approach uncovered four oligomeric proteins tubulin (alpha and beta), glial fibrillary acidic protein (GFAP), desmin, and vimentin that have conserved regions matching the consensus sequence. In the type III intermediate filament proteins (GFAP, vimentin, and desmin), this region corresponds to a portion of a coiled-coil (helix 2A), the structural element responsible for their assembly. In tubulin, the sequence matches correspond to regions of alpha and beta tubulin found at the alpha beta tubulin interface. In both cases, these consensus sequence matches provide a logical explanation for in vitro observations that S100B is able to inhibit oligomerization of these proteins. PMID:11106180

  2. Increasing Sequence Diversity with Flexible Backbone Protein Design: The Complete Redesign of a Protein Hydrophobic Core

    SciTech Connect

    Murphy, Grant S.; Mills, Jeffrey L.; Miley, Michael J.; Machius, Mischa; Szyperski, Thomas; Kuhlman, Brian

    2015-10-15

    Protein design tests our understanding of protein stability and structure. Successful design methods should allow the exploration of sequence space not found in nature. However, when redesigning naturally occurring protein structures, most fixed backbone design algorithms return amino acid sequences that share strong sequence identity with wild-type sequences, especially in the protein core. This behavior places a restriction on functional space that can be explored and is not consistent with observations from nature, where sequences of low identity have similar structures. Here, we allow backbone flexibility during design to mutate every position in the core (38 residues) of a four-helix bundle protein. Only small perturbations to the backbone, 1-2 Å, were needed to entirely mutate the core. The redesigned protein, DRNN, is exceptionally stable (melting point >140 °C). An NMR and X-ray crystal structure show that the side chains and backbone were accurately modeled (all-atom RMSD = 1.3 Å).

  3. DNA Shape versus Sequence Variations in the Protein Binding Process.

    PubMed

    Chen, Chuanying; Pettitt, B Montgomery

    2016-02-02

    The binding process of a protein with a DNA involves three stages: approach, encounter, and association. It has been known that the complexation of protein and DNA involves mutual conformational changes, especially for a specific sequence association. However, it is still unclear how the conformation and the information in the DNA sequences affect the binding process. What is the extent to which the DNA structure adopted in the complex is induced by protein binding, or is instead intrinsic to the DNA sequence? In this study, we used the multiscale simulation method to explore the binding process of a protein with DNA in terms of DNA sequence, conformation, and interactions. We found that in the approach stage the protein can bind both the major and minor groove of the DNA, but uses different features to locate the binding site. The intrinsic conformational properties of the DNA play a significant role in this binding stage. By comparing the specific DNA with the nonspecific in unbound, intermediate, and associated states, we found that for a specific DNA sequence, ∼40% of the bending in the association forms is intrinsic and that ∼60% is induced by the protein. The protein does not induce appreciable bending of nonspecific DNA. In addition, we proposed that the DNA shape variations induced by protein binding are required in the early stage of the binding process, so that the protein is able to approach, encounter, and form an intermediate at the correct site on DNA.

  4. N-terminal sequence analysis of proteins and peptides.

    PubMed

    Reim, D F; Speicher, D W

    2001-05-01

    Amino-terminal (N-terminal) sequence analysis is used to identify the order of amino acids of proteins or peptides, starting at their N-terminal end. This unit describes the sequence analysis of protein or peptide samples in solution or bound to PVDF membranes using a Perkin-Elmer Procise Sequencer. Sequence analysis of protein or peptide samples in solution or bound to PVDF membranes using a Hewlett-Packard Model G1005A sequencer is also described. Methods are provided for optimizing separation of PTH amino acid derivatives on Perkin-Elmer instruments and for increasing the proportion of sample injected onto the PTH analyzer on older Perkin-Elmer instruments by installing a modified sample loop. The amount of data obtained from a single sequencer run is substantial, and careful interpretation of this data by an experienced scientist familiar with the current operation performance of the instrument used for this analysis is critically important. A discussion of data interpretation is therefore provided. Finally, discussion of optimization of sequencer performance as well as possible solutions to frequently encountered problems is included.

  5. Learning Cellular Sorting Pathways Using Protein Interactions and Sequence Motifs

    PubMed Central

    Lin, Tien-Ho; Bar-Joseph, Ziv

    2011-01-01

    Abstract Proper subcellular localization is critical for proteins to perform their roles in cellular functions. Proteins are transported by different cellular sorting pathways, some of which take a protein through several intermediate locations until reaching its final destination. The pathway a protein is transported through is determined by carrier proteins that bind to specific sequence motifs. In this article, we present a new method that integrates protein interaction and sequence motif data to model how proteins are sorted through these sorting pathways. We use a hidden Markov model (HMM) to represent protein sorting pathways. The model is able to determine intermediate sorting states and to assign carrier proteins and motifs to the sorting pathways. In simulation studies, we show that the method can accurately recover an underlying sorting model. Using data for yeast, we show that our model leads to accurate prediction of subcellular localization. We also show that the pathways learned by our model recover many known sorting pathways and correctly assign proteins to the path they utilize. The learned model identified new pathways and their putative carriers and motifs and these may represent novel protein sorting mechanisms. Supplementary results and software implementation are available from http://murphylab.web.cmu.edu/software/2010_RECOMB_pathways/. PMID:21999284

  6. Using homology relations within a database markedly boosts protein sequence similarity search

    PubMed Central

    Tong, Jing; Sadreyev, Ruslan I.; Pei, Jimin; Kinch, Lisa N.; Grishin, Nick V.

    2015-01-01

    Inference of homology from protein sequences provides an essential tool for analyzing protein structure, function, and evolution. Current sequence-based homology search methods are still unable to detect many similarities evident from protein spatial structures. In computer science a search engine can be improved by considering networks of known relationships within the search database. Here, we apply this idea to protein-sequence–based homology search and show that it dramatically enhances the search accuracy. Our new method, COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query and hit’s known homologs. This approach increases detection quality, boosting the precision rate from 18% to 83% at half-coverage of all database homologs. The increased precision rate allows detection of a large fraction of protein structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities. The web server is available at prodata.swmed.edu/compadre. PMID:26038555

  7. Variability of the protein sequences of lcrV between epidemic and atypical rhamnose-positive strains of Yersinia pestis.

    PubMed

    Anisimov, Andrey P; Panfertsev, Evgeniy A; Svetoch, Tat'yana E; Dentovskaya, Svetlana V

    2007-01-01

    Sequencing of lcrV genes and comparison of the deduced amino acid sequences from ten Y. pestis strains belonging mostly to the group of atypical rhamnose-positive isolates (non-pestis subspecies or pestoides group) showed that the LcrV proteins analyzed could be classified into five sequence types. This classification was based on major amino acid polymorphisms among LcrV proteins in the four "hot points" of the protein sequences. Some additional minor polymorphisms were found throughout these sequence types. The "hot points" corresponded to amino acids 18 (Lys --> Asn), 72 (Lys --> Arg), 273 (Cys --> Ser), and 324-326 (Ser-Gly-Lys --> Arg) in the LcrV sequence of the reference Y. pestis strain CO92. One possible explanation for polymorphism in amino acid sequences of LcrV among different strains is that strain-specific variation resulted from adaptation of the plague pathogen to different rodent and lagomorph hosts.

  8. Comparison of solution-based exome capture methods for next generation sequencing

    PubMed Central

    2011-01-01

    Background Techniques enabling targeted re-sequencing of the protein coding sequences of the human genome on next generation sequencing instruments are of great interest. We conducted a systematic comparison of the solution-based exome capture kits provided by Agilent and Roche NimbleGen. A control DNA sample was captured with all four capture methods and prepared for Illumina GAII sequencing. Sequence data from additional samples prepared with the same protocols were also used in the comparison. Results We developed a bioinformatics pipeline for quality control, short read alignment, variant identification and annotation of the sequence data. In our analysis, a larger percentage of the high quality reads from the NimbleGen captures than from the Agilent captures aligned to the capture target regions. High GC content of the target sequence was associated with poor capture success in all exome enrichment methods. Comparison of mean allele balances for heterozygous variants indicated a tendency to have more reference bases than variant bases in the heterozygous variant positions within the target regions in all methods. There was virtually no difference in the genotype concordance compared to genotypes derived from SNP arrays. A minimum of 11× coverage was required to make a heterozygote genotype call with 99% accuracy when compared to common SNPs on genome-wide association arrays. Conclusions Libraries captured with NimbleGen kits aligned more accurately to the target regions. The updated NimbleGen kit most efficiently covered the exome with a minimum coverage of 20×, yet none of the kits captured all the Consensus Coding Sequence annotated exons. PMID:21955854

  9. Can natural proteins designed with 'inverted' peptide sequences adopt native-like protein folds?

    PubMed

    Sridhar, Settu; Guruprasad, Kunchur

    2014-01-01

    We have carried out a systematic computational analysis on a representative dataset of proteins of known three-dimensional structure, in order to evaluate whether it would be possible to 'swap' certain short peptide sequences in naturally occurring proteins with their corresponding 'inverted' peptides and generate 'artificial' proteins that are predicted to retain a native-like protein fold. The analysis of 3,967 representative proteins from the Protein Data Bank revealed 102,677 unique identical inverted peptide sequence pairs that vary in sequence length between 5-12 and 18 amino acid residues. Our analysis illustrates with examples that such 'artificial' proteins may be generated by identifying peptides with 'similar structural environment' and by using comparative protein modeling and validation studies. Our analysis suggests that natural proteins may be tolerant to accommodating such peptides.
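
    The search for 'inverted' peptide pairs described in this record can be illustrated with a minimal sketch. It assumes that an inverted pair means a peptide of length k whose reversed sequence also occurs somewhere in the dataset, and that palindromic peptides are excluded; the abstract does not state either detail. The function name and the two toy sequences are hypothetical.

```python
def inverted_peptide_pairs(proteins: dict[str, str], k: int) -> set[tuple[str, str]]:
    """Return (peptide, reversed peptide) pairs where both occur in the dataset.

    Assumption: palindromic peptides are skipped and each pair is reported once.
    """
    # Collect every overlapping k-residue peptide found in any protein.
    words = set()
    for seq in proteins.values():
        for i in range(len(seq) - k + 1):
            words.add(seq[i:i + k])

    pairs = set()
    for word in words:
        reverse = word[::-1]
        # word < reverse drops palindromes and keeps one orientation per pair.
        if reverse in words and word < reverse:
            pairs.add((word, reverse))
    return pairs


# Toy sequences (hypothetical, not taken from the Protein Data Bank).
toy = {"P1": "MKVLAGTW", "P2": "QWTGALRR"}
print(inverted_peptide_pairs(toy, k=5))
```

    Storing all peptides in a set keeps the lookup of each reversed peptide at constant time, which matters for a dataset of several thousand proteins.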

  10. Prediction of protein-protein interactions by combining structure and sequence conservation in protein interfaces.

    PubMed

    Aytuna, A Selim; Gursoy, Attila; Keskin, Ozlem

    2005-06-15

    Elucidation of the full network of protein-protein interactions is crucial for understanding of the principles of biological systems and processes. Thus, there is a need for in silico methods for predicting interactions. We present a novel algorithm for automated prediction of protein-protein interactions that employs a unique bottom-up approach combining structure and sequence conservation in protein interfaces. Running the algorithm on a template dataset of 67 interfaces and a sequentially non-redundant dataset of 6170 protein structures, 62 616 potential interactions are predicted. These interactions are compared with the ones in two publicly available interaction databases (Database of Interacting Proteins and Biomolecular Interaction Network Database) and also the Protein Data Bank. A significant number of predictions are verified in these databases. The unverified ones may correspond to (1) interactions that are not covered in these databases but known in literature, (2) unknown interactions that actually occur in nature and (3) interactions that do not occur naturally but may possibly be realized synthetically in laboratory conditions. Some unverified interactions, supported significantly with studies found in the literature, are discussed. Availability: http://gordion.hpc.eng.ku.edu.tr/prism Contact: agursoy@ku.edu.tr; okeskin@ku.edu.tr

  11. Deep sequencing methods for protein engineering and design.

    PubMed

    Wrenbeck, Emily E; Faber, Matthew S; Whitehead, Timothy A

    2016-11-22

    The advent of next-generation sequencing (NGS) has revolutionized protein science, and the development of complementary methods enabling NGS-driven protein engineering has followed. In general, these experiments address the functional consequences of thousands of protein variants in a massively parallel manner using genotype-phenotype linked high-throughput functional screens followed by DNA counting via deep sequencing. We highlight the use of information-rich datasets to engineer protein molecular recognition. Examples include the creation of multiple dual-affinity Fabs targeting structurally dissimilar epitopes and engineering of a broad germline-targeted anti-HIV-1 immunogen. Additionally, we highlight the generation of enzyme fitness landscapes for conducting fundamental studies of protein behavior and evolution. We conclude with discussion of technological advances.

  12. Testing statistical significance scores of sequence comparison methods with structure similarity

    PubMed Central

    Hulsen, Tim; de Vlieg, Jacob; Leunissen, Jack AM; Groenen, Peter MA

    2006-01-01

    Background In the past years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is not only determined by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences. Results All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores. Conclusion The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations give generally better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons. PMID:17038163
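
    A Monte-Carlo Z-score of the kind this record evaluates can be sketched as follows: score the real pair, score the query against many shuffled copies of the second sequence, and express the real score in standard deviations above the shuffled mean. This is only an illustration under stated assumptions. The match/mismatch/gap values, the number of shuffles, and the function names are hypothetical, and the scoring is a toy scheme rather than the substitution matrices or implementations used for CluSTr or in the study.

```python
import random
import statistics


def sw_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def z_score(a: str, b: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """Z-score of the real alignment score against scores for shuffled copies of b."""
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        null_scores.append(sw_score(a, "".join(letters)))
    spread = statistics.pstdev(null_scores)
    if spread == 0:
        raise ValueError("shuffled scores have zero variance; Z-score undefined")
    return (observed - statistics.mean(null_scores)) / spread


query = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary toy sequence
print(z_score(query, query))
```

    Each Z-score costs n_shuffles extra alignments, which is the compute burden the abstract refers to when it calls the Z-score compute-intensive.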

  13. MEME: discovering and analyzing DNA and protein sequence motifs.

    PubMed

    Bailey, Timothy L; Williams, Nadya; Misleh, Chris; Li, Wilfred W

    2006-07-01

    MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel 'signals' in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME searches via the web server hosted by the National Biomedical Computation Resource (http://meme.nbcr.net) and several mirror sites. Through the same web server, users can also access the Motif Alignment and Search Tool to search sequence databases for matches to motifs encoded in several popular formats. By clicking on buttons in the MEME output, users can compare the motifs discovered in their input sequences with databases of known motifs, search sequence databases for matches to the motifs and display the motifs in various formats. This article describes the freely accessible web server and its architecture, and discusses ways to use MEME effectively to find new sequence patterns in biological sequences and analyze their significance.

  14. MEME: discovering and analyzing DNA and protein sequence motifs

    PubMed Central

    Bailey, Timothy L.; Williams, Nadya; Misleh, Chris; Li, Wilfred W.

    2006-01-01

    MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel ‘signals’ in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME searches via the web server hosted by the National Biomedical Computation Resource (http://meme.nbcr.net) and several mirror sites. Through the same web server, users can also access the Motif Alignment and Search Tool to search sequence databases for matches to motifs encoded in several popular formats. By clicking on buttons in the MEME output, users can compare the motifs discovered in their input sequences with databases of known motifs, search sequence databases for matches to the motifs and display the motifs in various formats. This article describes the freely accessible web server and its architecture, and discusses ways to use MEME effectively to find new sequence patterns in biological sequences and analyze their significance. PMID:16845028

  15. Fast comparison of DNA sequences by oligonucleotide profiling

    PubMed Central

    Arnau, Vicente; Gallach, Miguel; Marín, Ignacio

    2008-01-01

    Background The comparison of DNA sequences is a traditional problem in genomics and bioinformatics. Many new opportunities emerge due to the improvement of personal computers, allowing the implementation of novel strategies of analysis. Findings We describe a new program, called UVWORD, which determines the number of times that each DNA word present in a sequence (target) is found in a second sequence (source), a procedure that we have called oligonucleotide profiling. On a standard computer, the user may search for words of a size ranging from k = 1 to k = 14 nucleotides. Average counts for groups of contiguous words may also be established. The rate of analysis on standard computers ranges from 3.4 (k = 14) to 16 million words per second (1 ≤ k ≤ 8). This makes feasible the fast screening of even the longest known DNA molecules. Discussion We show that the ability to analyze relatively long words, which occur very rarely by chance, combined with the speed of the program makes it possible to perform novel types of screening, complementary to those provided by standard programs such as BLAST. This method can be used to determine oligonucleotide content, to characterize the distribution of repetitive sequences in chromosomes, to determine the evolutionary conservation of sequences in different species, to establish regions of similar DNA among chromosomes or genomes, etc. PMID:18710530
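
    The oligonucleotide profiling procedure described in this record (for each word of the target, count its occurrences in the source) can be sketched in a few lines. This is a minimal illustration, not the UVWORD implementation; the function names and the toy sequences are hypothetical, and overlapping words are assumed to be counted.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-letter word in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def oligo_profile(target: str, source: str, k: int) -> list[int]:
    """For each word position in target, return how often that word occurs in source."""
    counts = kmer_counts(source, k)  # index the source once
    target = target.upper()
    # Counter returns 0 for words that never occur in the source.
    return [counts[target[i:i + k]] for i in range(len(target) - k + 1)]


print(oligo_profile("ACGTAC", "ACGACGTTTACG", k=3))  # -> [3, 1, 0, 1]
```

    Indexing the source once and then looking up each target word keeps the work linear in the combined sequence length, which is the property that makes word-based screening fast compared with alignment.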

  16. WildSpan: mining structured motifs from protein sequences

    PubMed Central

    2011-01-01

    Background Automatic extraction of motifs from biological sequences is an important research problem in study of molecular biology. For proteins, it is desired to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually largely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to largely reduce the mining cost. Results WildSpan is shown to efficiently find W-patterns containing conserved residues that are far separated in sequences. We conducted experiments with two mining strategies, protein-based and family-based mining, to evaluate the usefulness of W-patterns and performance of WildSpan. The protein-based mining mode of WildSpan is developed for

  17. WildSpan: mining structured motifs from protein sequences.

    PubMed

    Hsu, Chen-Ming; Chen, Chien-Yu; Liu, Baw-Jhiune

    2011-03-31

    Automatic extraction of motifs from biological sequences is an important research problem in study of molecular biology. For proteins, it is desired to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually largely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to largely reduce the mining cost. WildSpan is shown to efficiently find W-patterns containing conserved residues that are far separated in sequences. We conducted experiments with two mining strategies, protein-based and family-based mining, to evaluate the usefulness of W-patterns and performance of WildSpan. The protein-based mining mode of WildSpan is developed for discovering
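
    The structure of a W-pattern (short pattern blocks separated by large, irregular gaps) can be illustrated by matching one such pattern against a sequence with a regular expression. This sketch covers only matching, not the WildSpan mining algorithm. The block notation (lowercase 'x' for a single wildcard), the gap bounds, and the toy sequence are assumptions made for the example and are not WildSpan's own syntax.

```python
import re


def w_pattern_to_regex(blocks: list[str], gaps: list[tuple[int, int]]) -> re.Pattern:
    """Build a regex from pattern blocks and the (min, max) gap between neighbours.

    Assumption: within a block, lowercase 'x' stands for exactly one wildcard.
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between each pair of blocks")
    parts = []
    for i, block in enumerate(blocks):
        parts.append("".join("." if ch == "x" else re.escape(ch) for ch in block))
        if i < len(gaps):
            low, high = gaps[i]
            parts.append(".{%d,%d}" % (low, high))  # large irregular gap
    return re.compile("".join(parts))


# Hypothetical pattern: block C-x-x-C, then 5 to 12 wildcards, then H-x-x-x-H.
pattern = w_pattern_to_regex(["CxxC", "HxxxH"], [(5, 12)])
match = pattern.search("AACPECGGGGGGGHKLMHAA")
print(match.span() if match else None)  # -> (2, 18)
```

    Keeping the blocks short and letting only the gaps vary in length mirrors the constraint model in the abstract: conserved residues cluster locally, while the distance between clusters is allowed to differ between sequences.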

  18. A gelsolin-related protein from lobster muscle: cloning, sequence analysis and expression.

    PubMed Central

    Lück, A; D'Haese, J; Hinssen, H

    1995-01-01

    The tail muscle of the lobster Homarus americanus contains an actin-binding protein with an apparent molecular mass of 105 kDa determined by SDS/PAGE and gelsolin-like properties. We isolated this protein and peptide sequences were obtained after limited proteolysis with chymotrypsin. A tail-muscle-specific cDNA library was constructed in a lambda expression vector and a full-length clone was obtained by screening with a polyclonal anti-(crustacean gelsolin) antibody. The cDNA insert of approx. 3.2 kb length was sequenced. The cDNA contained an open reading frame of 2.265 kb, and the deduced amino acid sequence of 754 residues (83,469 Da) identified the protein as a cytoplasmic member of the gelsolin/villin protein family. Comparison of the lobster gelsolin amino acid sequence with other members of this protein family revealed the characteristic 6-fold repeated segmental structure as well as the three conserved sequence motifs typical of each segment [Way and Weeds (1988) J. Mol. Biol. 203, 1127-1133]. Strong homologies were found with Drosophila gelsolin, human gelsolin, villin core, Dictyostelium severin and Physarum fragmin. In addition, the gelsolin-like protein from lobster muscle revealed motifs that were clearly similar to the actin-bundling region of human villin headpiece although it did not itself contain a distinct headpiece domain. The recombinant lobster gelsolin-like protein, expressed in Escherichia coli as a fusion protein, was purified from inclusion bodies and renatured as a functional protein. There were no significant differences in the biological activity tested between the recombinant and the native protein isolated from lobster muscle. PMID:7848275

  19. Alignment of Helical Membrane Protein Sequences Using AlignMe

    PubMed Central

    Khafizov, Kamil; Forrest, Lucy R.

    2013-01-01

    Few sequence alignment methods have been designed specifically for integral membrane proteins, even though these important proteins have distinct evolutionary and structural properties that might affect their alignments. Existing approaches typically consider membrane-related information either by using membrane-specific substitution matrices or by assigning distinct penalties for gap creation in transmembrane and non-transmembrane regions. Here, we ask whether favoring matching of predicted transmembrane segments within a standard dynamic programming algorithm can improve the accuracy of pairwise membrane protein sequence alignments. We tested various strategies using a specifically designed program called AlignMe. An updated set of homologous membrane protein structures, called HOMEP2, was used as a reference for optimizing the gap penalties. The best of the membrane-protein optimized approaches were then tested on an independent reference set of membrane protein sequence alignments from the BAliBASE collection. When secondary structure (S) matching was combined with evolutionary information (using a position-specific substitution matrix (P)), in an approach we called AlignMePS, the resultant pairwise alignments were typically among the most accurate over a broad range of sequence similarities when compared to available methods. Matching transmembrane predictions (T), in addition to evolutionary information, and secondary-structure predictions, in an approach called AlignMePST, generally reduces the accuracy of the alignments of closely-related proteins in the BAliBASE set relative to AlignMePS, but may be useful in cases of extremely distantly related proteins for which sequence information is less informative. The open source AlignMe code is available at https://sourceforge.net/projects/alignme/, and at http://www.forrestlab.org, along with an online server and the HOMEP2 data set. PMID:23469223

  20. Multi-species sequence comparison: the next frontier in genome annotation.

    PubMed

    Dubchak, Inna; Frazer, Kelly

    2003-01-01

    Multi-species comparisons of DNA sequences are more powerful for discovering functional sequences than pairwise DNA sequence comparisons. Most current computational tools have been designed for pairwise comparisons, and efficient extension of these tools to multiple species will require knowledge of the ideal evolutionary distance to choose and the development of new algorithms for alignment, analysis of conservation, and visualization of results.

  1. Can Computationally Designed Protein Sequences Improve Secondary Structure Prediction?

    DTIC Science & Technology

    2011-01-01

    SSP. We use the RosettaDesign program to generate sequences that are compatible with the structural classification of proteins (SCOP) database of...1997) using a significantly larger database of known structures than previously reported in the literature. Methods In this work, the Astral SCOP 1.75...6511 SCOP 1.75 domains were used after some domains were discarded due to large missing segments (Nres . 10), non-contiguities in the domain sequence

  2. Using principal component analysis and support vector machine to predict protein structural class for low-similarity sequences via PSSM.

    PubMed

    Zhang, Shengli; Ye, Feng; Yuan, Xiguo

    2012-01-01

    The accurate identification of protein structural class using only information extracted from the protein sequence is a complicated task in current computational biology. Prediction of protein structural class for low-similarity sequences remains a challenging problem. In this study, a new computational method has been developed to predict protein structural class by fusing sequence information and evolutionary information to represent a protein sample. To evaluate the performance of the proposed method, jackknife cross-validation tests are performed on two widely used benchmark data-sets, 1189 and 25PDB, with sequence similarity lower than 40 and 25%, respectively. Comparison of our results with other methods shows that the proposed method is very promising and may provide a cost-effective alternative for predicting protein structural class, in particular for low-similarity data-sets.

  3. Internal organization of large protein families: relationship between the sequence, structure and function based clustering

    PubMed Central

    Cai, Xiao-hui; Jaroszewski, Lukasz; Wooley, John; Godzik, Adam

    2011-01-01

    The protein universe can be organized in families that group proteins sharing common ancestry. Such families display variable levels of structural and functional divergence, from homogenous families, where all members have the same function and very similar structure, to very divergent families, where large variations in function and structure are observed. For practical purposes of structure and function prediction, it would be beneficial to identify sub-groups of proteins with highly similar structures (iso-structural) and/or functions (iso-functional) within divergent protein families. We compared three algorithms in their ability to cluster large protein families and discuss whether any of these methods could reliably identify such iso-structural or iso-functional groups. We show that clustering using profile-sequence and profile-profile comparison methods closely reproduces clusters based on similarities between 3D structures or clusters of proteins with similar biological functions. In contrast, the still commonly used sequence-based methods with fixed thresholds result in vast overestimates of structural and functional diversity in protein families. As a result, these methods also overestimate the number of protein structures that have to be determined to fully characterize structural space of such families. The fact that one can build reliable models based on apparently distantly related templates is crucial for extracting maximal amount of information from new sequencing projects. PMID:21671455

  4. Internal organization of large protein families: relationship between the sequence, structure, and function-based clustering.

    PubMed

    Cai, Xiao-Hui; Jaroszewski, Lukasz; Wooley, John; Godzik, Adam

    2011-08-01

    The protein universe can be organized in families that group proteins sharing common ancestry. Such families display variable levels of structural and functional divergence, from homogenous families, where all members have the same function and very similar structure, to very divergent families, where large variations in function and structure are observed. For practical purposes of structure and function prediction, it would be beneficial to identify sub-groups of proteins with highly similar structures (iso-structural) and/or functions (iso-functional) within divergent protein families. We compared three algorithms in their ability to cluster large protein families and discuss whether any of these methods could reliably identify such iso-structural or iso-functional groups. We show that clustering using profile-sequence and profile-profile comparison methods closely reproduces clusters based on similarities between 3D structures or clusters of proteins with similar biological functions. In contrast, the still commonly used sequence-based methods with fixed thresholds result in vast overestimates of structural and functional diversity in protein families. As a result, these methods also overestimate the number of protein structures that have to be determined to fully characterize structural space of such families. The fact that one can build reliable models based on apparently distantly related templates is crucial for extracting maximal amount of information from new sequencing projects.

  5. Structure and Sequence Search on Aptamer-Protein Docking

    NASA Astrophysics Data System (ADS)

    Xiao, Jiajie; Bonin, Keith; Guthold, Martin; Salsbury, Freddie

    2015-03-01

    Interactions between proteins and deoxyribonucleic acid (DNA) play a significant role in living systems, especially through gene regulation. However, short nucleic acid sequences (aptamers) with specific binding affinity to specific proteins exhibit clinical potential as therapeutics. Our capillary and gel electrophoresis selection experiments show that specific sequences of aptamers can be selected that bind specific proteins. Computationally, given the experimentally determined structure and sequence of a thrombin-binding aptamer, we can successfully dock the aptamer onto thrombin in agreement with experimental structures of the complex. In order to further study the conformational flexibility of this thrombin-binding aptamer and to potentially develop a predictive computational model of aptamer binding, we use GPU-enabled molecular dynamics simulations both to examine the conformational flexibility of the aptamer in the absence of binding to thrombin and to determine our ability to fold an aptamer. This study should help further de novo predictions of aptamer sequences by enabling the study of structural and sequence-dependent effects on aptamer-protein docking specificity.

  6. Sequence Analysis and Evolutionary Studies of Reelin Proteins

    PubMed Central

    Manoharan, Malini; Muhammad, Sayyed Auwn; Sowdhamini, Ramanathan

    2015-01-01

    The reelin gene is conserved across many vertebrate species, including humans. The protein product of this gene plays several important roles in early brain development and regulation of neural network plasticity of a matured brain structure. With an extended structure of 3461 amino acid sequences, consisting of eight reelin repeats, the human reelin sequence stands out as an exceptional model for evolutionary studies. In this study, sequence analysis of the human reelin and its homologues and reelin sequences from 104 other species is described in detail. Interesting sequence conservation patterns of individual repeats have been highlighted. Sequence phylogeny of the reelin sequences indicates a pattern similar to the evolution of the species, thereby serving as a highly conserved family for evolutionary purposes. Multiple sequence alignment of different reelin domain repeats, derived from homologues, suggests specific functions for individual repeats and high sequence conservation across reelin repeats from different organisms, albeit with few unusual domain architectures. A three-dimensional structural model of the full-length human reelin is now available that provides clues on residues at the dimer interface. PMID:26715843

  7. Homolog detection using global sequence properties suggests an alternate view of structural encoding in protein sequences

    PubMed Central

    Scheraga, Harold A.; Rackovsky, S.

    2014-01-01

    We show that a Fourier-based sequence distance function is able to identify structural homologs of target sequences with high accuracy. It is shown that Fourier distances correlate very strongly with independently determined structural distances between molecules, a property of the method that is not attainable using conventional representations. It is further shown that the ability of the Fourier approach to identify protein folds is statistically far in excess of random expectation. It is then shown that, in actual searches for structural homologs of selected target sequences, the Fourier approach gives excellent results. On the basis of these results, we suggest that the global information detected by the Fourier representation is an essential feature of structure encoding in protein sequences and a key to structural homology detection. PMID:24706836

  8. Patterns in protein primary sequences: classification, display and analysis.

    PubMed Central

    Saurugger, P. N.; Metfessel, B. A.

    1991-01-01

    The protein folding code, which is contained in the amino acid chain of a protein, has so far eluded elucidation. However, patterns of hydrophobic residues have previously been identified which show a specificity towards certain secondary structural elements. We are developing an analysis toolkit to find, visualize, and analyze patterns in primary sequences. Preliminary results show that there exist patterns in primary sequences which are useful for predicting the structural class of amino acid chains, performing especially well for the all-alpha helix and all-beta sheet classes. PMID:1807631

  9. Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments

    PubMed Central

    2010-01-01

    Background While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate. Results We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement. The model can also be used to

  10. Extracting protein alignment models from the sequence database.

    PubMed Central

    Neuwald, A F; Liu, J S; Lipman, D J; Lawrence, C E

    1997-01-01

    Biologists often gain structural and functional insights into a protein sequence by constructing a multiple alignment model of the family. Here a program called Probe fully automates this process of model construction starting from a single sequence. Central to this program is a powerful new method to locate and align only those, often subtly, conserved patterns essential to the family as a whole. When applied to randomly chosen proteins, Probe found on average about four times as many relationships as a pairwise search and yielded many new discoveries. These include: an obscure subfamily of globins in the roundworm Caenorhabditis elegans; two new superfamilies of metallohydrolases; a lipoyl/biotin swinging arm domain in bacterial membrane fusion proteins; and a DH domain in the yeast Bud3 and Fus2 proteins. By identifying distant relationships and merging families into superfamilies in this way, this analysis further confirms the notion that proteins evolved from relatively few ancient sequences. Moreover, this method automatically generates models of these ancient conserved regions for rapid and sensitive screening of sequences. PMID:9108146

  11. Sequence and structure conservation in a protein core.

    PubMed

    Rodionov, M A; Blundell, T L

    1998-11-15

    In order to study structural aspects of sequence conservation in families of homologous proteins, we have analyzed structurally aligned sequences of 585 proteins grouped into 128 homologous families. The conservation of a residue in a family is defined as the average residue similarity in a given position of aligned sequences. The residue similarities were expressed in the form of log-odds substitution tables that take into account the environments of amino acids in three-dimensional structures. The protein core is defined as those residues that have less than 7% solvent accessibility. The density of a protein core is described in terms of atom packing, which is investigated as a criterion for residue substitution and conservation. Although there is no significant correlation between sequence conservation and average atom packing around nonpolar residues such as leucine, valine and isoleucine, a significant correlation is observed for polar residues in the protein core. This may be explained by the hydrogen bonds in which polar residues are involved; the better their protection from water access, the more stable the structure should be in that position.
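
    The per-position conservation measure described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes "average residue similarity" means the mean over all residue pairs in an alignment column, and `toy_similarity` is a made-up stand-in for the environment-specific log-odds tables, which the abstract does not reproduce.

```python
from itertools import combinations

def toy_similarity(a: str, b: str) -> float:
    """Hypothetical similarity score; NOT the paper's log-odds tables."""
    if a == b:
        return 4.0
    hydrophobic = set("LIVM")
    if a in hydrophobic and b in hydrophobic:
        return 2.0
    return -1.0

def column_conservation(column, similarity=toy_similarity) -> float:
    """Mean pairwise similarity of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]  # ignore gap characters
    pairs = list(combinations(residues, 2))
    if not pairs:
        return 0.0  # assumption: a column with fewer than two residues scores 0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def alignment_conservation(alignment, similarity=toy_similarity):
    """Conservation score for every column of an equal-length alignment."""
    return [column_conservation(col, similarity) for col in zip(*alignment)]

# Three aligned toy sequences, one column per position.
alignment = ["LKV", "IKV", "LRV"]
scores = alignment_conservation(alignment)
```

    In this toy alignment the fully identical third column receives the highest score; restricting the columns to buried positions (solvent accessibility below 7%) would require structural data that is outside the scope of the sketch.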

  12. Learning to Translate Sequence and Structure to Function: Identifying DNA Binding and Membrane Binding Proteins

    PubMed Central

    Langlois, Robert E; Carson, Matthew B; Bhardwaj, Nitin; Lu, Hui

    2009-01-01

    A protein's function depends in large part on interactions with other molecules. With an increasing number of protein structures becoming available every year, a corresponding structural annotation approach identifying such interactions grows more expedient. At the same time, machine learning has gained popularity in bioinformatics because it provides robust annotation of genes and proteins without depending solely on sequence similarity. Here we developed a machine learning protocol to identify DNA-binding proteins and membrane-binding proteins. In general, there is no theory or even rule of thumb to pick the best machine learning algorithm. Thus, a systematic comparison of several classification algorithms known to perform well was carried out. Indeed, the boosted tree classifier was found to give the best performance, achieving 93% and 88% accuracy in discriminating non-homologous DNA-binding proteins and membrane-binding proteins, respectively, from non-binding proteins, significantly outperforming all previously published works. We also explored the importance of a protein's attributes in function prediction and the relationships between relevant attributes. A graphical model based on boosted trees was applied to study the important features in discriminating DNA-binding proteins. In summary, the current protocol identified physical features important in DNA- and membrane-binding, rather than annotating function through sequence similarity. PMID:17436108

  13. Automatic generation of primary sequence patterns from sets of related protein sequences.

    PubMed Central

    Smith, R F; Smith, T F

    1990-01-01

    We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common "root" pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a "pay once" gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence data base. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern. PMID:2296575
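
    Step (ii) above, choosing the minimal inclusive amino acid class for each aligned pair, can be illustrated with a short sketch. The class hierarchy below is hypothetical (the abstract does not list the classes actually used), the input is assumed to be already aligned and of equal length, and rendering a gapped position as "-" is a simplification of the paper's gap characters.

```python
ALL = frozenset("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical nested hierarchy of amino acid classes (an assumption for
# illustration only); every residue is covered by at least the ALL class.
CLASSES = [
    frozenset("ILV"),
    frozenset("DE"),
    frozenset("KR"),
    frozenset("ST"),
    frozenset("ACFILMVWY"),   # hydrophobic
    frozenset("DEHKR"),       # charged
    frozenset("DEHKNQRST"),   # polar
    ALL,
]

def minimal_class(a: str, b: str) -> frozenset:
    """Smallest class in the hierarchy that contains both residues."""
    pair = frozenset((a, b))
    if len(pair) == 1:
        return pair  # identical residues need no generalisation
    return min((c for c in CLASSES if pair <= c), key=len)

def render(cls: frozenset) -> str:
    """Print a class as a residue, a bracketed set, or X for 'any'."""
    if len(cls) == 1:
        return next(iter(cls))
    if cls == ALL:
        return "X"
    return "[" + "".join(sorted(cls)) + "]"

def covering_pattern(aligned_a: str, aligned_b: str) -> str:
    """Build one covering pattern from two aligned, equal-length sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            out.append("-")  # simplified gap character
        else:
            out.append(render(minimal_class(a, b)))
    return "".join(out)

pattern = covering_pattern("LDKS", "IEKW")
```

    With this toy hierarchy, L/I generalises to the small aliphatic class, D/E to the acidic class, K/K stays a literal residue, and S/W falls through to the any-residue class. Repeating the step up the dendrogram would require merging patterns with patterns, which the sketch does not attempt.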

  14. Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information.

    PubMed

    Ma, Xin; Guo, Jing; Liu, Hong-De; Xie, Jian-Ming; Sun, Xiao

    2012-01-01

    The recognition of DNA-binding residues in proteins is critical to our understanding of the mechanisms of DNA-protein interactions and gene expression, and for guiding drug design. Therefore, a prediction method, DNABR (DNA Binding Residues), is proposed for predicting DNA-binding residues in protein sequences using the random forest (RF) classifier with sequence-based features. Two types of novel sequence features are proposed in this study, which reflect the information about the conservation of physicochemical properties of the amino acids, and the correlation of amino acids between different sequence positions in terms of physicochemical properties. The first type of feature uses the evolutionary information combined with the conservation of physicochemical properties of the amino acids while the second reflects the dependency effect of amino acids with regards to polarity charge and hydrophobic properties in the protein sequences. Those two features and an orthogonal binary vector which reflects the characteristics of 20 types of amino acids are used to build the DNABR, a model to predict DNA-binding residues in proteins. The DNABR model achieves a value of 0.6586 for Matthews correlation coefficient (MCC) and 93.04 percent overall accuracy (ACC) with a 68.47 percent sensitivity (SE) and 98.16 percent specificity (SP), respectively. The comparisons with each feature demonstrate that these two novel features contribute most to the improvement in predictive ability. Furthermore, performance comparisons with other approaches clearly show that DNABR has an excellent prediction performance for detecting binding residues in putative DNA-binding protein. The DNABR web-server system is freely available at http://www.cbi.seu.edu.cn/DNABR/.

  15. Bioinformatics comparison of sulfate-reducing metabolism nucleotide sequences

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Nguyen, A.; Cheung, E.; Sullivan, R.; Holden, T.; Lieberman, D.; Cheung, T.

    2015-09-01

    The sulfate-reducing bacteria can be traced back to 3.5 billion years ago. The thermodynamic details of the sulfur cycle have been well documented. A recent sulfate-reducing bacteria report (Robator, Jungbluth, et al., 2015 Jan, Front. Microbiol) with Genbank nucleotide data has been analyzed in terms of the sulfite reductase (dsrAB) via fractal dimension and entropy values. Comparison to oil field sulfate-reducing sequences was included. The AUCG translational mass fractal dimension versus ATCG transcriptional mass fractal dimension for the low temperature dsrB and dsrA sequences reported in Reference Thirteen shows correlation R-sq ~ 0.79, with a probability of about 3% in simulation. A recent report of using Cystathionine gamma-lyase sequence to produce CdS quantum dots by a biological method, where the sulfur is reduced just like in the H2S production process, was included for comparison. The AUCG mass fractal dimension versus ATCG mass fractal dimension for the Cystathionine gamma-lyase sequences was found to have R-sq of 0.72, similar to the low temperature dissimilatory sulfite reductase dsr group with 3% probability, in contrast to the oil field group having R-sq ~ 0.94, a highly probable outcome in the simulation. The other two simulation histograms, namely, fractal dimension versus entropy R-sq outcome values, and di-nucleotide entropy versus mono-nucleotide entropy R-sq outcome values, are also discussed in the data analysis focusing on low probability outcomes.

  16. MannDB - a microbial database of automated protein sequence analyses and evidence integration for protein characterization.

    PubMed

    Zhou, Carol L Ecale; Lam, Marisa W; Smith, Jason R; Zemla, Adam T; Dyer, Matthew D; Kuczmarski, Thomas A; Vitalis, Elizabeth A; Slezak, Thomas R

    2006-10-17

    -priority agents on the websites of several governmental organizations concerned with bio-terrorism. MannDB provides the user with a BLAST interface for comparison of native and non-native sequences and a query tool for conveniently selecting proteins of interest. In addition, the user has access to a web-based browser that compiles comprehensive and extensive reports. Access to MannDB is freely available at http://manndb.llnl.gov/.

  17. EST2Prot: Mapping EST sequences to proteins

    PubMed Central

    Shafer, Paul; Lin, David M; Yona, Golan

    2006-01-01

    Background EST libraries are used in various biological studies, from microarray experiments to proteomic and genetic screens. These libraries usually contain many uncharacterized ESTs that are typically ignored since they cannot be mapped to known genes. Consequently, new discoveries are possibly overlooked. Results We describe a system (EST2Prot) that uses multiple elements to map EST sequences to their corresponding protein products. EST2Prot uses UniGene clusters, substring analysis, information about protein coding regions in existing DNA sequences and protein database searches to detect protein products related to a query EST sequence. Gene Ontology terms, Swiss-Prot keywords, and protein similarity data are used to map the ESTs to functional descriptors. Conclusion EST2Prot extends and significantly enriches the popular UniGene mapping by utilizing multiple relations between known biological entities. It produces a mapping between ESTs and proteins in real-time through a simple web-interface. The system is part of the Biozon database and is accessible at . PMID:16515706

  18. Cytochrome oxidase subunit III from Arbacia lixula: detection of functional constraints by comparison with homologous sequences.

    PubMed

    De Giorgi, C; Martiradonna, A; Saccone, C

    1993-01-01

    In this paper we report the comparison of the sequences of the cytochrome oxidase subunit III from three different sea urchin species. Both nucleotide and amino acid sequences have been analyzed. The nucleotide sequence analysis reveals that the sea urchin sequences obey some rules already found in mammals. The base substitution analysis carried out on the sequences of the three species pairs shows that the evolutionary dynamics of the first and the second codon positions are so slow that they do not allow a quantitative measurement of their genetic distances, thus demonstrating that in these species, too, the COIII gene is strongly conserved during evolution. Changes occurring at the third codon positions indicate that the three species evolved from a common ancestor under different directional mutational pressure. The multi-alignment of the sea urchin proteins indicates the existence of the amino acid sequence motif N R T that represents a possible glycosylation site. Another glycosylation site has been detected in the mammalian cytochrome oxidase subunit III, in a slightly different position. Such an analysis revealed, for the first time, a new functional aspect of this sequence.

  19. Purification and sequencing of the active site tryptic peptide from penicillin-binding protein 1b of Escherichia coli

    SciTech Connect

    Nicholas, R.A.; Suzuki, H.; Hirota, Y.; Strominger, J.L.

    1985-07-02

    This paper reports the sequence of the active site peptide of penicillin-binding protein 1b from Escherichia coli. Purified penicillin-binding protein 1b was labeled with [14C]penicillin G, digested with trypsin, and partially purified by gel filtration. Upon further purification by high-pressure liquid chromatography, two radioactive peaks were observed, and the major peak, representing over 75% of the applied radioactivity, was submitted to amino acid analysis and sequencing. The sequence Ser-Ile-Gly-Ser-Leu-Ala-Lys was obtained. The active site nucleophile was identified by digesting the purified peptide with aminopeptidase M and separating the radioactive products on high-pressure liquid chromatography. Amino acid analysis confirmed that the serine residue in the middle of the sequence was covalently bonded to the [14C]penicilloyl moiety. A comparison of this sequence to active site sequences of other penicillin-binding proteins and beta-lactamases is presented.

  20. Identification of staphylococcal species based on variations in protein sequences (mass spectrometry) and DNA sequence (sodA microarray).

    PubMed

    Kooken, Jennifer; Fox, Karen; Fox, Alvin; Altomare, Diego; Creek, Kim; Wunschel, David; Pajares-Merino, Sara; Martínez-Ballesteros, Ilargi; Garaizar, Javier; Oyarzabal, Omar; Samadpour, Mansour

    2014-02-01

    This report is among the first using sequence variation in newly discovered protein markers for staphylococcal (or indeed any other bacterial) speciation. Variation, at the DNA sequence level, in the sodA gene (commonly used for staphylococcal speciation) provided excellent correlation. Relatedness among strains was also assessed by protein profiling using microcapillary electrophoresis and pulsed field electrophoresis. A total of 64 strains were analyzed including reference strains representing the 11 staphylococcal species most commonly isolated from man (Staphylococcus aureus and 10 coagulase negative species [CoNS]). Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI TOF MS) and liquid chromatography-electrospray ionization tandem mass spectrometry (LC ESI MS/MS) were used for peptide analysis of proteins isolated from gel bands. Comparison of experimental spectra of unknowns versus spectra of peptides derived from reference strains allowed bacterial identification after MALDI TOF MS analysis. After LC-MS/MS analysis of gel bands, bacterial speciation was performed by comparing experimental spectra versus virtual spectra using the software X!Tandem. Finally, LC-MS/MS was performed on whole proteomes, with data analysis also employing X!Tandem. Aconitate hydratase and oxoglutarate dehydrogenase served as marker proteins on focused analysis after gel separation. Alternatively, on full proteomics analysis, elongation factor Tu generally provided the highest confidence in staphylococcal speciation.

  1. Comparative Studies of Disordered Proteins with Similar Sequences: Application to Aβ40 and Aβ42

    PubMed Central

    Fisher, Charles K.; Ullman, Orly; Stultz, Collin M.

    2013-01-01

    Quantitative comparisons of intrinsically disordered proteins (IDPs) with similar sequences, such as mutant forms of the same protein, may provide insights into IDP aggregation—a process that plays a role in several neurodegenerative disorders. Here we describe an approach for modeling IDPs with similar sequences that simplifies the comparison of the ensembles by utilizing a single library of structures. The relative population weights of the structures are estimated using a Bayesian formalism, which provides measures of uncertainty in the resulting ensembles. We applied this approach to the comparison of ensembles for Aβ40 and Aβ42. Bayesian hypothesis testing finds that although both Aβ species sample β-rich conformations in solution that may represent prefibrillar intermediates, the probability that Aβ42 samples these prefibrillar states is roughly an order of magnitude larger than the frequency in which Aβ40 samples such structures. Moreover, the structure of the soluble prefibrillar state in our ensembles is similar to the experimentally determined structure of Aβ that has been implicated as an intermediate in the aggregation pathway. Overall, our approach for comparative studies of IDPs with similar sequences provides a platform for future studies on the effect of mutations on the structure and function of disordered proteins. PMID:23561531

  2. Neural network and SVM classifiers accurately predict lipid binding proteins, irrespective of sequence homology.

    PubMed

    Bakhtiarizadeh, Mohammad Reza; Moradi-Shahrbabak, Mohammad; Ebrahimi, Mansour; Ebrahimie, Esmaeil

    2014-09-07

    Due to the central roles of lipid binding proteins (LBPs) in many biological processes, sequence based identification of LBPs is of great interest. The major challenge is that LBPs are diverse in sequence, structure, and function, which results in low accuracy of sequence homology based methods. Therefore, there is a need for developing alternative functional prediction methods irrespective of sequence similarity. To distinguish LBPs from non-LBPs, the performances of support vector machine (SVM) and neural network were compared in this study. Comprehensive protein features and various techniques were employed to create datasets. Five-fold cross-validation (CV) and independent evaluation (IE) tests were used to assess the validity of the two methods. The results indicated that SVM outperforms neural network. SVM achieved 89.28% (CV) and 89.55% (IE) overall accuracy in identification of LBPs from non-LBPs and 92.06% (CV) and 92.90% (IE) (on average) for classification of different LBPs classes. Increasing the number and the range of extracted protein features as well as optimization of the SVM parameters significantly increased the efficiency of LBPs class prediction in comparison to the only previous report in this field. Altogether, the results showed that the SVM algorithm can be run on broad, computationally calculated protein features and offers a promising tool in detection of LBPs classes. The proposed approach has the potential to integrate and improve the common sequence alignment based methods. Copyright © 2014 Elsevier Ltd. All rights reserved.

  3. One common structural feature of "words" in protein sequences and human texts.

    PubMed

    Zemková, M; Trifonov, E N; Zahradník, D

    2014-01-01

    The frequently discussed analogy between genetic and human texts is explored by comparing the alternation of polar and non-polar amino-acid residues in proteins with the alternation of consonants and vowels in human texts. In human languages, the usage of possible combinations of consonants and vowels is influenced by pronounceability of the combinations. Similarly, oligopeptide composition of proteins is influenced by requirements of protein folding and stability. One special type of structure often present in proteins is amphipathic α-helices in which polar and non-polar amino acids alternate with a period of 3.5 residues, not unlike alternation of consonants and vowels. In this study, we evaluated the contribution made by amphipathic alternations to the protein sequence texts (20-24%). Their proportion is lower than respective values for alternating words in human texts (57-89%). The proteomes (full sets of proteins for selected organisms) were transformed into ranked sequences of n-grams (words of length n), including periodical amphipathic structures. Similarly, human texts were transformed into sequences of alternating consonants and vowels. Analysis of the vocabularies shows that in both types of texts (human languages and proteins) the alternating words are dominant or highly preferred, thus strengthening the analogy between these two types of texts. The contribution of amphipathic words in the upper parts of the ranked lists for 10 analyzed proteomes varies between 58 and 74%. In human texts respective values range between 90 and 100%.
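
    As a rough illustration of the two-letter view described in this record, the sketch below maps a protein sequence onto a polar/non-polar alphabet and measures how many overlapping n-grams strictly alternate, much like consonant/vowel alternation in a word. The residue classification (`NONPOLAR`), the function names, and the strict-alternation criterion are all assumptions made for this example; the study's own treatment of amphipathic periodicity (period 3.5) is not reproduced here.

```python
# Hypothetical two-class alphabet: which residues count as non-polar is an
# assumption of this sketch, not the classification used in the study.
NONPOLAR = set("AVLIMFWC")


def to_binary(seq):
    """Map a protein sequence to a string of 'n' (non-polar) and 'p' (polar)."""
    return "".join("n" if aa in NONPOLAR else "p" for aa in seq.upper())


def is_alternating(word):
    """True if no two neighbouring symbols in the word are equal."""
    return all(a != b for a, b in zip(word, word[1:]))


def alternating_fraction(seq, n=4):
    """Fraction of overlapping n-grams whose binary form strictly alternates."""
    binary = to_binary(seq)
    words = [binary[i:i + n] for i in range(len(binary) - n + 1)]
    if not words:
        # Sequence shorter than n: there are no n-grams to count.
        return 0.0
    return sum(is_alternating(w) for w in words) / len(words)
```

    Calling `alternating_fraction(sequence, n=4)` on each protein of a proteome and averaging would give one crude analogue of the percentages quoted above, under the stated assumptions.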

  4. Protein structure comparison using the markov transition model of evolution.

    PubMed

    Kawabata, T; Nishikawa, K

    2000-10-01

    A number of automatic protein structure comparison methods have been proposed; however, their similarity score functions are often decided by the researchers' intuition and trial-and-error, and not by theoretical background. We propose a novel theory to evaluate protein structure similarity, which is based on the Markov transition model of evolution. Our similarity score between structures i and j is defined as log P(j --> i)/P(i), where P(j --> i) is the probability that structure j changes to structure i during the evolutionary process, and P(i) is the probability that structure i appears by chance. This is a reasonable definition of structure similarity, especially for finding evolutionarily related (homologous) similarity. The probability P(j --> i) is estimated by the Markov transition model, which is similar to the Dayhoff's substitution model between amino acids. To estimate the parameters of the model, homologous protein structure pairs are collected using sequence similarity, and the numbers of structure transitions within the pairs are counted. Next these numbers are transformed to a transition probability matrix of the Markov transition. Transition probabilities for longer time are obtained by multiplying the probability matrix by itself several times. In this study, we generated three types of structure similarity scores: an environment score, a residue-residue distance score, and a secondary structure elements (SSE) score. Using these scores, we developed the structure comparison program, Matras (MArkovian TRAnsition of protein Structure). It employs a hierarchical alignment algorithm, in which a rough alignment is first obtained by SSEs, and then is improved with more detailed functions. We attempted an all-versus-all comparison of the SCOP database, and evaluated its ability to recognize a superfamily relationship, which was manually assigned to be homologous in the SCOP database. A comparison with the FSSP database shows that our program can
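
    The scoring scheme described in this record (count transitions, convert the counts to a transition probability matrix, multiply the matrix by itself for longer times, then take log P(j --> i)/P(i)) can be sketched in a few lines. This is a minimal illustration on abstract states, not the Matras implementation: the function names, the row-normalization step, and the toy two-state counts are assumptions, and every state is assumed to have at least one observed transition.

```python
import math


def row_normalize(counts):
    """Turn a square matrix of transition counts into row-stochastic probabilities.

    Row j holds transitions *from* state j, so p[j][i] approximates P(j -> i).
    """
    return [[c / sum(row) for c in row] for row in counts]


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(p, steps):
    """Multiply the transition matrix by itself to model longer evolutionary time."""
    result = p
    for _ in range(steps - 1):
        result = matmul(result, p)
    return result


def log_odds(p_t, background, j, i):
    """Similarity score for state j changing to state i: log(P(j -> i) / P(i))."""
    return math.log(p_t[j][i] / background[i])


# Toy example with two structural states (made-up counts).
counts = [[90, 10], [10, 90]]
background = [0.5, 0.5]          # assumed chance probability P(i) of each state
p2 = matrix_power(row_normalize(counts), 2)
```

    With these toy numbers, conserved states score above zero and changed states score below zero, which is the qualitative behaviour a log-odds similarity score is meant to have.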

  5. Educational Software for the Analysis of DNA and Protein Sequences.

    ERIC Educational Resources Information Center

    Maloy, Stanley; Olson, Sue

    1989-01-01

    Describes the development of the microcomputer-based educational software, DNAzoom, which was designed to introduce undergraduates in molecular biology to computer analysis of DNA and protein sequences. Highlights include graphical presentation of data, the functional use of color, a menu-oriented interface, and students' evaluations of the software.…

  6. Data repository mapping for influenza protein sequence analysis

    NASA Astrophysics Data System (ADS)

    Pellegrino, Donald; Chen, Chaomei

    2011-01-01

    This paper introduces a new method for creating an interactive sequence similarity map of all known influenza virus protein sequences and integrating the map with existing general purpose analytical tools. The NCBI data model was designed to provide a high degree of interconnectedness amongst data objects. Substantial and continuous increase in data volume has led to a large and highly connected information space. Researchers seeking to explore this space are challenged to identify a starting point. They often choose data that is popular in the literature. References in the literature follow a power law distribution, and popular data points may bias explorers toward paths that lead only to a dead-end of what is already known. To help discover the unexpected, we developed an interactive visual analytics system to map the information space of influenza protein sequence data. The design is motivated by the needs of eScience researchers.

  7. Protein stability: computation, sequence statistics, and new experimental methods

    PubMed Central

    Magliery, Thomas J.

    2015-01-01

    Calculating protein stability and predicting stabilizing mutations remain exceedingly difficult tasks, largely due to the inadequacy of potential functions, the difficulty of modeling entropy and the unfolded state, and challenges of sampling, particularly of backbone conformations. Yet, computational design has produced some remarkably stable proteins in recent years, apparently owing to near ideality in structure and sequence features. With caveats, computational prediction of stability can be used to guide mutation, and mutations derived from consensus sequence analysis, especially improved by recent co-variation filters, are very likely to stabilize without sacrificing function. The combination of computational and statistical approaches with library approaches, including new technologies such as deep sequencing and high throughput stability measurements, point to a very exciting near term future for stability engineering, even with difficult computational issues remaining. PMID:26497286

  8. ANTHEPROT: an integrated protein sequence analysis software with client/server capabilities.

    PubMed

    Deléage, G; Combet, C; Blanchet, C; Geourjon, C

    2001-07-01

    Programs devoted to the analysis of protein sequences exist either as stand-alone programs or as Web servers. However, stand-alone programs can hardly accommodate analyses that involve comparisons against databanks, which require regular updates. Moreover, Web servers cannot be as efficient as stand-alone programs when dealing with real-time graphic display. We describe here a stand-alone software program called ANTHEPROT, which is intended to perform protein sequence analysis with a high integration level and client/server capabilities. It is an interactive program with a graphical user interface that allows handling of protein sequence and data in a very interactive and convenient manner. It provides many methods and tools, which are integrated into a graphical user interface. ANTHEPROT is available for Windows-based systems. It is able to connect to a Web server in order to perform large-scale sequence comparison on up-to-date databanks. ANTHEPROT is freely available to academic users and may be downloaded at http://pbil.ibcp.fr/ANTHEPROT.

  9. HomPPI: a class of sequence homology based protein-protein interface prediction methods

    PubMed Central

    2011-01-01

    Background Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. Results We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the

  10. HomPPI: a class of sequence homology based protein-protein interface prediction methods.

    PubMed

    Xue, Li C; Dobbs, Drena; Honavar, Vasant

    2011-06-17

    Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the target can be reliably

  11. Amino acid sequence of band-3 protein from rainbow trout erythrocytes derived from cDNA.

    PubMed Central

    Hübner, S; Michel, F; Rudloff, V; Appelhans, H

    1992-01-01

    In this report we present the first complete band-3 cDNA sequence of a poikilothermic lower vertebrate. The primary structure of the anion-exchange protein band 3 (AE1) from rainbow trout erythrocytes was determined by nucleotide sequencing of cDNA clones. The overlapping clones have a total length of 3827 bp with a 5'-terminal untranslated region of 150 bp, a 2754 bp open reading frame and a 3'-untranslated region of 924 bp. Band-3 protein from trout erythrocytes consists of 918 amino acid residues with a calculated molecular mass of 101 827 Da. Comparison of its amino acid sequence revealed a 60-65% identity within the transmembrane spanning sequence of band-3 proteins published so far. An additional insertion of 24 amino acid residues within the membrane-associated domain of trout band-3 protein was identified, which until now was thought to be a general feature only of mammalian band-3-related proteins. PMID:1637296

  12. Biophysical and structural considerations for protein sequence evolution

    PubMed Central

    2011-01-01

    Background Protein sequence evolution is constrained by the biophysics of folding and function, causing interdependence between interacting sites in the sequence. However, current site-independent models of sequence evolution do not take this into account. Recent attempts to integrate the influence of structure and biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected improvements in model performance. This suggests that further innovations are needed for progress in this field. Results Here we develop a coarse-grained physics-based model of protein folding and binding function, and compare it to a popular informational model. We find that both models violate the assumption of the native sequence being close to a thermodynamic optimum, causing directional selection away from the native state. Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that vary less among residue types. The informational model diffuses further in sequence space with fewer barriers and tends to provide less support for an invariant sites model, although amino acid substitutions are generally conservative. Both approaches produce sequences with natural features like dN/dS < 1 and gamma-distributed rates across sites. Conclusions Simple coarse-grained models of protein folding can describe some natural features of evolving proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding energy, and binding function, as regards both native and non-native conformations, and describe a large number of possible applications for such a model. PMID:22171550

  13. Sequence-structure analysis of FAD-containing proteins.

    PubMed

    Dym, O; Eisenberg, D

    2001-09-01

    We have analyzed structure-sequence relationships in 32 families of flavin adenine dinucleotide (FAD)-binding proteins, to prepare for genomic-scale analyses of this family. Four different FAD-family folds were identified, each containing two or more protein families. Three of these families, exemplified by glutathione reductase (GR), ferredoxin reductase (FR), and p-cresol methylhydroxylase (PCMH) were previously defined, and a family represented by pyruvate oxidase (PO) is newly defined. For each of the families, several conserved sequence motifs have been characterized. Several newly recognized sequence motifs are reported here for the PO, GR, and PCMH families. Each FAD fold can be uniquely identified by the presence of distinctive conserved sequence motifs. We also analyzed cofactor properties, some of which are conserved within a family fold while others display variability. Among the conserved properties is cofactor directionality: in some FAD-structural families, the adenine ring of the FAD points toward the FAD-binding domain, whereas in others the isoalloxazine ring points toward this domain. In contrast, the FAD conformation and orientation are conserved in some families while in others they display some variability. Nevertheless, there are clear correlations among the FAD-family fold, the shape of the pocket, and the FAD conformation. Our general findings are as follows: (a) no single protein 'pharmacophore' exists for binding FAD; (b) in every FAD-binding family, the pyrophosphate moiety binds to the most strongly conserved sequence motif, suggesting that pyrophosphate binding is a significant component of molecular recognition; and (c) sequence motifs can identify proteins that bind phosphate-containing ligands.

  14. Comparison of latent and nominal rabbit Ig VHa1 allotype cDNA sequences.

    PubMed

    McCormack, W T; Dhanarajan, P; Roux, K H

    1988-09-15

    The genetic basis for the expression of a latent VH allotype in the rabbit was investigated. VH region cDNA libraries were produced from spleen mRNA derived from a homozygous a2a2 rabbit expressing an induced latent VHa1 allotype and, for comparison, from a normal homozygous a1a1 rabbit expressing nominal VHa1 allotype. The deduced amino acid sequences of the nominal VHa1 cDNA were concordant with previously published VHa1 protein sequences. A comparison of two complete VH-DH-JH and six partial VHa1 sequences reveals highly conserved sequence within VH framework regions (FR) and considerable diversity in complementarity-determining regions and D region sequences. Two functional JH genes or alleles are evident. Amino acid sequencing of the N-terminal 15 residues of pooled affinity-purified latent VHa1 H chain showed complete sequence identity with the nominal VHa1 sequences. Possible latent VHa1-encoding cDNA clones, derived from the a2a2 rabbit, were selected by hybridization with oligonucleotide probes corresponding to the VHa1 allotype-associated segments of the first and third framework regions (FR1 and FR3). cDNA sequence analysis reveals that the 5' untranslated regions of nominal and latent VHa1 cDNA were virtually identical to each other and to previously reported sequences associated with VHa2 and VHa-negative genes. Moreover, some latent VHa1 genes encode FR1 segments that are essentially homologous to the corresponding segment of a nominal VHa1 allotype. In contrast, other putative latent genes display blocks of VHa1 sequence in either FR1 or FR3 that are flanked by blocks of sequence identical to other rabbit VH genes (i.e., VHa2 or VHa-negative). These composite sequences may be directly encoded by composite germ-line VH genes or may be the products of somatically generated recombination or gene conversion between genes encoding latent and nominal allotypes. The data do not support the hypothesis that latent genes are the result of extensive modification

  15. Analysis and organization of protein sequence data: a retrospective spanning four decades.

    PubMed

    Barker, W C; Hunt, L T

    1997-07-01

    Protein sequence data are as useful and valuable today as was envisioned by pioneering sequencers and by the organizers of the first sequence database. Sequence analysis was first the province of specialists who developed search, comparison, and tree-building methods. Microcomputers, communication satellites, and the Internet have made these methods accessible to any scientist. The rapid increase in the data has driven a succession of changes in how databases are compiled, distributed, and accessed. Large public databases have become international collaborations. Although they need to develop still more efficient ways to accumulate, organize, annotate, and standardize huge amounts of data, inadequate support is available for such efforts. Thus there will be greater reliance on direct input from the scientific community. The World Wide Web is essential but not sufficient for integrated access to related databases.

  16. Protein sequence alignment with family-specific amino acid similarity matrices

    PubMed Central

    2011-01-01

    Background Alignment of amino acid sequences by means of dynamic programming is a cornerstone sequence comparison method. The quality of alignments produced by dynamic programming critically depends on the choice of the alignment scoring function. Therefore, for a specific alignment problem one needs a way of selecting the best performing scoring function. This work is focused on the issue of finding optimized protein family- and fold-specific scoring functions for global similarity matrix-based sequence alignment. Findings I utilize a comprehensive set of reference alignments obtained from structural superposition of homologous and analogous proteins to design a quantitative statistical framework for evaluating the performance of alignment scoring functions in global pairwise sequence alignment. This framework is applied to study how existing general-purpose amino acid similarity matrices perform on individual protein families and structural folds, and to compare them to family-specific and fold-specific matrices derived in this work. I describe an adaptive alignment procedure that automatically selects an appropriate similarity matrix and optimized gap penalties based on the properties of the sequences being aligned. Conclusions The results of this work indicate that using family-specific similarity matrices significantly improves the quality of the alignment of homologous sequences over the traditional sequence alignment based on a single general-purpose similarity matrix. However, using fold-specific similarity matrices can only marginally improve sequence alignment of proteins that share the same structural fold but do not share a common evolutionary origin. The family-specific matrices derived in this work and the optimized gap penalties are available at http://taurus.crc.albany.edu/fsm. PMID:21846354
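
    To make concrete how the scoring function enters dynamic-programming alignment, here is a minimal sketch of global (Needleman-Wunsch) alignment scoring in which the similarity matrix is passed in as a function. It is an illustration only: the linear gap penalty, the function names, and the toy match/mismatch scores are assumptions for this example and are not the matrices or gap model derived in the work above.

```python
def global_alignment_score(a, b, score, gap=-2):
    """Optimal global alignment score of sequences a and b.

    `score(x, y)` plays the role of the amino acid similarity matrix;
    `gap` is a linear per-residue gap penalty (an assumption of this sketch).
    """
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    return dp[-1][-1]


def toy_score(x, y):
    """Hypothetical stand-in for a similarity matrix: +2 match, -1 mismatch."""
    return 2 if x == y else -1
```

    Swapping `toy_score` for a lookup into a general-purpose or family-specific matrix, and changing `gap`, changes which alignment is optimal; that dependence is what the evaluation framework in this record sets out to quantify.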

  17. Engineering modular protein interaction switches by sequence overlap.

    PubMed

    Sallee, Nathan A; Yeh, Brian J; Lim, Wendell A

    2007-04-18

    Many cellular signaling pathways contain proteins whose interactions change in response to upstream inputs, allowing for conditional activation or repression of the interaction based on the presence of the input molecule. The ability to engineer similar regulation into protein interaction elements would provide us with powerful tools for controlling cell signaling. Here we describe an approach for engineering diverse synthetic protein interaction switches. Specifically, by overlapping the sequences of pairs of protein interaction domains and peptides, we have been able to generate mutually exclusive regulation over their interactions. Thus, the hybrid protein (which is composed of the two overlapped interaction modules) can bind to either of the two respective ligands for those modules, but not to both simultaneously. We show that these synthetic switch proteins can be used to regulate specific protein-protein interactions in vivo. These switches allow us to disrupt an interaction with the addition or activation of a protein input that has no natural connection to the interaction in question. Therefore, they give us the ability to make novel connections between normally unrelated signaling pathways and to rewire the input/output relationships of cellular behaviors. Our experiments also suggest a possible mechanism by which complex regulatory proteins might have evolved from simpler components.

  18. Sequence heterogeneity accelerates protein search for targets on DNA

    SciTech Connect

    Shvets, Alexey A.; Kolomeisky, Anatoly B.

    2015-12-28

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome.

  19. Sequence heterogeneity accelerates protein search for targets on DNA

    NASA Astrophysics Data System (ADS)

    Shvets, Alexey A.; Kolomeisky, Anatoly B.

    2015-12-01

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome.

  20. Finding sequence motifs in groups of functionally related proteins.

    PubMed

    Smith, H O; Annau, T M; Chandrasegaran, S

    1990-01-01

    We have developed a method for rapidly finding patterns of conserved amino acid residues (motifs) in groups of functionally related proteins. All 3-amino acid patterns in a group of proteins of the type aa1 d1 aa2 d2 aa3, where d1 and d2 are distances that can be varied in a range up to 24 residues, are accumulated into an array. Segments of the proteins containing those patterns that occur most frequently are aligned on each other by a scoring method that obtains an average relatedness value for all the amino acids in each column of the aligned sequence block based on the Dayhoff relatedness odds matrix. The automated method successfully finds and displays nearly all of the sequence motifs that have been previously reported to occur in 33 reverse transcriptases, 18 DNA integrases, and 30 DNA methyltransferases.
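
    The first step of the method, accumulating all patterns of the form aa1 d1 aa2 d2 aa3, can be sketched as follows. This is a hedged illustration rather than the authors' program: the function names are invented, d1 and d2 are taken to be position offsets between consecutive pattern residues (1 to 24), and each pattern is counted once per protein; the subsequent alignment and Dayhoff-based scoring of segments is not shown.

```python
from collections import Counter


def triplet_patterns(seq, max_dist=24):
    """Yield every (aa1, d1, aa2, d2, aa3) pattern present in one sequence."""
    n = len(seq)
    for i in range(n):
        for d1 in range(1, max_dist + 1):
            j = i + d1
            if j >= n:
                break
            for d2 in range(1, max_dist + 1):
                k = j + d2
                if k >= n:
                    break
                yield (seq[i], d1, seq[j], d2, seq[k])


def count_patterns(sequences, max_dist=24):
    """Count, for each pattern, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # set() ensures a pattern is counted once per protein (an assumption).
        counts.update(set(triplet_patterns(seq, max_dist)))
    return counts


# Two made-up sequences sharing the spaced motif A-x-D-x-G.
counts = count_patterns(["AXDYG", "QAZDWG"])
```

    `counts.most_common(10)` then lists the most widely shared patterns, which are the candidates whose surrounding segments would be aligned in the next stage.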

  1. Prediction of Protein Pairs Sharing Common Active Ligands Using Protein Sequence, Structure, and Ligand Similarity.

    PubMed

    Chen, Yu-Chen; Tolbert, Robert; Aronov, Alex M; McGaughey, Georgia; Walters, W Patrick; Meireles, Lidio

    2016-09-26

    We benchmarked the ability of comparative computational approaches to correctly discriminate protein pairs sharing a common active ligand (positive protein pairs) from protein pairs with no common active ligands (negative protein pairs). Since the target and the off-targets of a drug share at least a common ligand, i.e., the drug itself, the prediction of positive protein pairs may help identify off-targets. We evaluated representative protein-centric and ligand-centric approaches, including (1) 2D and 3D ligand similarity, (2) several measures of protein sequence similarity in conjunction with different sequence sources (e.g., full protein sequence versus binding site residues), and (3) a newly described pocket shape similarity and alignment program called SiteHopper. While the sequence-based alignment of pocket residues achieved the best overall performance, SiteHopper outperformed sequence-based approaches for unrelated proteins with only 20-30% pocket residue identity. Analogously, among ligand-centric approaches, path-based fingerprints achieved the best overall performance, but ROCS-based ligand shape similarity outperformed path-based fingerprints for structurally dissimilar ligands (Tanimoto 25%-40%). A significant drop in recognition performance was observed for ligand-centric approaches when PDB ligands were used instead of ChEMBL ligands. Finally, we analyzed the relationship between pocket shape and ligand shape in our data set and found that similar ligands tend to bind to similar pockets while similar pockets may accept a range of different-shaped ligands.

  2. Predict protein structural class for low-similarity sequences by evolutionary difference information into the general form of Chou's pseudo amino acid composition.

    PubMed

    Zhang, Lichao; Zhao, Xiqiang; Kong, Liang

    2014-08-21

    Knowledge of protein structural class plays an important role in characterizing the overall folding type of a given protein. Predicting structural class from sequence information alone remains a challenge for low-similarity sequences. In this study, a novel sequence representation method based on the position-specific scoring matrix is proposed for protein structural class prediction. Using a defined evolutionary difference formula, proteins of varying length are expressed as vectors of uniform dimension, which represent evolutionary difference information between adjacent residues of a given protein. To evaluate the proposed method, a support vector machine and jackknife tests are employed on three widely used datasets, 25PDB, 1189 and 640, with sequence similarity lower than 25%, 40% and 25%, respectively. Comparison with previous methods shows that our method is promising for predicting protein structural class, especially for low-similarity sequences.

  3. Ultra-Fast Evaluation of Protein Energies Directly from Sequence

    PubMed Central

    Grigoryan, Gevorg; Zhou, Fei; Lustig, Steve R; Ceder, Gerbrand; Morgan, Dane; Keating, Amy E

    2006-01-01

    The structure, function, stability, and many other properties of a protein in a fixed environment are fully specified by its sequence, but in a manner that is difficult to discern. We present a general approach for rapidly mapping sequences directly to their energies on a pre-specified rigid backbone, an important sub-problem in computational protein design and in some methods for protein structure prediction. The cluster expansion (CE) method that we employ can, in principle, be extended to model any computable or measurable protein property directly as a function of sequence. Here we show how CE can be applied to the problem of computational protein design, and use it to derive excellent approximations of physical potentials. The approach provides several attractive advantages. First, following a one-time derivation of a CE expansion, the amount of time necessary to evaluate the energy of a sequence adopting a specified backbone conformation is reduced by a factor of 10^7 compared to standard full-atom methods for the same task. Second, the agreement between two full-atom methods that we tested and their CE sequence-based expressions is very high (root mean square deviation 1.1–4.7 kcal/mol, R^2 = 0.7–1.0). Third, the functional form of the CE energy expression is such that individual terms of the expansion have clear physical interpretations. We derived expressions for the energies of three classic protein design targets (a coiled coil, a zinc finger, and a WW domain) as functions of sequence, and examined the most significant terms. Single-residue and residue-pair interactions are sufficient to accurately capture the energetics of the dimeric coiled coil, whereas higher-order contributions are important for the two more globular folds. For the task of designing novel zinc-finger sequences, a CE-derived energy function provides significantly better solutions than a standard design protocol, in comparable computation time. Given these advantages, CE is likely
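
    The idea of evaluating an energy directly from sequence with single-residue and residue-pair terms can be sketched as below. This is only an illustration of the functional form: the function name `ce_energy`, the data layout, and all coefficient values are hypothetical, and higher-order terms are omitted.

```python
def ce_energy(seq, const, site_terms, pair_terms):
    """Evaluate a truncated cluster expansion: constant + single-site + pair terms.

    site_terms maps (position, residue) -> contribution.
    pair_terms maps (pos_i, pos_j) -> {(residue_i, residue_j): contribution}.
    Missing entries contribute zero.
    """
    energy = const
    for i, aa in enumerate(seq):
        energy += site_terms.get((i, aa), 0.0)
    for (i, j), table in pair_terms.items():
        energy += table.get((seq[i], seq[j]), 0.0)
    return energy

# Hypothetical coefficients for a two-residue toy "protein"
site_terms = {(0, "L"): -0.5, (1, "E"): 0.2}
pair_terms = {(0, 1): {("L", "E"): -0.3}}
print(ce_energy("LE", -1.0, site_terms, pair_terms))
```

    Because evaluation reduces to table lookups, the cost per sequence is linear in the number of retained terms, which is consistent with the large speed-up the abstract reports.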

  4. Protein determination: a comparison of several methods.

    PubMed

    Hocman, G; Palkovic, M

    1977-01-01

    A comparison of four methods for the determination of total proteins is presented from the following points of view: sensitivity; specificity; and the amount of work, chemicals, time and equipment needed to perform the determination. The following tests were examined: Tombs' (absorbance at 210 nm); Waddell's (difference in absorbance between 215 and 225 nm); Warburg's (absorbance at 280 nm); and Lowry's (absorbance at 500 nm after the reaction with phenol reagent). The authors recommend Tombs' method as the best of the four for its outstanding sensitivity, specificity and simplicity.

  5. Mathematical Characterization of Protein Sequences Using Patterns as Chemical Group Combinations of Amino Acids

    PubMed Central

    Choudhury, Pabitra Pal; Jana, Siddhartha Sankar

    2016-01-01

    Comparison of amino acid sequence similarity is the fundamental concept behind protein phylogenetic tree construction. By virtue of this method, we can explain the evolutionary relationships, but further explanations are not possible unless sequences are studied through the chemical nature of individual amino acids. Here we develop a new methodology to characterize the protein sequences on the basis of the chemical nature of the amino acids. We design various algorithms for studying the variation of chemical group transitions and various chemical group combinations as patterns in the protein sequences. The amino acid sequences of the conventional myosin II head domain of 14 family members are used to illustrate this new approach. We find two blocks of maximum length 6 aa, ‘FPKATD’ and ‘Y/FTNEKL’, without repetition of the same chemical nature, and one block of maximum length 20 aa with repetition of chemical nature, all of which are common to the 14 members. We also check commonality with another motor protein sub-family kinesin, KIF1A. Based on our analysis we find a common block of length 8 aa both in myosin II and KIF1A. This motif is located in the neck linker region which could be responsible for the generation of mechanical force, enabling us to find the unique blocks which remain chemically conserved across the family. We also validate our methodology with different protein families such as MYOI, Myosin light chain kinase (MLCK) and Rho-associated protein kinase (ROCK), Na+/K+-ATPase and Ca2+-ATPase. Altogether, our studies provide a new methodology for investigating the conserved amino acids’ pattern in different proteins. PMID:27930687

  6. Mathematical Characterization of Protein Sequences Using Patterns as Chemical Group Combinations of Amino Acids.

    PubMed

    Das, Jayanta Kumar; Das, Provas; Ray, Korak Kumar; Choudhury, Pabitra Pal; Jana, Siddhartha Sankar

    2016-01-01

    Comparison of amino acid sequence similarity is the fundamental concept behind protein phylogenetic tree construction. By virtue of this method, we can explain the evolutionary relationships, but further explanations are not possible unless sequences are studied through the chemical nature of individual amino acids. Here we develop a new methodology to characterize the protein sequences on the basis of the chemical nature of the amino acids. We design various algorithms for studying the variation of chemical group transitions and various chemical group combinations as patterns in the protein sequences. The amino acid sequences of the conventional myosin II head domain of 14 family members are used to illustrate this new approach. We find two blocks of maximum length 6 aa, 'FPKATD' and 'Y/FTNEKL', without repetition of the same chemical nature, and one block of maximum length 20 aa with repetition of chemical nature, all of which are common to the 14 members. We also check commonality with another motor protein sub-family kinesin, KIF1A. Based on our analysis we find a common block of length 8 aa both in myosin II and KIF1A. This motif is located in the neck linker region which could be responsible for the generation of mechanical force, enabling us to find the unique blocks which remain chemically conserved across the family. We also validate our methodology with different protein families such as MYOI, Myosin light chain kinase (MLCK) and Rho-associated protein kinase (ROCK), Na+/K+-ATPase and Ca2+-ATPase. Altogether, our studies provide a new methodology for investigating the conserved amino acids' pattern in different proteins.

  7. Phosphorylation of the transit sequence of chloroplast precursor proteins.

    PubMed

    Waegemann, K; Soll, J

    1996-03-15

    A protein kinase was located in the cytosol of pea mesophyll cells. The protein kinase phosphorylates, in an ATP-dependent manner, chloroplast-destined precursor proteins but not precursor proteins that are targeted to plant mitochondria or plant peroxisomes. The phosphorylation occurs on either serine or threonine residues, depending on the precursor protein used. We demonstrate the specific phosphorylation of the precursor forms of the chloroplast stroma proteins ferredoxin (preFd), small subunit of ribulose-bisphosphate-carboxylase (preSSU), the thylakoid localized light-harvesting chlorophyll a/b-binding protein (preLHCP), and the thylakoid lumen-localized proteins of the oxygen-evolving complex of 23 kDa (preOE23) and 33 kDa (preOE33). In the case of thylakoid lumen proteins which possess bipartite transit sequences, the phosphorylation occurs within the stroma-targeting domain. By using single amino acid substitution within the presequences of preSSU, preOE23, and preOE33, we were able to tentatively identify a consensus motif for the precursor protein protein kinase. This motif is (P/G)X(n)(R/K)X(n)(S/T)X(n)(S*/T*), where n = 0-3 amino acid spacer residues and S*/T* represents the phosphate acceptor. The precursor protein protein kinase is present only in plant extracts, e.g. wheat germ and pea, but not in a reticulocyte lysate. Protein import experiments into chloroplasts revealed that phosphorylated preSSU binds to the organelles, but dephosphorylation seems required to complete the translocation process and to obtain complete import. These results suggest that a precursor protein protein phosphatase is involved in chloroplast import and represents a so far unidentified component of the import machinery. In contrast to sucrose synthase, a cytosolic marker protein, the precursor protein protein kinase seems to adhere partially to the chloroplast surface. A phosphorylation-dephosphorylation cycle of chloroplast-destined precursor proteins might represent one step
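
    The consensus motif (P/G)X(n)(R/K)X(n)(S/T)X(n)(S*/T*) with spacers of 0-3 residues translates directly into a regular expression. The sketch below is an illustration only: the function name `has_kinase_motif` and the test sequences are invented, and a regex hit indicates a candidate site, not demonstrated phosphorylation.

```python
import re

# (P/G) X(0-3) (R/K) X(0-3) (S/T) X(0-3) (S*/T*), written as a regular expression.
KINASE_MOTIF = re.compile(r"[PG].{0,3}[RK].{0,3}[ST].{0,3}[ST]")

def has_kinase_motif(presequence):
    """Return True if the sequence contains at least one candidate motif."""
    return KINASE_MOTIF.search(presequence) is not None

# Invented example sequences
print(has_kinase_motif("MAPARKSAST"))  # contains P..R/K..S..S/T within the spacing limits
print(has_kinase_motif("PAAAAKST"))    # spacer between P and K is too long
```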

  8. Sequence peculiarity of gnetalean legumin-like seed storage proteins.

    PubMed

    Shutov, A D; Braun, H; Chesnokov, Y V; Horstmann, C; Kakhovskaya, I A; Bäumlein, H

    1998-10-01

    The development of seeds as a specialized organ for the nutrition, protection, and dispersal of the next generation was an important step in the evolution of land plants. Seed maturation is accompanied by massive synthesis of storage compounds such as proteins, starch, and lipids. To study the processes of seed storage protein evolution we have partially sequenced storage proteins from maturing seeds of representatives from the gymnosperm genera Gnetum, Ephedra, and Welwitschia, morphologically diverse and unusual taxa that are grouped in most formal systems into the common order Gnetales. Based on partial N-terminal amino acid sequences, oligonucleotide primers were derived and used for PCR amplification and cloning of the corresponding cDNAs. We also describe the structure of the nuclear gene for legumin of Welwitschia mirabilis. This first gnetalean nuclear gene structure contains introns in only two of the four conserved positions previously characterized in other spermatophyte legumin genes. The distinct phylogenetic status of the gnetalean taxa is also reflected in a sequence peculiarity of their legumin genes. A comparative analysis of exon/intron sequences leads to the hypothesis that legumin genes from Gnetales belong to a monophyletic evolutionary branch clearly distinct from that of legumin genes of extant Ginkgoales and Coniferales as well as from all angiosperms.

  9. Hinge Atlas: relating protein sequence to sites of structural flexibility

    PubMed Central

    Flores, Samuel C; Lu, Long J; Yang, Julie; Carriero, Nicholas; Gerstein, Mark B

    2007-01-01

    Background Relating features of protein sequences to structural hinges is important for identifying domain boundaries, understanding structure-function relationships, and designing flexibility into proteins. Efforts in this field have been hampered by the lack of a proper dataset for studying characteristics of hinges. Results Using the Molecular Motions Database we have created a Hinge Atlas of manually annotated hinges and a statistical formalism for calculating the enrichment of various types of residues in these hinges. Conclusion We found various correlations between hinges and sequence features. Some of these are expected; for instance, we found that hinges tend to occur on the surface and in coils and turns and to be enriched with small and hydrophilic residues. Others are less obvious and intuitive. In particular, we found that hinges tend to coincide with active sites, but unlike the latter they are not at all conserved in evolution. We evaluate the potential for hinge prediction based on sequence. Motions play an important role in catalysis and protein-ligand interactions. Hinge bending motions comprise the largest class of known motions. Therefore it is important to relate the hinge location to sequence features such as residue type, physicochemical class, secondary structure, solvent exposure, evolutionary conservation, and proximity to active sites. To do this, we first generated the Hinge Atlas, a set of protein motions with the hinge locations manually annotated, and then studied the coincidence of these features with the hinge location. We found that all of the features have bearing on the hinge location. Most interestingly, we found that hinges tend to occur at or near active sites and yet unlike the latter are not conserved. Less surprisingly, we found that hinge residues tend to be small, not hydrophobic or aliphatic, and occur in turns and random coils on the surface. A functional sequence based hinge predictor was made which uses some of the

  10. Identification of Plasmodesmal Localization Sequences in Proteins In Planta.

    PubMed

    Yuan, Cheng; Lazarowitz, Sondra G; Citovsky, Vitaly

    2017-08-15

    Plasmodesmata (Pd) are cell-to-cell connections that function as gateways through which small and large molecules are transported between plant cells. Whereas Pd transport of small molecules, such as ions and water, is presumed to occur passively, cell-to-cell transport of biological macromolecules, such as proteins, most likely occurs via an active mechanism that involves specific targeting signals on the transported molecule. The scarcity of identified plasmodesmata (Pd) localization signals (PLSs) has severely restricted the understanding of protein-sorting pathways involved in plant cell-to-cell macromolecular transport and communication. From a wealth of plant endogenous and viral proteins known to traffic through Pd, only three PLSs have been reported to date, all of them from endogenous plant proteins. Thus, it is important to develop a reliable and systematic experimental strategy to identify a functional PLS sequence that is both necessary and sufficient for Pd targeting, directly in living plant cells. Here, we describe one such strategy using as a paradigm the cell-to-cell movement protein (MP) of the Tobacco mosaic virus (TMV). These experiments, which identified and characterized the first plant viral PLS, can be adapted for discovery of PLS sequences in most Pd-targeted proteins.

  11. Interrogating noise in protein sequences from the perspective of protein-protein interactions prediction.

    PubMed

    Wang, Yongcui; Ren, Xianwen; Zhang, Chunhua; Deng, Naiyang; Zhang, Xiangsun

    2012-12-21

    The past decades witnessed extensive efforts to study the relationships among proteins. In particular, sequence-based prediction of protein-protein interactions (PPIs) is fundamentally important in speeding up the process of mapping interactomes of organisms. High-throughput experimental methodologies have made the PPIs of many model organisms known, which allows us to apply machine learning methods to learn understandable rules from the available PPIs. Under the machine learning framework, composition vectors are usually applied to encode proteins as real-valued vectors. However, the composition vector value might be highly correlated with the distribution of amino acids, i.e., amino acids that are frequently observed in nature tend to have large composition vector values. Thus, a formulation that estimates the noise induced by the background distribution of amino acids may be needed when constructing these representations. Here, we introduce two kinds of denoising composition vectors, which were successfully used in construction of phylogenetic trees, to eliminate the noise. When these two denoising composition vectors are validated on Escherichia coli (E. coli), Saccharomyces cerevisiae (S. cerevisiae) and human PPI datasets, surprisingly, the predictive performance is not improved, and is even worse than that of non-denoised prediction. These results suggest that what counts as noise in phylogenetic tree construction may be valuable information in PPI prediction.
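
    To make the notion of a denoised composition vector concrete, the sketch below compares observed dipeptide frequencies with the frequencies expected from single-residue composition alone. This is a simplified stand-in chosen for illustration, not the two denoising schemes evaluated in the paper; the function name `denoised_dipeptide_vector` and the independence-based background are assumptions.

```python
from collections import Counter

def denoised_dipeptide_vector(seq):
    """Relative deviation of each observed dipeptide frequency from a background.

    Background (assumption): residues occur independently, so the expected
    frequency of dipeptide 'ab' is f(a) * f(b).
    """
    n = len(seq)
    single = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    vector = {}
    for pair, count in pairs.items():
        observed = count / (n - 1)
        expected = (single[pair[0]] / n) * (single[pair[1]] / n)
        vector[pair] = (observed - expected) / expected
    return vector

# Toy sequence over a two-letter alphabet, for illustration only
print(denoised_dipeptide_vector("ABAB"))
```

    A value near zero means the dipeptide is about as frequent as the background predicts; the abstract's finding is that removing this background signal did not help PPI prediction.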

  12. Aligning multiple protein sequences by parallel hybrid genetic algorithm.

    PubMed

    Nguyen, Hung Dinh; Yoshihara, Ikuo; Yamamori, Kunihito; Yasunaga, Moritoshi

    2002-01-01

    This paper presents a parallel hybrid genetic algorithm (GA) for solving the sum-of-pairs multiple protein sequence alignment. A new chromosome representation and its corresponding genetic operators are proposed. A multi-population GENITOR-type GA is combined with local search heuristics. It is then extended to run in parallel on a multiprocessor system for speeding up. Experimental results of benchmarks from the BAliBASE show that the proposed method is superior to MSA, OMA, and SAGA methods with regard to quality of solution and running time. It can be used for finding multiple sequence alignment as well as testing cost functions.
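
    The sum-of-pairs objective that the genetic algorithm optimizes can be sketched as a scoring function over alignment columns. The scores used here (match +1, mismatch -1, residue-gap -2, gap-gap 0) are placeholder assumptions; a real evaluation would use a substitution matrix and the gap model of the benchmark in question.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment):
    """Sum pairwise scores over all columns and all pairs of rows."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total

# Toy alignment of three sequences (invented)
print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```

    In a GA setting, this function would serve as the fitness of a chromosome that encodes one candidate alignment.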

  13. Apple Macintosh programs for nucleic and protein sequence analyses.

    PubMed

    Bellon, B

    1988-03-11

    This paper describes a package of programs for handling and analyzing nucleic acid and protein sequences using the Apple Macintosh microcomputer. There are three important features of these programs: first, because of the now classical Macintosh interface the programs can be easily used by persons with little or no computer experience. Second, it is possible to save all the data, written in an editable scrolling text window or drawn in a graphic window, as files that can be directly used either as word processing documents or as picture documents. Third, sequences can be easily exchanged with any other computer. The package is composed of thirteen programs, written in Pascal programming language.

  14. Molecular cloning and amino acid sequence of human plakoglobin, the common junctional plaque protein

    SciTech Connect

    Franke, W.W.; Goldschmidt, M.D.; Zimbelmann, R.; Mueller, H.M.; Schiller, D.L.; Cowin, P.

    1989-06-01

    Plakoglobin is a major cytoplasmic protein that occurs in a soluble and a membrane-associated form and is the only known constituent common to the submembranous plaques of both kinds of adhering junctions, the desmosomes and the intermediate junctions. Using a partial cDNA clone for bovine plakoglobin, the authors isolated cDNAs encoding human plakoglobin, determined its nucleotide sequence, and deduced the complete amino acid sequence. The polypeptide encoded by the cDNA was synthesized by in vitro transcription and translation and identified by its comigration with authentic plakoglobin in two-dimensional gel electrophoresis. The identity was further confirmed by comparison of the deduced sequence with the directly determined amino acid sequence of two fragments from bovine plakoglobin. Analysis of the plakoglobin sequence showed the protein to be unrelated to any other known proteins, highly conserved between human and bovine tissues, and characterized by numerous changes between hydrophilic and hydrophobic sections. Only one kind of plakoglobin mRNA was found in most tissues, but an additional mRNA was detected in certain human tumor cell lines. This longer mRNA may be represented by a second type of plakoglobin cDNA, which contains an insertion of 297 nucleotides in the 3′ noncoding region.

  15. Sequence analysis and structural implications of rotavirus capsid proteins.

    PubMed

    Parbhoo, N; Dewar, J B; Gildenhuys, S

    Rotavirus is the major cause of severe virus-associated gastroenteritis worldwide in children aged 5 and younger. Many children lose their lives annually due to this infection and the impact is particularly pronounced in developing countries. The mature rotavirus is a non-enveloped triple-layered nucleocapsid containing 11 double stranded RNA segments. Here a global view on the sequence and structure of the three main capsid proteins, VP2, VP6 and VP7, is shown by generating a consensus sequence for each of these rotavirus proteins, for each species, obtained from published data of representative rotavirus genotypes from across the world and across species. Degree of conservation between species was represented on homology models for each of the proteins. VP7 shows the highest level of variation with 14-45 amino acids showing conservation of less than 60%. These changes are localised to the outer surface, alluding to a possible mechanism in evading the immune system. The middle layer, VP6, shows lower variability with only 14-32 sites having lower than 70% conservation. The inner structural layer made up of VP2 showed the lowest variability with only 1-16 sites having less than 70% conservation across species. The results correlate with each protein's multiple structural roles in the infection cycle. Thus, although the nucleotide sequences vary due to the error-prone nature of replication and lack of proof reading, the corresponding amino acid sequences of VP2, VP6 and VP7 remain relatively conserved. Benefits of this knowledge about the conservation include the ability to target proteins at sites that cannot undergo mutational changes without influencing viral fitness, as well as the possibility to study systems that are highly evolved for structure and function in order to determine how to generate and manipulate such systems for use in various biotechnological applications.
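
    Generating a consensus sequence and counting sites below a conservation threshold can be sketched as below, assuming the sequences are already aligned to equal length. The function names and the toy alignment are hypothetical, and conservation is taken here to mean the fraction of sequences carrying the most common residue in a column, which may differ from the definition used in the study.

```python
from collections import Counter

def consensus_and_conservation(aligned):
    """Return the consensus string and per-column conservation fractions."""
    consensus, conservation = [], []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        conservation.append(count / len(column))
    return "".join(consensus), conservation

def variable_sites(conservation, threshold=0.7):
    """Indices of columns whose conservation falls below the threshold."""
    return [i for i, c in enumerate(conservation) if c < threshold]

# Toy alignment (invented); real input would be aligned VP2, VP6 or VP7 sequences
cons, scores = consensus_and_conservation(["MKV", "MKI", "MRV", "MKV"])
print(cons, scores, variable_sites(scores, threshold=0.8))
```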

  16. Predicting the tolerated sequences for proteins and protein interfaces using RosettaBackrub flexible backbone design.

    PubMed

    Smith, Colin A; Kortemme, Tanja

    2011-01-01

    Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s) are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface), interactions between and within parts of the structure (e.g. domains) can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.

  17. Alternative evolutionary histories in the sequence space of an ancient protein.

    PubMed

    Starr, Tyler N; Picton, Lora K; Thornton, Joseph W

    2017-09-13

    To understand why molecular evolution turned out as it did, we must characterize not only the path that evolution followed across the space of possible molecular sequences but also the many alternative trajectories that could have been taken but were not. A large-scale comparison of real and possible histories would establish whether the outcome of evolution represents an optimal state driven by natural selection or the contingent product of historical chance events; it would also reveal how the underlying distribution of functions across sequence space shaped historical evolution. Here we combine ancestral protein reconstruction with deep mutational scanning to characterize alternative histories in the sequence space around an ancient transcription factor, which evolved a novel biological function through well-characterized mechanisms. We find hundreds of alternative protein sequences that use diverse biochemical mechanisms to perform the derived function at least as well as the historical outcome. These alternatives all require prior permissive substitutions that do not enhance the derived function, but not all require the same permissive changes that occurred during history. We find that if evolution had begun from a different starting point within the network of sequences encoding the ancestral function, outcomes with different genetic and biochemical forms would probably have resulted; this contingency arises from the distribution of functional variants in sequence space and epistasis between residues. Our results illuminate the topology of the vast space of possibilities from which history sampled one path, highlighting how the outcome of evolution depends on a serial chain of compounding chance events.

  18. A Comparison of Rosetta Stones in Adapter Protein Families

    PubMed Central

    Kumar, Hulikal Shivashankara Santosh; Kumar, Vadlapudi

    2016-01-01

    The inventory of proteins used in different kingdoms appears surprisingly similar in all sequenced eukaryotic genomes. Protein domains represent the basic evolutionary units that form proteins. Domain duplication and shuffling by recombination are probably the most important forces driving protein evolution and hence the complexity of the proteome. While the duplication of whole genes as well as domain encoding exons increases the abundance of domains in the proteome, domain shuffling increases versatility, i.e. the number of distinct contexts in which a domain can occur. In this study we considered five important adapter domain families, namely WD40, KELCH, Ankyrin, PDZ and Pleckstrin Homology (PH domain), for the comparison of domain versatility, abundance and domain sharing between them. We applied ecological statistics methods such as Jaccard’s Similarity Index (JSI), Detrended Correspondence Analysis and k-Means clustering to the domain distribution data. We found a high propensity for domain sharing between PH and PDZ. We found higher abundance of only a few selected domains in the PH, PDZ, ANK and KELCH families. We also found that the WD40 family has high versatility and less redundant domain occurrence, with less domain sharing. Hence, assigning functions to more orphan WD40 proteins will help in the identification of suitable drug targets. PMID:28246462
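
    Jaccard's Similarity Index, used above to quantify domain sharing, is the size of the intersection of two sets divided by the size of their union. In the sketch below the partner-domain sets are invented placeholders, not data from the study.

```python
def jaccard_index(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical sets of partner domains co-occurring with each adapter family
ph_partners = {"SH2", "SH3", "C2"}
pdz_partners = {"SH3", "C2", "LIM"}
print(jaccard_index(ph_partners, pdz_partners))
```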

  19. FAB overlapping: a strategy for sequencing homologous proteins

    NASA Astrophysics Data System (ADS)

    Ferranti, P.; Malorni, A.; Marino, G.; Pucci, P.; di Luccia, A.; Ferrara, L.

    1991-12-01

    Extensive similarity has been shown to exist between the primary structures of closely related proteins from different species, the only differences being restricted to a few amino acid variations. A new mass spectrometric procedure, which has been called FAB-overlapping, has been developed for sequencing highly homologous proteins based on the detection of these small differences as compared with a known protein used as a reference. Several complementary peptide maps are constructed using fast atom bombardment mass spectrometry (FAB-MS) analysis of different proteolytic digests of the unknown protein and the mass values are related to those expected on the basis of the sequence of the reference protein. The mass signals exhibiting unusual mass values identify those regions where variations have taken place; fine location of the mutations can be obtained by coupling simple protein chemistry methodologies with FAB-MS. Using the FAB-overlapping procedure, it was possible to determine the sequence of α1, α3 and β globins from water buffalo (Bubalus bubalis) hemoglobins (phenotype AA). Two amino acid substitutions were detected in the buffalo β chain (Lys16 → His and Asn118 → His), whereas the α1 and α3 chains were found to contain four amino acid replacements, three of which were identical (Glu23 → Asp, Glu71 → Gly, Phe117 → Cys), and the insertion of an alanine residue in position 124. The only differences between α1 and α3 globins were identified in the C-terminal region; α1 contains a Phe residue at position 130 whereas α3 shows serine at position 132.
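
    The core arithmetic behind relating an unusual peptide mass to a substitution can be sketched with standard monoisotopic residue masses. This is an illustration of the mass-difference reasoning only, not the published procedure: the function names and the tolerance are assumptions, and FAB-MS signals are typically protonated ions, which does not affect mass differences between variant and reference peptides.

```python
# Standard monoisotopic residue masses in Da (amino acid minus water)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def substitution_shift(old, new):
    """Mass change (Da) of a peptide when residue `old` is replaced by `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]

def candidate_substitutions(delta, tol=0.02):
    """All single substitutions whose mass shift matches `delta` within `tol`."""
    return [(a, b) for a in RESIDUE_MASS for b in RESIDUE_MASS
            if a != b and abs(substitution_shift(a, b) - delta) <= tol]

print(round(substitution_shift("K", "H"), 3))  # Lys -> His, as in the beta chain
print(candidate_substitutions(8.964, tol=0.01))
```

    Several substitutions can share nearly the same mass shift (Leu and Ile are identical in mass), which is why the abstract pairs the mass maps with protein chemistry to pin down the exact position and residue.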

  20. Sequence comparison and classification of beet luteovirus isolates.

    PubMed

    de Miranda, J R; Stevens, M; de Bruyne, E; Smith, H G; Bird, C; Hull, R

    1995-01-01

    Three distinct sequence groups were found among partial nucleotide sequences of 38 isolates of beet western yellows virus (BWYV) and beet mild yellowing virus (BMYV) from Europe, Iran and the USA. The first group contains both sugar beet and oilseed rape specific isolates, and the differentiating characteristics linked to this host range specificity are two single base pair changes in a 1,200 nucleotide region of the genome. It is proposed that the European BWYV strains that can be transferred at low frequency between rape and sugar beet belong to this group. Also belonging to this group are the published BWYV sequences of Veidt et al. and of the California BWYV-ST9 isolate. The second group contains mostly rape-derived isolates which have an intergenic region highly distinct from that of group-1 isolates but similar polymerase and coat protein regions. It is proposed that the rape-specific BWYV isolates which cannot be transmitted to sugar beet belong to this group. The third group contains mostly beet-specific isolates from Southern Europe and Iran, and may be adapted to the Mediterranean climate and flora. It is distinct from groups 1 and 2 in all three genome regions investigated and its polymerase and intergenic regions are as much related to those of potato leafroll virus (PLRV) and cucurbit aphid borne yellows virus (CABYV) as they are to those of group-1 and group-2. On the basis of sequence similarities and established nomenclature it is proposed to use BWYV for groups 1 and 2 (BWYV-1 and BWYV-2 respectively) and to use BMYV for group-3 isolates, which are distinct enough from the other two groups to merit a separate nomenclature.

  1. AlignBucket: a tool to speed up 'all-against-all' protein sequence alignments optimizing length constraints.

    PubMed

    Profiti, Giuseppe; Fariselli, Piero; Casadio, Rita

    2015-12-01

    The next-generation sequencing era requires reliable, fast and efficient approaches for the accurate annotation of the ever-increasing number of biological sequences and their variations. Transfer of annotation upon similarity search is a standard approach. The procedure of all-against-all protein comparison is a preliminary step of different available methods that annotate sequences based on information already present in databases. Given the actual volume of sequences, methods are necessary to pre-process data to reduce the time of sequence comparison. We present an algorithm that optimizes the partition of a large volume of sequences (the whole database) into sets where sequence length values (in residues) are constrained depending on a bounded minimal and expected alignment coverage. The idea is to optimally group protein sequences according to their length, and then compute the all-against-all sequence alignments among sequences that fall in a selected length range. We describe a mathematically optimal solution and we show that our method leads to a 5-fold speed-up in real world cases. The software is available for downloading at http://www.biocomp.unibo.it/∼giuseppe/partitioning.html. Supplementary data are available at Bioinformatics online.
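
    The length constraint behind this approach can be illustrated with a simple filter: a pair of sequences can only reach a given alignment coverage on both sequences if the shorter one is at least that fraction of the longer one. The sketch below enumerates such pairs. It is not the AlignBucket partitioning algorithm, which the abstract describes as an optimal bucketing; the function name, the coverage value, and the toy lengths are assumptions.

```python
def length_compatible_pairs(lengths, min_coverage=0.9):
    """List pairs (shorter, longer) whose length ratio allows the required coverage.

    `lengths` maps sequence id -> length in residues.
    """
    order = sorted(lengths, key=lengths.get)  # ids from shortest to longest
    pairs = []
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            if lengths[order[a]] < min_coverage * lengths[order[b]]:
                break  # every later sequence is longer still, so stop early
            pairs.append((order[a], order[b]))
    return pairs

# Invented sequence lengths
toy = {"p1": 100, "p2": 105, "p3": 200, "p4": 210}
print(length_compatible_pairs(toy))
```

    Only the listed pairs would need to be aligned, which is the source of the saving over a naive all-against-all run.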

  2. Incremental Window-based Protein Sequence Alignment Algorithms

    DTIC Science & Technology

    2006-03-23

    Huzefa Rangwala and George Karypis, March 23, 2006... Then it performs a series of iterations in which it performs the following three steps: First, it extracts from ’ the residue-pair with the highest

  3. Can Computationally Designed Protein Sequences Improve Secondary Structure Prediction?

    DTIC Science & Technology

    2011-01-01

    with the structural classification of proteins (SCOP) database of known structural domains (Kuhlman and Baker, 2000; Rohl et al., 2004). Secondary...reported in the literature. Methods In this work, the Astral SCOP 1.75 (Murzin et al., 1995; Hubbard et al., 1999) structural domain database filtered...entry matching the query test sequence can be left out. A total of 6511 SCOP 1.75 domains were used after some domains were discarded due to large

  4. Quantitative assessment of RNA-protein interactions with high-throughput sequencing-RNA affinity profiling.

    PubMed

    Ozer, Abdullah; Tome, Jacob M; Friedman, Robin C; Gheba, Dan; Schroth, Gary P; Lis, John T

    2015-08-01

    Because RNA-protein interactions have a central role in a wide array of biological processes, methods that enable a quantitative assessment of these interactions in a high-throughput manner are in great demand. Recently, we developed the high-throughput sequencing-RNA affinity profiling (HiTS-RAP) assay that couples sequencing on an Illumina GAIIx genome analyzer with the quantitative assessment of protein-RNA interactions. This assay is able to analyze interactions between one or possibly several proteins with millions of different RNAs in a single experiment. We have successfully used HiTS-RAP to analyze interactions of the EGFP and negative elongation factor subunit E (NELF-E) proteins with their corresponding canonical and mutant RNA aptamers. Here we provide a detailed protocol for HiTS-RAP that can be completed in about a month (8 d hands-on time). This includes the preparation and testing of recombinant proteins and DNA templates, clustering DNA templates on a flowcell, HiTS and protein binding with a GAIIx instrument, and finally data analysis. We also highlight aspects of HiTS-RAP that can be further improved and points of comparison between HiTS-RAP and two other recently developed methods, quantitative analysis of RNA on a massively parallel array (RNA-MaP) and RNA Bind-n-Seq (RBNS), for quantitative analysis of RNA-protein interactions.

  5. Functional analysis of bipartite begomovirus coat protein promoter sequences

    SciTech Connect

    Lacatus, Gabriela; Sunter, Garry

    2008-06-20

    We demonstrate that the AL2 gene of Cabbage leaf curl virus (CaLCuV) activates the CP promoter in mesophyll and acts to derepress the promoter in vascular tissue, similar to that observed for Tomato golden mosaic virus (TGMV). Binding studies indicate that sequences mediating repression and activation of the TGMV and CaLCuV CP promoter specifically bind different nuclear factors common to Nicotiana benthamiana, spinach and tomato. However, chromatin immunoprecipitation demonstrates that TGMV AL2 can interact with both sequences independently. Binding of nuclear protein(s) from different crop species to viral sequences conserved in both bipartite and monopartite begomoviruses, including TGMV, CaLCuV, Pepper golden mosaic virus and Tomato yellow leaf curl virus suggests that bipartite begomoviruses bind common host factors to regulate the CP promoter. This is consistent with a model in which AL2 interacts with different components of the cellular transcription machinery that bind viral sequences important for repression and activation of begomovirus CP promoters.

  6. Graphical representation and mathematical characterization of protein sequences and applications to viral proteins.

    PubMed

    Ghosh, Ambarnil; Nandy, Ashesh

    2011-01-01

    Graphical representation and numerical characterization (GRANCH) of nucleotide and protein sequences is a new field that is showing a lot of promise in analysis of such sequences. While formulation and applications of GRANCH techniques for DNA/RNA sequences started just over a decade ago, analyses of protein sequences by these techniques are of more recent origin. The emphasis is still on developing the underlying technique, but significant results have been achieved in using these methods for protein phylogeny, mass spectral data of proteins and protein serum profiles in parasites, toxicoproteomics, determination of different indices for use in QSAR studies, among others. We briefly mention these in this chapter, with some details on protein phylogeny and viral diseases. In particular, we cover a systematic method developed in GRANCH to determine conserved surface exposed peptide segments in selected viral proteins that can be used for drug and vaccine targeting. The new GRANCH techniques and applications for DNAs and proteins are covered briefly to provide an overview to this nascent field.

  7. Successful Recovery of Nuclear Protein-Coding Genes from Small Insects in Museums Using Illumina Sequencing

    PubMed Central

    Dasenko, Mark A.

    2015-01-01

    In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles

  8. Properties of Sequence Conservation in Upstream Regulatory and Protein Coding Sequences among Paralogs in Arabidopsis thaliana

    NASA Astrophysics Data System (ADS)

    Richardson, Dale N.; Wiehe, Thomas

    Whole genome duplication (WGD) has catalyzed the formation of new species, genes with novel functions, altered expression patterns, complexified signaling pathways and has provided organisms a level of genetic robustness. We studied the long-term evolution and interrelationships of 5’ upstream regulatory sequences (URSs), protein coding sequences (CDSs) and expression correlations (EC) of duplicated gene pairs in Arabidopsis. Three distinct methods revealed significant evolutionary conservation between paralogous URSs and were highly correlated with microarray-based expression correlation of the respective gene pairs. Positional information on exact matches between sequences unveiled the contribution of micro-chromosomal rearrangements on expression divergence. A three-way rank analysis of URS similarity, CDS divergence and EC uncovered specific gene functional biases. Transcription factor activity was associated with gene pairs exhibiting conserved URSs and divergent CDSs, whereas a broad array of metabolic enzymes was found to be associated with gene pairs showing diverged URSs but conserved CDSs.

  9. Protein-Sol: a web tool for predicting protein solubility from sequence.

    PubMed

    Hebditch, Max; Carballo-Amador, M Alejandro; Charonis, Spyros; Curtis, Robin; Warwicker, Jim

    2017-10-01

    Protein solubility is an important property in industrial and therapeutic applications. Prediction is a challenge, despite a growing understanding of the relevant physicochemical properties. Protein-Sol is a web server for predicting protein solubility. Using available data for Escherichia coli protein solubility in a cell-free expression system, 35 sequence-based properties are calculated. Feature weights are determined from separation of low and high solubility subsets. The model returns a predicted solubility and an indication of the features which deviate most from average values. Two other properties are profiled in windowed calculation along the sequence: fold propensity, and net segment charge. The utility of these additional features is demonstrated with the example of thioredoxin. The Protein-Sol webserver is available at http://protein-sol.manchester.ac.uk. Contact: jim.warwicker@manchester.ac.uk.
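
    A windowed calculation of net segment charge, as mentioned in this abstract, might look like the sketch below. The charge assignments (K and R as +1, D and E as -1, everything else neutral) and the default window length are illustrative assumptions, not the settings used by Protein-Sol.

```python
from typing import List

# Assumed side-chain charges at neutral pH (illustrative; histidine treated as neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def windowed_net_charge(sequence: str, window: int = 21) -> List[int]:
    """Net charge of every full window of length `window` along the sequence."""
    if window <= 0:
        raise ValueError("window must be positive")
    charges = [CHARGE.get(residue, 0) for residue in sequence.upper()]
    if len(charges) < window:
        return []
    total = sum(charges[:window])
    profile = [total]
    # Slide the window one residue at a time, updating the running total.
    for i in range(window, len(charges)):
        total += charges[i] - charges[i - window]
        profile.append(total)
    return profile


if __name__ == "__main__":
    print(windowed_net_charge("KKDEA", window=2))
```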

  10. The S-layer protein from Campylobacter rectus: sequence determination and function of the recombinant protein.

    PubMed

    Miyamoto, M; Maeda, H; Kitanaka, M; Kokeguchi, S; Takashiba, S; Murayama, Y

    1998-09-15

    The gene encoding the crystalline surface layer (S-layer) protein from Campylobacter rectus, designated slp, was sequenced and the recombinant gene product was expressed in Escherichia coli. The gene consisted of 4086 nucleotides encoding a protein with 1361 amino acids. The N-terminal amino acid sequence revealed that Slp did not contain a signal sequence, but that the initial methionine residue was processed. The deduced amino acid sequence displayed some common characteristic features of S-layer proteins previously reported. A homology search showed a high similarity to the Campylobacter fetus S-layer proteins, especially in their N-terminus. The C-terminal third of Slp exhibited homology with the RTX toxins from Gram-negative bacteria via the region including the glycine-rich repeats. The Slp protein had the same N-terminal sequence as a 104-kDa cytotoxin isolated from the culture supernatants of C. rectus. However, neither native nor recombinant Slp showed cytotoxicity against HL-60 cells or human peripheral white blood cells. These data support the idea that the N-terminus acts as an anchor to the cell surface components and that the C-terminus is involved in the assembly and/or transport of the protein.

  11. Increasing Sequence Diversity with Flexible Backbone Protein Design: The Complete Redesign of a Protein Hydrophobic Core

    PubMed Central

    Murphy, Grant S.; Mills, Jeffrey L.; Miley, Michael J.; Machius, Mischa; Szyperski, Thomas; Kuhlman, Brian

    2012-01-01

    Summary Protein design tests our understanding of protein stability and structure. Successful design methods should allow the exploration of sequence space not found in nature. However, when redesigning naturally occurring protein structures most fixed backbone design algorithms return amino acid sequences that share strong sequence identity with wild-type sequences, especially in the protein core. This behavior places a restriction on functional space that can be explored and is not consistent with observations from nature, where sequences of low identity have similar structures. Here, we allow backbone flexibility during design to mutate every position in the core (38 residues) of a four-helix bundle protein. Only small perturbations to the backbone, 1-2 Å, were needed to entirely mutate the core. The redesigned protein, DRNN, is exceptionally stable (melting point > 140 °C). An NMR and X-ray crystal structure show that the side chains and backbone were accurately modeled (all-atom RMSD = 1.3 Å). PMID:22632833

  12. A general sequence processing and analysis program for protein engineering.

    PubMed

    Stafford, Ryan L; Zimmerman, Erik S; Hallam, Trevor J; Sato, Aaron K

    2014-10-27

    Protein engineering projects often amass numerous raw DNA sequences, but no readily available software combines sequence processing and activity correlation required for efficient lead identification. XLibraryDisplay is an open source program integrated into Microsoft Excel for Windows that automates batch sequence processing via a simple step-by-step, menu-driven graphical user interface. XLibraryDisplay accepts any DNA template which is used as a basis for trimming, filtering, translating, and aligning hundreds to thousands of sequences (raw, FASTA, or Phred PHD file formats). Key steps for library characterization through lead discovery are available including library composition analysis, filtering by experimental data, graphing and correlating to experimental data, alignment to structural data extracted from PDB files, and generation of PyMOL visualization scripts. Though larger data sets can be handled, the program is best suited for analyzing approximately 10 000 or fewer leads or naïve clones which have been characterized using Sanger sequencing and other experimental approaches. XLibraryDisplay can be downloaded for free from sourceforge.net/projects/xlibrarydisplay/ .

  13. A rhodopsin-like protein in Cyanophora paradoxa: gene sequence and protein immunolocalization.

    PubMed

    Frassanito, Anna Maria; Barsanti, Laura; Passarelli, Vincenzo; Evangelista, Valtere; Gualtieri, Paolo

    2010-03-01

    Here, we report the DNA sequence of the rhodopsin gene in the alga Cyanophora paradoxa (Glaucophyta). The primers were designed according to the conserved regions of prokaryotic and eukaryotic rhodopsin-like proteins deposited in the GenBank. The sequence consists of 1,272 bp comprised of 5 introns. The correspondent protein, named Cyanophopsin, showed high identity to rhodopsin-like proteins of Archaea, Bacteria, Fungi, and Algae. At the N-terminus, the protein is characterized by a region with no transmembrane alpha-helices (80 aa), followed by a region with 7 alpha-helices (219 aa) and a shorter 35-aa C-terminal region. The DNA sequence of the N-terminal region was expressed in E. coli and the recombinant purified peptide was used as antigen in hens to obtain polyclonal antibodies. Indirect immunofluorescence in C. paradoxa cells showed a marked labeling of the muroplast (aka cyanelle) membrane.

  14. Prediction of protein antigenic determinants from amino acid sequences

    SciTech Connect

    Hopp, T.P.; Woods, K.R.

    1981-06-01

    A method is presented for locating protein antigenic determinants by analyzing amino acid sequences in order to find the point of greatest local hydrophilicity. This is accomplished by assigning each amino acid a numerical value (hydrophilicity value) and then repetitively averaging these values along the peptide chain. The point of highest local average hydrophilicity is invariably located in, or immediately adjacent to, an antigenic determinant. It was found that the prediction success rate depended on averaging group length, with hexapeptide averages yielding optimal results. The method was developed using 12 proteins for which extensive immunochemical analysis has been carried out and subsequently was used to predict antigenic determinants for the following proteins: hepatitis B surface antigen, influenza hemagglutinins, fowl plague virus hemagglutinin, human histocompatibility antigen HLA-B7, human interferons, Escherichia coli and cholera enterotoxins, ragweed allergens Ra3 and Ra5, and streptococcal M protein. The hepatitis B surface antigen sequence was synthesized by chemical means and was shown to have antigenic activity by radioimmunoassay.
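
    The averaging procedure described in this abstract can be sketched as a sliding hexapeptide mean. The hydrophilicity values below are the Hopp-Woods scale as it is commonly tabulated; they are reproduced here as an assumption and should be checked against the original paper before any serious use. The function names are hypothetical.

```python
from typing import List, Tuple

# Hydrophilicity values as commonly tabulated for the Hopp-Woods scale
# (assumption: verify against the original publication).
HYDROPHILICITY = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5, "W": -3.4,
}


def hydrophilicity_profile(sequence: str, window: int = 6) -> List[float]:
    """Average hydrophilicity of every full window along the chain."""
    values = [HYDROPHILICITY[residue] for residue in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def most_hydrophilic_window(sequence: str, window: int = 6) -> Tuple[int, str, float]:
    """Return (start index, peptide, average) for the highest-scoring window."""
    profile = hydrophilicity_profile(sequence, window)
    if not profile:
        raise ValueError("sequence is shorter than the window")
    start = max(range(len(profile)), key=profile.__getitem__)
    return start, sequence[start:start + window], profile[start]


if __name__ == "__main__":
    print(most_hydrophilic_window("LLLLLLKDEKRDLLLL"))
```

    The default window of six residues follows the abstract's observation that hexapeptide averages gave the best results.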

  15. CISAPS: Complex Informational Spectrum for the Analysis of Protein Sequences.

    PubMed

    Chrysostomou, Charalambos; Seker, Huseyin; Aydin, Nizamettin

    2015-01-01

    Complex informational spectrum analysis for protein sequences (CISAPS) and its web-based server are developed and presented. Recent studies show that using only the absolute spectrum in informational spectrum analysis of protein sequences is insufficient. Therefore, CISAPS is developed to consider and provide results in three forms: absolute, real, and imaginary spectrum. Biologically related features, as shown in the case study of influenza A subtypes presented here, can also appear individually in either the real or the imaginary spectrum. As the results show, protein classes can present similarities or differences according to the features extracted from the CISAPS web server. These associations are likely to be related to the protein feature that the specific amino acid index represents. In addition, various technical issues such as zero-padding and windowing that may affect the analysis are also addressed. CISAPS uses an expanded list of 611 unique amino acid indices, each representing a different property, to perform the analysis. This web-based server enables researchers with little knowledge of signal processing methods to apply complex informational spectrum analysis in their work.
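
    The three spectrum forms named in this abstract can be illustrated with a plain discrete Fourier transform. The sketch below is not the CISAPS implementation: it takes any residue-to-number mapping as the amino acid index, omits zero-padding and windowing, and skips the zero-frequency term. The two-letter toy index in the usage example is invented purely for demonstration.

```python
import cmath
from typing import Dict, List, Tuple


def informational_spectrum(sequence: str,
                           index: Dict[str, float]) -> List[Tuple[float, float, float]]:
    """Return (absolute, real, imaginary) DFT values for frequencies 1..N//2.

    `index` maps each residue to a number (an amino acid index); which index
    to use is left to the caller.
    """
    signal = [index[residue] for residue in sequence.upper()]
    n = len(signal)
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append((abs(coeff), coeff.real, coeff.imag))
    return spectrum


if __name__ == "__main__":
    toy_index = {"A": 1.0, "G": 0.0}  # hypothetical values, for illustration only
    for k, (absolute, real, imag) in enumerate(
            informational_spectrum("AGAGAGAG", toy_index), start=1):
        print(k, round(absolute, 6), round(real, 6), round(imag, 6))
```

    For the strictly alternating toy sequence the only non-negligible peak falls at the highest frequency, which reflects its period of two residues.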

  16. CISAPS: Complex Informational Spectrum for the Analysis of Protein Sequences

    PubMed Central

    Seker, Huseyin; Aydin, Nizamettin

    2015-01-01

    Complex informational spectrum analysis for protein sequences (CISAPS) and its web-based server are developed and presented. Recent studies show that using only the absolute spectrum in informational spectrum analysis of protein sequences is insufficient. Therefore, CISAPS is developed to consider and provide results in three forms: absolute, real, and imaginary spectrum. Biologically related features, as shown in the case study of influenza A subtypes presented here, can also appear individually in either the real or the imaginary spectrum. As the results show, protein classes can present similarities or differences according to the features extracted from the CISAPS web server. These associations are likely to be related to the protein feature that the specific amino acid index represents. In addition, various technical issues such as zero-padding and windowing that may affect the analysis are also addressed. CISAPS uses an expanded list of 611 unique amino acid indices, each representing a different property, to perform the analysis. This web-based server enables researchers with little knowledge of signal processing methods to apply complex informational spectrum analysis in their work. PMID:25632276

  17. PROFESS: a PROtein Function, Evolution, Structure and Sequence database

    PubMed Central

    Triplet, Thomas; Shortridge, Matthew D.; Griep, Mark A.; Stark, Jaime L.; Powers, Robert; Revesz, Peter

    2010-01-01

    The proliferation of biological databases and the easy access enabled by the Internet is having a beneficial impact on biological sciences and transforming the way research is conducted. There are ∼1100 molecular biology databases dispersed throughout the Internet. To assist in the functional, structural and evolutionary analysis of the abundant number of novel proteins continually identified from whole-genome sequencing, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships. A fundamental component of this approach is the development of an intuitive query system that incorporates a variety of similarity functions capable of generating data relationships not conceived during the creation of the database. The utility of PROFESS is demonstrated by the analysis of the structural drift of homologous proteins and the identification of potential pancreatic cancer therapeutic targets based on the observation of protein–protein interaction networks. Database URL: http://cse.unl.edu/∼profess/ PMID:20624718

  18. Common recognition principles across diverse sequence and structural families of sialic acid binding proteins.

    PubMed

    Bhagavat, Raghu; Chandra, Nagasuma

    2014-01-01

    Sialic acids form a large family of 9-carbon monosaccharides and are integral components of glycoconjugates. They are known to bind to a wide range of receptors belonging to diverse sequence families and fold classes and are key mediators in a plethora of cellular processes. Thus, it is of great interest to understand the features that give rise to such a recognition capability. Structural analyses using a non-redundant data set of known sialic acid binding proteins were carried out, which included exhaustive binding site comparisons and site alignments using in-house algorithms, followed by clustering and tree computation, and which have led to the derivation of sialic acid recognition principles. Although the proteins in the data set belong to several sequence and structure families, their binding sites could be grouped into only six types. Structural comparison of the binding sites indicates that all sites contain one or more different combinations of key structural features over a common scaffold. The six binding site types thus serve as structural motifs for recognizing sialic acid. Scanning the motifs against a non-redundant set of binding sites from PDB indicated the motifs to be specific for sialic acid recognition. Knowledge of determinants obtained from this study will be useful for detecting function in unknown proteins. As an example analysis, a genome-wide scan for the motifs in structures of the Mycobacterium tuberculosis proteome identified 17 hits that contain combinations of the features, suggesting a possible function of sialic acid binding by these proteins.

  19. Isolation and characterization of adrenoleukodystrophy protein (ALDP) related sequences in the human genome

    SciTech Connect

    Geraghty, M.T.; Stetten, G.; Kearns, W.

    1994-09-01

    X-linked adrenoleukodystrophy (ALD) is a disorder of peroxisomal β-oxidation of very long chain fatty acids. It presents either as progressive dementia in childhood or as progressive paraparesis in later years. Adrenal insufficiency occurs in both phenotypes. The gene of the ALD protein has been mapped to Xq28 and has recently been cloned and characterized. The ALD protein has significant homology to the peroxisomal membrane protein, PMP70 and belongs to the ATP binding cassette superfamily of transporters. We screened a human genomic library with an ALDP cDNA and isolated 5 different but highly similar clones containing sequences corresponding to the 3′ end of the ALDP gene. Comparison of the sequences over the region corresponding to exon 9 through the 3′ end of the ALDP gene reveals approximately 96% nucleotide identity in both exonic and intronic regions. Splice sites and open reading frames are maintained. Using both FISH and human-rodent DNA mapping panels, we positively assign these ALDP-related sequences to chromosomes 2, 16 and 22, and provisionally to 1 and 20. Southern blot of primate DNA probed with a partial ALDP cDNA (exon 2-10) shows that expansion of ALDP-related sequences occurred in higher primates (chimp, gorilla and human). Although Northern blots show multiple ALDP-hybridizing transcripts in certain tissues, we have no evidence to date for expression of these ALDP-related sequences. In conclusion, our data show there has been an unusual and recent dispersal to multiple chromosomes of structural gene sequences related to the ALDP gene. The functional significance of these sequences remains to be determined but their existence complicates PCR and mutation analysis of the ALDP gene.

  20. A sequence alignment-independent method for protein classification.

    PubMed

    Vries, John K; Munshi, Rajan; Tobi, Dror; Klein-Seetharaman, Judith; Benos, Panayiotis V; Bahar, Ivet

    2004-01-01

    Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams (20^4) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4
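
    The 4-gram representation underlying this approach is easy to sketch. The code below only extracts contiguous 4-grams and reports which ones two sequences share. It does not implement the Bayesian model or the family probes described in the abstract, and the function names and example sequences are hypothetical.

```python
from collections import Counter
from typing import Set

# 20 standard amino acids, so 20**4 = 160,000 possible 4-grams.
NUM_POSSIBLE_4GRAMS = 20 ** 4


def kgram_counts(sequence: str, k: int = 4) -> Counter:
    """Count every contiguous k-residue word in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def shared_kgrams(seq_a: str, seq_b: str, k: int = 4) -> Set[str]:
    """k-grams that occur at least once in both sequences."""
    return set(kgram_counts(seq_a, k)) & set(kgram_counts(seq_b, k))


if __name__ == "__main__":
    print(NUM_POSSIBLE_4GRAMS)
    print(kgram_counts("ACDEFACDE").most_common(1))
    print(shared_kgrams("ACDEFG", "XXCDEFYY"))
```

    Because the space of possible 4-grams is so large relative to a typical protein length, a shared 4-gram between two short unrelated sequences is unlikely by chance, which is the intuition the abstract builds on.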

  1. Identification of Sequences Encoding Symbiodinium minutum Mitochondrial Proteins.

    PubMed

    Butterfield, Erin R; Howe, Christopher J; Nisbet, R Ellen R

    2016-01-21

    The dinoflagellates are an extremely diverse group of algae closely related to the Apicomplexa and the ciliates. Much work has previously been undertaken to determine the presence of various biochemical pathways within dinoflagellate mitochondria. However, these studies were unable to identify several key transcripts including those encoding proteins involved in the pyruvate dehydrogenase complex, iron-sulfur cluster biosynthesis, and protein import. Here, we analyze the draft nuclear genome of the dinoflagellate Symbiodinium minutum, as well as RNAseq data to identify nuclear genes encoding mitochondrial proteins. The results confirm the presence of a complete tricarboxylic acid cycle in the dinoflagellates. Results also demonstrate the difficulties in using the genome sequence for the identification of genes due to the large number of introns, but show that it is highly useful for the determination of gene duplication events.

  2. Identification of Sequences Encoding Symbiodinium minutum Mitochondrial Proteins

    PubMed Central

    Butterfield, Erin R.; Howe, Christopher J.; Nisbet, R. Ellen R.

    2016-01-01

    The dinoflagellates are an extremely diverse group of algae closely related to the Apicomplexa and the ciliates. Much work has previously been undertaken to determine the presence of various biochemical pathways within dinoflagellate mitochondria. However, these studies were unable to identify several key transcripts including those encoding proteins involved in the pyruvate dehydrogenase complex, iron–sulfur cluster biosynthesis, and protein import. Here, we analyze the draft nuclear genome of the dinoflagellate Symbiodinium minutum, as well as RNAseq data to identify nuclear genes encoding mitochondrial proteins. The results confirm the presence of a complete tricarboxylic acid cycle in the dinoflagellates. Results also demonstrate the difficulties in using the genome sequence for the identification of genes due to the large number of introns, but show that it is highly useful for the determination of gene duplication events. PMID:26798115

  3. Sequences of the recA gene and protein.

    PubMed

    Sancar, A; Stachelek, C; Konigsberg, W; Rupp, W D

    1980-05-01

    We have determined the nucleotide sequence of the recA gene of Escherichia coli; this permits the formulation of the primary structure for the recA protein. This structure is consistent with the amino acid composition of the tryptic peptides obtained from the recA protein. The coding region of the recA gene has 1059 base pairs, which specify 352 amino acids. The recA protein has alanine and phenylalanine as its NH2- and COOH-terminal amino acids, respectively, and has the following amino acid composition: Cys3 Asp20 Asn15 Met9 Thr17 Ser20 Glu30 Gln13 Pro10 Gly35 Ala38 Val22 Ile27 Leu31 Tyr7 Phe10 His2 Lys27 Trp2 Arg14. Of the three cysteine residues, only two can be alkylated under reducing and denaturing conditions. The molecular weight of the recA polypeptide is 37,842.
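
    The figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below parses the composition string as quoted, sums the residue counts, and divides the coding length by three. The helper name `parse_composition` is hypothetical. Note that 1059 bp corresponds to 353 codons, one more than the 352 residues reported; the abstract does not state what accounts for the extra codon, so the code only reports the difference.

```python
import re
from typing import Dict

COMPOSITION_TEXT = (
    "Cys3 Asp20 Asn15 Met9 Thr17 Ser20 Glu30 Gln13 Pro10 Gly35 "
    "Ala38 Val22 Ile27 Leu31 Tyr7 Phe10 His2 Lys27 Trp2 Arg14"
)
CODING_REGION_BP = 1059
REPORTED_RESIDUES = 352


def parse_composition(text: str) -> Dict[str, int]:
    """Turn 'Cys3 Asp20 ...' into {'Cys': 3, 'Asp': 20, ...}."""
    return {name: int(count)
            for name, count in re.findall(r"([A-Z][a-z]{2})(\d+)", text)}


if __name__ == "__main__":
    composition = parse_composition(COMPOSITION_TEXT)
    total = sum(composition.values())
    codons = CODING_REGION_BP // 3
    print("amino acid types:", len(composition))
    print("residues from composition:", total)
    print("codons in coding region:", codons)
    print("codons minus reported residues:", codons - REPORTED_RESIDUES)
```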

  4. Synthesis of peptide sequences derived from fibril-forming proteins.

    PubMed

    Scanlon, Denis B; Karas, John A

    2011-01-01

    The pathogenesis of a large number of diseases, including Alzheimer's Disease, Parkinson's Disease, and Creutzfeldt-Jakob Disease (CJD), is associated with protein aggregation and the formation of amyloid, fibrillar deposits. Peptide fragments of amyloid-forming proteins have been found to form fibrils in their own right and have become important tools for unlocking the mechanism of amyloid fibril formation and the pathogenesis of amyloid diseases. The synthesis and purification of peptide sequences derived from amyloid fibril-forming proteins can be extremely challenging. The synthesis may not proceed well, generating a very low quality crude product which can be difficult to purify. Even clean crude peptides can be difficult to purify, as they are often insoluble or form fibrils rapidly in solution. This chapter presents methods to recognise and to overcome the difficulties associated with the synthesis, and purification of fibril-forming peptides, illustrating the points with three synthetic examples.

  5. DNA topology confers sequence specificity to nonspecific architectural proteins.

    PubMed

    Wei, Juan; Czapla, Luke; Grosner, Michael A; Swigon, David; Olson, Wilma K

    2014-11-25

    Topological constraints placed on short fragments of DNA change the disorder found in chain molecules randomly decorated by nonspecific, architectural proteins into tightly organized 3D structures. The bacterial heat-unstable (HU) protein builds up, counter to expectations, in greater quantities and at particular sites along simulated DNA minicircles and loops. Moreover, the placement of HU along loops with the "wild-type" spacing found in the Escherichia coli lactose (lac) and galactose (gal) operons precludes access to key recognition elements on DNA. The HU protein introduces a unique spatial pathway in the DNA upon closure. The many ways in which the protein induces nearly the same closed circular configuration point to the statistical advantage of its nonspecificity. The rotational settings imposed on DNA by the repressor proteins, by contrast, introduce sequential specificity in HU placement, with the nonspecific protein accumulating at particular loci on the constrained duplex. Thus, an architectural protein with no discernible DNA sequence-recognizing features becomes site-specific and potentially assumes a functional role upon loop formation. The locations of HU on the closed DNA reflect long-range mechanical correlations. The protein responds to DNA shape and deformability—the stiff, naturally straight double-helical structure—rather than to the unique features of the constituent base pairs. The structures of the simulated loops suggest that HU architecture, like nucleosomal architecture, which modulates the ability of regulatory proteins to recognize their binding sites in the context of chromatin, may influence repressor-operator interactions in the context of the bacterial nucleoid.

  6. Size and sequence and the volume change of protein folding.

    PubMed

    Rouget, Jean-Baptiste; Aksel, Tural; Roche, Julien; Saldana, Jean-Louis; Garcia, Angel E; Barrick, Doug; Royer, Catherine A

    2011-04-20

    The application of hydrostatic pressure generally leads to protein unfolding, implying, in accordance with Le Chatelier's principle, that the unfolded state has a smaller molar volume than the folded state. However, the origin of the volume change upon unfolding, ΔV(u), has yet to be determined. We have examined systematically the effects of protein size and sequence on the value of ΔV(u) using as a model system a series of deletion variants of the ankyrin repeat domain of the Notch receptor. The results provide strong evidence in support of the notion that the major contributing factor to pressure effects on proteins is their imperfect internal packing in the folded state. These packing defects appear to be specifically localized in the 3D structure, in contrast to the uniformly distributed effects of temperature and denaturants that depend upon hydration of exposed surface area upon unfolding. Given its local nature, the extent to which pressure globally affects protein structure can inform on the degree of cooperativity and long-range coupling intrinsic to the folded state. We also show that the energetics of the protein's conformations can significantly modulate their volumetric properties, providing further insight into protein stability.
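    The Le Chatelier argument in this abstract can be written out with standard equilibrium thermodynamics. The block below is a textbook relation, not a result taken from the paper; the symbols $\Delta G_u$ (unfolding free energy), $K_u$ (unfolding equilibrium constant), and $\Delta V_u$ (written ΔV(u) in the abstract) are notation chosen here for illustration.

```latex
\left(\frac{\partial \Delta G_u}{\partial p}\right)_T = \Delta V_u,
\qquad
\Delta G_u = -RT \ln K_u
\;\;\Longrightarrow\;\;
\left(\frac{\partial \ln K_u}{\partial p}\right)_T = -\frac{\Delta V_u}{RT}
```

    Under these definitions, a negative $\Delta V_u$ makes $\ln K_u$ increase with pressure, so raising the pressure shifts the equilibrium toward the unfolded state, which is the behaviour the abstract describes.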

  7. Probing Protein Sequences as Sources for Encrypted Antimicrobial Peptides

    PubMed Central

    Brand, Guilherme D.; Magalhães, Mariana T. Q.; Tinoco, Maria L. P.; Aragão, Francisco J. L.; Nicoli, Jacques; Kelly, Sharon M.; Cooper, Alan; Bloch, Carlos

    2012-01-01

    Starting from the premise that a wealth of potentially biologically active peptides may lurk within proteins, we describe here a methodology to identify putative antimicrobial peptides encrypted in protein sequences. Candidate peptides were identified using a new screening procedure based on physicochemical criteria to reveal matching peptides within protein databases. Fifteen such peptides, along with a range of natural antimicrobial peptides, were examined using DSC and CD to characterize their interaction with phospholipid membranes. Principal component analysis of DSC data shows that the investigated peptides group according to their effects on the main phase transition of phospholipid vesicles, and that these effects correlate both to antimicrobial activity and to the changes in peptide secondary structure. Consequently, we have been able to identify novel antimicrobial peptides from larger proteins not hitherto associated with such activity, mimicking endogenous and/or exogenous microorganism enzymatic processing of parent proteins to smaller bioactive molecules. A biotechnological application for this methodology is explored. Soybean (Glycine max) plants, transformed to include a putative antimicrobial protein fragment encoded in its own genome were tested for tolerance against Phakopsora pachyrhizi, the causative agent of the Asian soybean rust. This procedure may represent an inventive alternative to the transgenic technology, since the genetic material to be used belongs to the host organism and not to exogenous sources. PMID:23029273
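    The idea of screening protein sequences for encrypted peptides by physicochemical criteria can be sketched as a sliding-window scan. This is only an illustration: the abstract does not state which criteria were used, so the two filters below (net charge and hydrophobic fraction), the window size, the thresholds, and the function names are all assumptions made for this example.

```python
# Hypothetical screening sketch: criteria and thresholds are illustrative
# assumptions, not the procedure used in the study.
HYDROPHOBIC = set("AILMFVW")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def net_charge(peptide: str) -> int:
    """Crude net charge: +1 per Lys/Arg, -1 per Asp/Glu."""
    return sum(r in POSITIVE for r in peptide) - sum(r in NEGATIVE for r in peptide)


def hydrophobic_fraction(peptide: str) -> float:
    """Fraction of residues belonging to the hydrophobic set."""
    return sum(r in HYDROPHOBIC for r in peptide) / len(peptide)


def scan_protein(protein: str, window: int = 12,
                 min_charge: int = 3, min_hydrophobic: float = 0.4):
    """Return (start, peptide) for every window passing both filters."""
    hits = []
    for start in range(len(protein) - window + 1):
        peptide = protein[start:start + window]
        if (net_charge(peptide) >= min_charge
                and hydrophobic_fraction(peptide) >= min_hydrophobic):
            hits.append((start, peptide))
    return hits


# Toy sequence: a cationic, amphipathic-looking stretch flanked by acidic residues.
toy_protein = "DDEEDDEE" + "KKLLKKLLKKLL" + "DDEEDDEE"
hits = scan_protein(toy_protein)
```

    In practice the same loop would run over every entry of a protein database, and overlapping hits would be merged before any experimental follow-up.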

  8. Probing protein sequences as sources for encrypted antimicrobial peptides.

    PubMed

    Brand, Guilherme D; Magalhães, Mariana T Q; Tinoco, Maria L P; Aragão, Francisco J L; Nicoli, Jacques; Kelly, Sharon M; Cooper, Alan; Bloch, Carlos

    2012-01-01

    Starting from the premise that a wealth of potentially biologically active peptides may lurk within proteins, we describe here a methodology to identify putative antimicrobial peptides encrypted in protein sequences. Candidate peptides were identified using a new screening procedure based on physicochemical criteria to reveal matching peptides within protein databases. Fifteen such peptides, along with a range of natural antimicrobial peptides, were examined using DSC and CD to characterize their interaction with phospholipid membranes. Principal component analysis of DSC data shows that the investigated peptides group according to their effects on the main phase transition of phospholipid vesicles, and that these effects correlate both to antimicrobial activity and to the changes in peptide secondary structure. Consequently, we have been able to identify novel antimicrobial peptides from larger proteins not hitherto associated with such activity, mimicking endogenous and/or exogenous microorganism enzymatic processing of parent proteins to smaller bioactive molecules. A biotechnological application for this methodology is explored. Soybean (Glycine max) plants, transformed to include a putative antimicrobial protein fragment encoded in its own genome were tested for tolerance against Phakopsora pachyrhizi, the causative agent of the Asian soybean rust. This procedure may represent an inventive alternative to the transgenic technology, since the genetic material to be used belongs to the host organism and not to exogenous sources.

  9. Substrate-Driven Mapping of the Degradome by Comparison of Sequence Logos

    PubMed Central

    Fuchs, Julian E.; von Grafenstein, Susanne; Huber, Roland G.; Kramer, Christian; Liedl, Klaus R.

    2013-01-01

    Sequence logos are frequently used to illustrate substrate preferences and specificity of proteases. Here, we employed the compiled substrates of the MEROPS database to introduce a novel metric for comparison of protease substrate preferences. The constructed similarity matrix of 62 proteases can be used to intuitively visualize similarities in protease substrate readout via principal component analysis and construction of protease specificity trees. Since our new metric is solely based on substrate data, we can engraft the protease tree including proteolytic enzymes of different evolutionary origin. Thereby, our analyses confirm pronounced overlaps in substrate recognition not only between proteases closely related on sequence basis but also between proteolytic enzymes of different evolutionary origin and catalytic type. To illustrate the applicability of our approach we analyze the distribution of targets of small molecules from the ChEMBL database in our substrate-based protease specificity trees. We observe a striking clustering of annotated targets in tree branches even though these grouped targets do not necessarily share similarity on protein sequence level. This highlights the value and applicability of knowledge acquired from peptide substrates in drug design of small molecules, e.g., for the prediction of off-target effects or drug repurposing. Consequently, our similarity metric allows to map the degradome and its associated drug target network via comparison of known substrate peptides. The substrate-driven view of protein-protein interfaces is not limited to the field of proteases but can be applied to any target class where a sufficient amount of known substrate data is available. PMID:24244149

  10. A graphic representation of protein sequence and predicting the subcellular locations of prokaryotic proteins.

    PubMed

    Feng, Zhi-Ping; Zhang, Chun-Ting

    2002-03-01

    The Zp curve, a three-dimensional space-curve representation of a protein primary sequence based on the hydrophobicity and charge properties of amino acid residues along the sequence, is suggested. Relying on the Zp parameters extracted from the three components of the Zp curve and the Bayes discriminant algorithm, the subcellular locations of prokaryotic proteins were predicted. An accuracy of 81.5% in the cross-validation test was achieved using 13 parameters extracted from the curve for the database of 997 prokaryotic proteins. The result is slightly better than that of the neural network method (80.9%) based on the amino acid composition for the same database. By combining the amino acid composition and the Zp parameters, an overall predictive accuracy of 89.6% can be achieved. It is about 3% higher than that of the Bayes discriminant algorithm based merely on the amino acid composition for the same database. The prediction is also performed with a larger dataset derived from version 39 of the SWISS-PROT databank and two datasets with different sequence similarity. Even for the dataset with no sequence similarity, the improvement can be 4.4% in the cross-validation test. The results indicate that the Zp parameters are effective in representing the information within a protein primary sequence. The method of extracting information from the primary structure may be useful for other areas of protein studies.

  11. Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure.

    PubMed

    Capra, John A; Laskowski, Roman A; Thornton, Janet M; Singh, Mona; Funkhouser, Thomas A

    2009-12-01

    Identifying a protein's functional sites is an important step towards characterizing its molecular function. Numerous structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone. We also demonstrate that ConCavity has state-of-the-art performance in predicting catalytic sites and drug binding pockets. Overall, the algorithms and analysis presented here significantly improve our ability to identify ligand binding sites and further advance our understanding of the relationship between evolutionary sequence conservation and structural and functional attributes of proteins. Data, source code, and prediction visualizations are available on the ConCavity web site (http://compbio.cs.princeton.edu/concavity/).

  12. Origin and spread of photosynthesis based upon conserved sequence features in key bacteriochlorophyll biosynthesis proteins.

    PubMed

    Gupta, Radhey S

    2012-11-01

    The origin of photosynthesis and how this capability has spread to other bacterial phyla remain important unresolved questions. I describe here a number of conserved signature indels (CSIs) in key proteins involved in bacteriochlorophyll (Bchl) biosynthesis that provide important insights in these regards. The proteins BchL and BchX, which are essential for Bchl biosynthesis, are derived by gene duplication in a common ancestor of all phototrophs. More ancient gene duplication gave rise to the BchX-BchL proteins and the NifH protein of the nitrogenase complex. The sequence alignment of NifH-BchX-BchL proteins contain two CSIs that are uniquely shared by all NifH and BchX homologs, but not by any BchL homologs. These CSIs and phylogenetic analysis of NifH-BchX-BchL protein sequences strongly suggest that the BchX homologs are ancestral to BchL and that the Bchl-based anoxygenic photosynthesis originated prior to the chlorophyll (Chl)-based photosynthesis in cyanobacteria. Another CSI in the BchX-BchL sequence alignment that is uniquely shared by all BchX homologs and the BchL sequences from Heliobacteriaceae, but absent in all other BchL homologs, suggests that the BchL homologs from Heliobacteriaceae are primitive in comparison to all other photosynthetic lineages. Several other identified CSIs in the BchN homologs are commonly shared by all proteobacterial homologs and a clade consisting of the marine unicellular Cyanobacteria (Clade C). These CSIs in conjunction with the results of phylogenetic analyses and pair-wise sequence similarity on the BchL, BchN, and BchB proteins, where the homologs from Clade C Cyanobacteria and Proteobacteria exhibited close relationship, provide strong evidence that these two groups have incurred lateral gene transfers. Additionally, phylogenetic analyses and several CSIs in the BchL-N-B proteins that are uniquely shared by all Chlorobi and Chloroflexi homologs provide evidence that the genes for these proteins have also been

  13. Systematic and fully automated identification of protein sequence patterns.

    PubMed

    Hart, R K; Royyuru, A K; Stolovitzky, G; Califano, A

    2000-01-01

    We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Third, our results show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.

  14. Sequence Heterogeneity Accelerates Protein Search for Targets on DNA

    NASA Astrophysics Data System (ADS)

    Shvets, Alexey; Kolomeisky, Anatoly

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry and heterogeneity of a genome. The work was supported by the Welch Foundation (Grant C-1559), by the NSF (Grant CHE-1360979), and by the Center for Theoretical Biological Physics sponsored by the NSF (Grant PHY-1427654).

  15. Quantitative assessment of protein function prediction from metagenomics shotgun sequences.

    PubMed

    Harrington, E D; Singh, A H; Doerks, T; Letunic, I; von Mering, C; Jensen, L J; Raes, J; Bork, P

    2007-08-28

    To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.

  16. Quantification of the variation in percentage identity for protein sequence alignments

    PubMed Central

    Raghava, GPS; Barton, Geoffrey J

    2006-01-01

    Background Percentage Identity (PID) is frequently quoted in discussion of sequence alignments since it appears simple and easy to understand. However, although there are several different ways to calculate percentage identity and each may yield a different result for the same alignment, the method of calculation is rarely reported. Accordingly, quantification of the variation in PID caused by the different calculations would help in interpreting PID values in the literature. In this study, the variation in PID was quantified systematically on a reference set of 1028 alignments generated by comparison of the protein three-dimensional structures. Since the alignment algorithm may also affect the range of PID, this study also considered the effect of algorithm, and the combination of algorithm and PID method. Results The maximum variation in PID due to the calculation method was 11.5% while the effect of alignment algorithm on PID was up to 14.6% across three popular alignment methods. The combined effect of alignment algorithm and PID calculation gave a variation of up to 22% on the test data, with an average of 5.3% ± 2.8% for sequence pairs with < 30% identity. In order to see which PID method was most highly correlated with structural similarity, four different PID calculations were compared to similarity scores (Sc) from the comparison of the corresponding protein three-dimensional structures. The highest correlation coefficient for a PID calculation was 0.80. In contrast, the more sophisticated Z-score calculated by reference to randomized sequences gave a correlation coefficient of 0.84. Conclusion Although it is well known amongst expert sequence analysts that PID is a poor score for discriminating between protein sequences, the apparent simplicity of the percentage identity score encourages its widespread use in establishing cutoffs for structural similarity. This paper illustrates that not only is PID a poor measure of sequence similarity when compared to
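    The point that one alignment can yield several percentage identities can be made concrete with a short sketch. The four denominators below are common conventions chosen for illustration and may not correspond exactly to the four calculations compared in the study; the function name and the toy alignment are likewise assumptions.

```python
def percent_identities(aligned_a: str, aligned_b: str) -> dict[str, float]:
    """Compute PID for one gapped pairwise alignment under four denominators.

    Both inputs are rows of the same alignment, with '-' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    columns = list(zip(aligned_a, aligned_b))
    # Identical residue pairs (a gap never counts as a match).
    identical = sum(1 for x, y in columns if x == y and x != "-")
    # Columns where both sequences contribute a residue.
    residue_pairs = sum(1 for x, y in columns if x != "-" and y != "-")
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))

    denominators = {
        "aligned_residue_pairs": residue_pairs,
        "alignment_length": len(columns),        # includes gapped columns
        "shorter_sequence": min(len_a, len_b),
        "mean_sequence_length": (len_a + len_b) / 2,
    }
    return {name: 100.0 * identical / d for name, d in denominators.items()}


# Toy alignment: 7 identities, 1 mismatch, 2 gapped columns.
pids = percent_identities("ACDEFGHIK-", "ACD-FGHLKM")
```

    For this toy alignment the reported value ranges from 70% to 87.5% depending only on the denominator, which is why stating the calculation method alongside any quoted PID matters.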

  17. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

    PubMed

    Pruitt, Kim D; Tatusova, Tatiana; Maglott, Donna R

    2005-01-01

    The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) provides a non-redundant collection of sequences representing genomic data, transcripts and proteins. Although the goal is to provide a comprehensive dataset representing the complete sequence information for any given species, the database pragmatically includes sequence data that are currently publicly available in the archival databases. The database incorporates data from over 2400 organisms and includes over one million proteins representing significant taxonomic diversity spanning prokaryotes, eukaryotes and viruses. Nucleotide and protein sequences are explicitly linked, and the sequences are linked to other resources including the NCBI Map Viewer and Gene. Sequences are annotated to include coding regions, conserved domains, variation, references, names, database cross-references, and other features using a combined approach of collaboration and other input from the scientific community, automated annotation, propagation from GenBank and curation by NCBI staff.

  18. Protein Sequence Alignment Taking the Structure of Peptide Bond

    NASA Astrophysics Data System (ADS)

    Hara, Toshihide; Sato, Keiko; Ohya, Masanori

    2013-01-01

    In a previous paper [1] we proposed a new method for performing pairwise alignment of protein sequences. The method, called MTRAP, achieves the highest performance compared with other alignment methods such as ClustalW2 [2,3] on two benchmarks for alignment accuracy. In this paper, we introduce a new measure between two amino acids based on the formation of peptide bonds. The measure is implemented into MTRAP software to further improve alignment accuracy. Our alignment software is available at

  19. Sequence-Specific Solvent Accessibilities of Protein Residues in Unfolded Protein Ensembles

    PubMed Central

    Bernadó, Pau; Blackledge, Martin; Sancho, Javier

    2006-01-01

    Protein stability cannot be understood without the correct description of the unfolded state. We present here an efficient method for accurate calculation of atomic solvent exposures for denatured protein ensembles. The method used to generate the ensembles has been shown to reproduce diverse biophysical experimental data corresponding to natively and chemically unfolded proteins. Using a data set of 19 nonhomologous proteins containing from 98 to 579 residues, we report average accessibilities for all residue types. These averaged accessibilities are considerably lower than those previously reported for tripeptides and close to the lower limit reported by Creamer and co-workers. Of importance, we observe remarkable sequence dependence for the exposure to solvent of all residue types, which indicates that average residue solvent exposures can be inappropriate to interpret mutational studies. In addition, we observe smaller influences of both protein size and protein amino acid composition in the averaged residue solvent exposures for individual proteins. Calculating residue-specific solvent accessibilities within the context of real sequences is thus necessary and feasible. The approach presented here may allow a more precise parameterization of protein energetics as a function of polar- and apolar-area burial and opens new ways to investigate the energetics of the unfolded state of proteins. PMID:17012314

  20. No Genome-Wide Protein Sequence Convergence for Echolocation

    PubMed Central

    Zou, Zhengting; Zhang, Jianzhi

    2015-01-01

    Toothed whales and two groups of bats independently acquired echolocation, the ability to locate and identify objects by reflected sound. Echolocation requires physiologically complex and coordinated vocal, auditory, and neural functions, but the molecular basis of the capacity for echolocation is not well understood. A recent study suggested that convergent amino acid substitutions widespread in the proteins of echolocators underlay the convergent origins of mammalian echolocation. Here, we show that genomic signatures of molecular convergence between echolocating lineages are generally no stronger than those between echolocating and comparable nonecholocating lineages. The same is true for the group of 29 hearing-related proteins claimed to be enriched with molecular convergence. Reexamining the previous selection test reveals several flaws and invalidates the asserted evidence for adaptive convergence. Together, these findings indicate that the reported genomic signatures of convergence largely reflect the background level of sequence convergence unrelated to the origins of echolocation. PMID:25631925

  1. No genome-wide protein sequence convergence for echolocation.

    PubMed

    Zou, Zhengting; Zhang, Jianzhi

    2015-05-01

    Toothed whales and two groups of bats independently acquired echolocation, the ability to locate and identify objects by reflected sound. Echolocation requires physiologically complex and coordinated vocal, auditory, and neural functions, but the molecular basis of the capacity for echolocation is not well understood. A recent study suggested that convergent amino acid substitutions widespread in the proteins of echolocators underlay the convergent origins of mammalian echolocation. Here, we show that genomic signatures of molecular convergence between echolocating lineages are generally no stronger than those between echolocating and comparable nonecholocating lineages. The same is true for the group of 29 hearing-related proteins claimed to be enriched with molecular convergence. Reexamining the previous selection test reveals several flaws and invalidates the asserted evidence for adaptive convergence. Together, these findings indicate that the reported genomic signatures of convergence largely reflect the background level of sequence convergence unrelated to the origins of echolocation.

  2. Engineering the Dynamic Properties of Protein Networks through Sequence Variation

    PubMed Central

    2016-01-01

    The dynamic behavior of macromolecular networks dominates the mechanical properties of soft materials and influences biological processes at multiple length scales. In hydrogels prepared from self-assembling artificial proteins, stress relaxation and energy dissipation arise from the transient character of physical network junctions. Here we show that subtle changes in sequence can be used to program the relaxation behavior of end-linked networks of engineered coiled-coil proteins. Single-site substitutions in the coiled-coil domains caused shifts in relaxation time over 5 orders of magnitude as demonstrated by dynamic oscillatory shear rheometry and stress relaxation measurements. Networks with multiple relaxation time scales were also engineered. This work demonstrates how time-dependent mechanical responses of macromolecular materials can be encoded in genetic information. PMID:27924309

  3. An Integrated Sequence-Structure Database incorporating matching mRNA sequence, amino acid sequence and protein three-dimensional structure data.

    PubMed Central

    Adzhubei, I A; Adzhubei, A A; Neidle, S

    1998-01-01

    We have constructed a non-homologous database, termed the Integrated Sequence-Structure Database (ISSD) which comprises the coding sequences of genes, amino acid sequences of the corresponding proteins, their secondary structure and straight phi,psi angles assignments, and polypeptide backbone coordinates. Each protein entry in the database holds the alignment of nucleotide sequence, amino acid sequence and the PDB three-dimensional structure data. The nucleotide and amino acid sequences for each entry are selected on the basis of exact matches of the source organism and cell environment. The current version 1.0 of ISSD is available on the WWW at http://www.protein.bio.msu.su/issd/ and includes 107 non-homologous mammalian proteins, of which 80 are human proteins. The database has been used by us for the analysis of synonymous codon usage patterns in mRNA sequences showing their correlation with the three-dimensional structure features in the encoded proteins. Possible ISSD applications include optimisation of protein expression, improvement of the protein structure prediction accuracy, and analysis of evolutionary aspects of the nucleotide sequence-protein structure relationship. PMID:9399866

  4. Sequence comparison on a cluster of workstations using the PVM system

    SciTech Connect

    Guan, X.; Mural, R.J.; Uberbacher, E.C.

    1995-02-01

    We have implemented a distributed sequence comparison algorithm on a cluster of workstations using the PVM paradigm. This implementation has achieved performance similar to that of the Intel iPSC/860 Hypercube, a massively parallel computer. The distributed sequence comparison algorithm serves as a search tool for two Internet servers, GRAIL and GENQUEST. This paper describes the implementation and the performance of the algorithm.

  5. Sequence studies of proteins from larval and pupal cuticle of the yellow meal worm, Tenebrio molitor.

    PubMed

    Andersen, S O; Rafn, K; Roepstorff, P

    1997-02-01

    Complete amino acid sequences have been determined for six larval-pupal cuticular proteins from Tenebrio molitor. The sequenced proteins are major components in both larval and pupal cuticle, and both basic and slightly acidic proteins are represented. The proteins show pronounced similarities to some of the proteins sequenced from other insect cuticles. Three slightly acidic larval-pupal Tenebrio cuticular proteins contain a 66-residue central, hydrophilic region, resembling regions in cuticular proteins from insect species of four different orders (Coleoptera, Diptera, Lepidoptera and Orthoptera), and three basic proteins from larval-pupal Tenebrio cuticle have a 51-residue hydrophilic region in common with two proteins from cuticle of pharate adult locusts (Locusta migratoria). The Tenebrio larval-pupal cuticular proteins are also similar to locust adult cuticular proteins, by frequent occurrence of the short sequence motif Ala-Ala-Pro-Ala/Val. The pronounced sequence similarities between cuticular proteins from different insect orders indicate that the conserved regions are functionally important.

  6. Sequence analysis and location of capsid proteins within RNA 2 of strawberry latent ringspot virus.

    PubMed

    Kreiah, S; Strunk, G; Cooper, J I

    1994-09-01

    The nucleotide sequence of the RNA 2 of a strawberry isolate (H) of strawberry latent ringspot virus (SLRSV) comprised 3824 nucleotides and contained one long open reading frame with a theoretical coding capacity of 890 amino acids equivalent to a protein of 98.8K. The N-terminal amino acid sequences of virion-derived proteins were determined by Edman degradation allowing the capsid coding regions to be located and serine/glycine cleavage sites to be identified within the polyprotein. The amino acid sequence in the capsid coding region of an isolate of SLRSV from flowering cherry in New Zealand was 97% identical to that of SLRSV-H. Except in the 3' and 5' terminal non-coding sequences, computer-based alignment and comparison algorithms did not reveal any substantial homologies between RNA 2 of SLRSV-H and the equivalent genomic segments in the nepoviruses arabis mosaic, cherry leaf roll, grapevine fanleaf, raspberry ringspot, grapevine hungarian chrome mosaic, tomato blackring, tomato ringspot, tobacco ringspot, or in the comoviruses cowpea mosaic and red clover mottle. Despite the similarities in overall genome organization, data from RNA 2 remain insufficient for unambiguous positioning of SLRSV in relation to species/genera in the Comoviridae.

  7. Sequence similarity network reveals common ancestry of multidomain proteins.

    PubMed

    Song, Nan; Joseph, Jacob M; Davis, George B; Durand, Dannie

    2008-05-16

    We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain
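    One way to picture a score that exploits the local structure of a sequence similarity network is to compare the similarity profiles of two sequences across all nodes rather than looking only at their direct pairwise score. The sketch below uses a Pearson correlation of score profiles; this formulation, the function name, and the toy scores are assumptions made for illustration and are not taken from the abstract.

```python
from math import sqrt


def profile_correlation(scores: dict[str, dict[str, float]], x: str, y: str) -> float:
    """Pearson correlation between the similarity profiles of x and y.

    scores[u][v] is a similarity score between sequences u and v;
    missing entries are treated as 0 (no detectable similarity).
    """
    nodes = sorted(scores)
    px = [scores[x].get(n, 0.0) for n in nodes]
    py = [scores[y].get(n, 0.0) for n in nodes]
    mean_x = sum(px) / len(px)
    mean_y = sum(py) / len(py)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(px, py))
    var_x = sum((a - mean_x) ** 2 for a in px)
    var_y = sum((b - mean_y) ** 2 for b in py)
    if var_x == 0 or var_y == 0:
        return 0.0  # flat profile: correlation undefined, report 0
    return cov / sqrt(var_x * var_y)


# Toy network: A and B share a neighbourhood, as do C and D.
toy_scores = {
    "A": {"A": 100, "B": 80, "C": 5, "D": 0},
    "B": {"A": 80, "B": 100, "C": 0, "D": 5},
    "C": {"A": 5, "B": 0, "C": 100, "D": 70},
    "D": {"A": 0, "B": 5, "C": 70, "D": 100},
}
same_family = profile_correlation(toy_scores, "A", "B")
different_family = profile_correlation(toy_scores, "A", "C")
```

    In this toy network the A-B pair correlates strongly while the A-C pair correlates negatively, even though A and C have a small nonzero direct score of the kind a shared inserted domain might produce.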

  8. Mapping protein-DNA interactions using ChIP-sequencing.

    PubMed

    Massie, Charles E; Mills, Ian G

    2012-01-01

    Chromatin immunoprecipitation (ChIP) allows enrichment of genomic regions which are associated with specific transcription factors, histone modifications, and indeed any other epitopes which are present on chromatin. The original ChIP methods used site-specific PCR and Southern blotting to confirm which regions of the genome were enriched, on a candidate basis. The combination of ChIP with genomic tiling arrays (ChIP-chip) allowed a more unbiased approach to map ChIP-enriched sites. However, limitations of microarray probe design and probe number have a detrimental impact on the coverage, resolution, sensitivity, and cost of whole-genome tiling microarray sets for higher eukaryotes with large genomes. The combination of ChIP with high-throughput sequencing technology has allowed more comprehensive surveys of genome occupancy, greater resolution, and lower cost for whole genome coverage. Herein, we provide a comparison of high-throughput sequencing platforms and a survey of ChIP-seq analysis tools, discuss experimental design, and describe a detailed ChIP-seq method.

  9. Cloning and sequencing of a cDNA encoding a heat-stable sweet protein, mabinlin II.

    PubMed

    Nirasawa, S; Masuda, Y; Nakaya, K; Kurihara, Y

    1996-11-28

    A cDNA clone encoding a heat-stable sweet protein, mabinlin II (MAB), was isolated and sequenced. The encoded precursor to MAB was composed of 155 amino acid (aa) residues, including a signal sequence of 20 aa, an N-terminal extension peptide of 15 aa, a linker peptide of 14 aa and one residue of C-terminal extension. Comparison of the proteolytic cleavage sites during post-translational processing of the MAB precursor with those of similar 2S seed-storage proteins of Arabidopsis thaliana, Brassica napus and Bertholletia excelsa shows that the three individual cleavage sites are conserved among these species.

  10. QOMA: quasi-optimal multiple alignment of protein sequences.

    PubMed

    Zhang, Xu; Kahveci, Tamer

    2007-01-15

    We consider the problem of multiple alignment of protein sequences with the goal of achieving a large SP (Sum-of-Pairs) score. We introduce a new graph-based method, which we name QOMA (Quasi-Optimal Multiple Alignment). QOMA starts with an initial alignment and represents it using a K-partite graph. It then improves the SP score of the initial alignment through local optimizations within a window that moves greedily along the alignment. QOMA uses two parameters to permit flexibility in the time/accuracy trade-off: (1) the size of the window for local optimization and (2) the sparsity of the K-partite graph. Unlike traditional progressive methods, QOMA is independent of the order of sequences. The experimental results on BAliBASE benchmarks show that QOMA produces a higher SP score than existing tools including ClustalW, Probcons, Muscle, T-Coffee and DCA. The difference is more significant for distant proteins. The software is available from the authors upon request.
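
    For readers unfamiliar with the objective being optimized, the following is a minimal sketch of how a Sum-of-Pairs score can be computed for a given alignment. The match, mismatch and gap values are toy assumptions chosen for illustration; the abstract does not state the scoring scheme QOMA uses, and a real evaluation would normally use a substitution matrix.

```python
from itertools import combinations

# Toy scoring scheme (an assumption for illustration, not QOMA's parameters).
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one aligned pair of symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0  # gap-gap pairs are conventionally ignored
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sp_score(alignment: list[str]) -> int:
    """Sum-of-Pairs score: add pair_score over every pair of rows in every column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )


alignment = ["MKT-A", "MKTLA", "M-TLA"]
print(sp_score(alignment))
```

    Because the score is a sum over columns, a local re-alignment inside a window changes only the terms for the columns in that window, which is what makes window-based local optimization of the SP score practical.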

  11. Deciphering the Hidden Informational Content of Protein Sequences

    PubMed Central

    Liu, Ming; Hua, Qing-xin; Hu, Shi-Quan; Jia, Wenhua; Yang, Yanwu; Saith, Sunil Evan; Whittaker, Jonathan; Arvan, Peter; Weiss, Michael A.

    2010-01-01

    Protein sequences encode both structure and foldability. Whereas the interrelationship of sequence and structure has been extensively investigated, the origins of folding efficiency are enigmatic. We demonstrate that the folding of proinsulin requires a flexible N-terminal hydrophobic residue that is dispensable for the structure, activity, and stability of the mature hormone. This residue (PheB1 in placental mammals) is variably positioned within crystal structures and exhibits 1H NMR motional narrowing in solution. Despite such flexibility, its deletion impaired insulin chain combination and led in cell culture to formation of non-native disulfide isomers with impaired secretion of the variant proinsulin. Cellular folding and secretion were maintained by hydrophobic substitutions at B1 but markedly perturbed by polar or charged side chains. We propose that, during folding, a hydrophobic side chain at B1 anchors transient long-range interactions by a flexible N-terminal arm (residues B1–B8) to mediate kinetic or thermodynamic partitioning among disulfide intermediates. Evidence for the overall contribution of the arm to folding was obtained by alanine scanning mutagenesis. Together, our findings demonstrate that efficient folding of proinsulin requires N-terminal sequences that are dispensable in the native state. Such arm-dependent folding can be abrogated by mutations associated with β-cell dysfunction and neonatal diabetes mellitus. PMID:20663888

  12. Fibronectin-binding protein of Streptococcus pyogenes: sequence of the binding domain involved in adherence of streptococci to epithelial cells.

    PubMed Central

    Talay, S R; Valentin-Weigand, P; Jerlström, P G; Timmis, K N; Chhatwal, G S

    1992-01-01

    The sequence of the fibronectin-binding domain of the fibronectin-binding protein of Streptococcus pyogenes (Sfb protein) was determined, and its role in streptococcal adherence was investigated by use of an Sfb fusion protein in adherence studies. A 1-kb DNA fragment coding for the binding domain of Sfb protein was cloned into the expression vector pEX31 to produce an Sfb fusion protein consisting of the N-terminal part of MS2 polymerase and a C-terminal fragment of the streptococcal protein. Induction of the vector promoter resulted in hyperexpression of fibronectin-binding fusion protein in the cytoplasm of the recombinant Escherichia coli cells. Sequence determination of the cloned 1-kb fragment revealed an in-frame reading frame for a 268-amino-acid peptide composed of a 37-amino-acid sequence which is completely repeated three times and incompletely repeated a fourth time. Cloning of one repeat into pEX31 resulted in expression of small fusion peptides that show fibronectin-binding activity, indicating that one repeat contains at least one binding domain. Each repeat exhibits two charged domains and shows high homology with the 38-amino-acid D3 repeat of the fibronectin-binding protein of Staphylococcus aureus. Sequence comparison with other streptococcal ligand-binding surface proteins, including M protein, failed to reveal significant homology, which suggests that Sfb protein represents a novel type of functional protein in S. pyogenes. The Sfb fusion protein isolated from the cytoplasm of recombinant cells was purified by fast protein liquid chromatography. It showed a strong competitive inhibition of fibronectin binding to S. pyogenes and of the adherence of bacteria to cultured epithelial cells. In contrast, purified streptococcal lipoteichoic acid showed only a weak inhibition of fibronectin binding and streptococcal adherence. These results demonstrate that Sfb protein is directly involved in the fibronectin-mediated adherence of S. pyogenes to

  13. Comparison of next generation sequencing technologies for transcriptome characterization

    PubMed Central

    2009-01-01

    Background We have developed a simulation approach to help determine the optimal mixture of sequencing methods for the most complete and cost-effective transcriptome sequencing. We compared simulation results for traditional capillary sequencing with "Next Generation" (NG) ultra high-throughput technologies. The simulation model was parameterized using mappings of 130,000 cDNA sequence reads to the Arabidopsis genome (NCBI Accession SRA008180.19). We also generated 454-GS20 sequences and de novo assemblies for the basal eudicot California poppy (Eschscholzia californica) and the magnoliid avocado (Persea americana) using a variety of methods for cDNA synthesis. Results The Arabidopsis reads tagged more than 15,000 genes, including new splice variants and extended UTR regions. Of the total 134,791 reads (13.8 MB), 119,518 (88.7%) mapped exactly to known exons, while 1,117 (0.8%) mapped to introns, 11,524 (8.6%) spanned annotated intron/exon boundaries, and 3,066 (2.3%) extended beyond the end of annotated UTRs. Sequence-based inference of relative gene expression levels correlated significantly with microarray data. As expected, NG sequencing of normalized libraries tagged more genes than non-normalized libraries, although non-normalized libraries yielded more full-length cDNA sequences. The Arabidopsis data were used to simulate additional rounds of NG and traditional EST sequencing, and various combinations of each. Our simulations suggest a combination of FLX and Solexa sequencing for optimal transcriptome coverage at modest cost. We have also developed ESTcalc (http://fgp.huck.psu.edu/NG_Sims/ngsim.pl), an online web tool that allows users to explore the results of this study by specifying individualized costs and sequencing characteristics. Conclusion NG sequencing technologies are a highly flexible set of platforms that can be scaled to suit different project goals. In terms of sequence coverage alone, NG sequencing is a dramatic advance over capillary

  14. 3-d structure-based amino acid sequence alignment of esterases, lipases and related proteins

    SciTech Connect

    Gentry, M.K.; Doctor, B.P.; Cygler, M.; Schrag, J.D.; Sussman, J.L.

    1993-05-13

    Acetylcholinesterase and butyrylcholinesterase, enzymes with potential as pretreatment drugs for organophosphate toxicity, are members of a larger family of homologous proteins that includes carboxylesterases, cholesterol esterases, lipases, and several nonhydrolytic proteins. A computer-generated alignment of 18 of the proteins, the acetylcholinesterases, butyrylcholinesterases, carboxylesterases, some esterases, and the nonenzymatic proteins, has been previously presented. More recently, the three-dimensional structures of two enzymes in this group, acetylcholinesterase from Torpedo californica and lipase from Geotrichum candidum, have been determined. Based on the x-ray structures and the superposition of these two enzymes, it was possible to obtain an improved amino acid sequence alignment of 32 members of this family of proteins. Examination of this alignment reveals that 24 amino acids are invariant in all of the hydrolytic proteins, and an additional 49 are well conserved. Conserved amino acids include those of the active site, the disulfide bridges, the salt bridges, the core of the proteins, and the edges of secondary structural elements. Comparison of the three-dimensional structures makes it possible to find a well-defined structural basis for the conservation of many of these amino acids.

  15. Comparison of the Folding Mechanism of Highly Homologous Proteins in the Lipid-binding Protein Family

    EPA Science Inventory

    The folding mechanisms of two closely related proteins in the intracellular lipid binding protein family, human bile acid binding protein (hBABP) and rat bile acid binding protein (rBABP), were examined. These proteins are 77% identical (93% similar) in sequence. Both of these singl...

  17. Nucleotide sequence analysis of the coat protein genes of two Korean isolates of sweet potato feathery mottle potyvirus.

    PubMed

    Ryu, K H; Kim, S J; Park, W M

    1998-01-01

    The coat protein (CP) genes of the genomic RNA of two Korean isolates of sweet potato feathery mottle potyvirus (SPFMV), SPFMV-K1 and SPFMV-K2, were cloned and their complete nucleotide sequences were determined. Sequence comparisons of the two Korean isolates showed 97.8% amino acid identity in the CP cistron, and 79.9% to 99.0% identity with those of 6 other known SPFMV strains. Of a total of 74 amino acid changes among the SPFMV strains, 39 were located in the N-terminal region. Pairwise amino acid sequence comparison revealed sequence similarities of 48.6% to 70.2% between SPFMV and 20 other potyviruses, indicating that SPFMV is a quite distinct species. Multiple alignment of the CP cistrons from other potyviruses showed that most of the conserved amino acid residues of the genus Potyvirus are well preserved in the corresponding locations.
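
    Identity percentages such as those quoted above are computed from a pairwise alignment. The sketch below shows one common convention, counting identical residues over the columns where neither sequence has a gap. The abstract does not say which denominator the authors used (alignment length and shorter-sequence length are also in use), so this choice and the function name are assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns under this convention
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * identical / compared


# Two toy aligned fragments that differ at one of six positions.
print(round(percent_identity("ACDEFG", "ACDQFG"), 1))
```

    Reported values are only comparable across studies when the same denominator is used, which is worth checking before comparing figures such as 97.8% and 79.9% with numbers from other papers.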

  18. Protein multiple sequence alignment by hybrid bio-inspired algorithms.

    PubMed

    Cutello, Vincenzo; Nicosia, Giuseppe; Pavone, Mario; Prizzi, Igor

    2011-03-01

    This article presents an immune-inspired algorithm to tackle the Multiple Sequence Alignment (MSA) problem. MSA is one of the most important tasks in biological sequence analysis. Although this paper focuses on protein alignments, most of the discussion and methodology may also be applied to DNA alignments. The problem of finding the multiple alignment was investigated in the studies by Bonizzoni and Vedova and by Wang and Jiang, and proved to be an NP-hard (non-deterministic polynomial-time hard) problem. The presented algorithm, called Immunological Multiple Sequence Alignment Algorithm (IMSA), incorporates two new strategies to create the initial population and specific ad hoc mutation operators. It uses the 'weighted sum of pairs' as the objective function to evaluate a given candidate alignment. IMSA was tested using the classical BAliBASE benchmarks (versions 1.0, 2.0 and 3.0), and experimental results indicate that it is comparable with state-of-the-art multiple alignment algorithms in terms of quality of alignments, weighted Sum-of-Pairs (SP) and Column Score (CS) values. The main novelty of IMSA is its ability to generate more than a single suboptimal alignment for every MSA instance; this behaviour is due to the stochastic nature of the algorithm and of the populations evolved during the convergence process. This feature will help the decision maker to assess and select a biologically relevant multiple sequence alignment. Finally, the designed algorithm can be used as a local search procedure to properly explore promising alignments of the search space.

  19. Sequence of a cDNA encoding nitrite reductase from the tree Betula pendula and identification of conserved protein regions.

    PubMed

    Friemann, A; Brinkmann, K; Hachtel, W

    1992-02-01

    The sequence of an mRNA encoding nitrite reductase (NiR, EC 1.7.7.1.) from the tree Betula pendula was determined. A cDNA library constructed from leaf poly(A)+ mRNA was screened with an oligonucleotide probe deduced from NiR sequences from spinach and maize. A 2.5 kb cDNA was isolated that hybridized to an mRNA, the steady-state level of which increased markedly upon induction with nitrate. The nucleotide sequence of the cDNA contains a reading frame encoding a protein of 583 amino acids that reveals 79% identity with NiR from spinach. The transit peptide of the NiR precursor from birch was determined to be 22 amino acids in size by sequence comparison with NiR from spinach and maize and is the shortest transit peptide reported so far. A graphical evaluation of identities found in the NiR sequence alignment revealed nine well conserved sections each exceeding ten amino acids in size. Sequence comparisons with related redox proteins identified essential residues involved in cofactor binding. A putative binding site for ferredoxin was found in the N-terminal half of the protein.

  20. Partial amino acid sequence of human pancreatic stone protein, a novel pancreatic secretory protein.

    PubMed Central

    Montalto, G; Bonicel, J; Multigner, L; Rovery, M; Sarles, H; De Caro, A

    1986-01-01

    Pancreatic stone protein (PSP) is the major organic component of human pancreatic stones. With the use of monoclonal antibody immunoadsorbents, five immunoreactive forms (PSP-S) with close Mr values (14,000-19,000) were isolated from normal pancreatic juice. By CM-Trisacryl M chromatography the lowest-Mr form (PSP-S1) was separated from the others and some of its molecular characteristics were investigated. The Mr of the PSP-S1 polypeptide chain calculated from the amino acid composition was about 16,100. The N-terminal sequences (40 residues) of PSP and PSP-S1 are identical, which suggests that the peptide backbone is the same for both of these polypeptides. The PSP-S1 sequence was determined up to residue 65 and was found to be different from all other known protein sequences. PMID:3541906

  1. De Novo Sequencing of Top-Down Tandem Mass Spectra: A Next Step towards Retrieving a Complete Protein Sequence

    PubMed Central

    Vyatkina, Kira

    2017-01-01

    De novo sequencing of tandem (MS/MS) mass spectra represents the only way to determine the sequence of proteins from organisms with unknown genomes, or the ones not directly inscribed in a genome—such as antibodies, or novel splice variants. Top-down mass spectrometry provides new opportunities for analyzing such proteins; however, retrieving a complete protein sequence from top-down MS/MS spectra still remains a distant goal. In this paper, we review the state-of-the-art on this subject, and enhance our previously developed Twister algorithm for de novo sequencing of peptides from top-down MS/MS spectra to derive longer sequence fragments of a target protein. PMID:28248257

  2. Transitive homology-guided structural studies lead to discovery of Cro proteins with 40% sequence identity but different folds.

    PubMed

    Roessler, Christian G; Hall, Branwen M; Anderson, William J; Ingram, Wendy M; Roberts, Sue A; Montfort, William R; Cordes, Matthew H J

    2008-02-19

    Proteins that share common ancestry may differ in structure and function because of divergent evolution of their amino acid sequences. For a typical diverse protein superfamily, the properties of a few scattered members are known from experiment. A satisfying picture of functional and structural evolution in relation to sequence changes, however, may require characterization of a larger, well chosen subset. Here, we employ a "stepping-stone" method, based on transitive homology, to target sequences intermediate between two related proteins with known divergent properties. We apply the approach to the question of how new protein folds can evolve from preexisting folds and, in particular, to an evolutionary change in secondary structure and oligomeric state in the Cro family of bacteriophage transcription factors, initially identified by sequence-structure comparison of distant homologs from phages P22 and lambda. We report crystal structures of two Cro proteins, Xfaso 1 and Pfl 6, with sequences intermediate between those of P22 and lambda. The domains show 40% sequence identity but differ by switching of alpha-helix to beta-sheet in a C-terminal region spanning approximately 25 residues. Sedimentation analysis also suggests a correlation between helix-to-sheet conversion and strengthened dimerization.

  3. Transitive Homology-Guided Structural Studies Lead to Discovery of Cro Proteins With 40% Sequence Identity But Different Folds

    SciTech Connect

    Roessler, C.G.; Hall, B.M.; Anderson, W.J.; Ingram, W.M.; Roberts, S.A.; Montfort, W.R.; Cordes, M.H.J.

    2009-05-27

    Proteins that share common ancestry may differ in structure and function because of divergent evolution of their amino acid sequences. For a typical diverse protein superfamily, the properties of a few scattered members are known from experiment. A satisfying picture of functional and structural evolution in relation to sequence changes, however, may require characterization of a larger, well chosen subset. Here, we employ a 'stepping-stone' method, based on transitive homology, to target sequences intermediate between two related proteins with known divergent properties. We apply the approach to the question of how new protein folds can evolve from preexisting folds and, in particular, to an evolutionary change in secondary structure and oligomeric state in the Cro family of bacteriophage transcription factors, initially identified by sequence-structure comparison of distant homologs from phages P22 and λ. We report crystal structures of two Cro proteins, Xfaso 1 and Pfl 6, with sequences intermediate between those of P22 and λ. The domains show 40% sequence identity but differ by switching of α-helix to β-sheet in a C-terminal region spanning approximately 25 residues. Sedimentation analysis also suggests a correlation between helix-to-sheet conversion and strengthened dimerization.

  4. ProML--the protein markup language for specification of protein sequences, structures and families.

    PubMed

    Hanisch, Daniel; Zimmer, Ralf; Lengauer, Thomas

    2002-01-01

    We propose a specification language ProML for protein sequences, structures, and families based on the open XML standard. The language allows for portable, system-independent, machine-parsable and human-readable representation of essential features of proteins. The language is of immediate use for several bioinformatics applications: we discuss clustering of proteins into families and the representation of the specific shared features of the respective clusters. Moreover, we use ProML for specification of data used in fold recognition benchmarks exploiting experimentally derived distance constraints.

  5. A horizontal alignment tool for numerical trend discovery in sequence data: application to protein hydropathy.

    PubMed

    Hadzipasic, Omar; Wrabl, James O; Hilser, Vincent J

    2013-01-01

    An algorithm is presented that returns the optimal pairwise gapped alignment of two sets of signed numerical sequence values. One distinguishing feature of this algorithm is a flexible comparison engine (based on both relative shape and absolute similarity measures) that does not rely on explicit gap penalties. Additionally, an empirical probability model is developed to estimate the significance of the returned alignment with respect to randomized data. The algorithm's utility for biological hypothesis formulation is demonstrated with test cases including database search and pairwise alignment of protein hydropathy. However, the algorithm and probability model could possibly be extended to accommodate other diverse types of protein or nucleic acid data, including positional thermodynamic stability and mRNA translation efficiency. The algorithm requires only numerical values as input and will readily compare data other than protein hydropathy. The tool is therefore expected to complement, rather than replace, existing sequence and structure based tools and may inform medical discovery, as exemplified by proposed similarity between a chlamydial ORFan protein and bacterial colicin pore-forming domain. The source code, documentation, and a basic web-server application are available.
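
    The alignment algorithm itself is not specified in the abstract, but its input is a sequence of signed numerical values. As an illustration of how such input might be prepared for protein hydropathy, the sketch below converts a sequence to a sliding-window average on the Kyte-Doolittle scale. The choice of that scale, the window size and the function name are assumptions made here, not details taken from the paper.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 7) -> list[float]:
    """Mean hydropathy of each window of consecutive residues."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[res] for res in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# A hydrophobic stretch followed by a charged stretch gives a falling profile.
print(hydropathy_profile("IIIIDDDD", window=4))
```

    Two such profiles, one per protein, are the kind of signed numerical sequences that a shape-based comparison could then align.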

  6. Sequence and structural features of carbohydrate binding in proteins and assessment of predictability using a neural network

    PubMed Central

    Malik, Adeel; Ahmad, Shandar

    2007-01-01

    Background Protein-carbohydrate interactions are crucial in many biological processes, with implications for drug targeting and gene expression. The nature of protein-carbohydrate interactions may be studied at the individual residue level by analyzing local sequence and structure environments in binding regions in comparison to non-binding regions, which provide an inherent control for such analyses. With the ultimate aim of predicting binding sites from sequence and structure, overall statistics of binding regions need to be compiled. Sequence-based predictions of binding sites have been successfully applied to DNA-binding proteins in our earlier works. We aim to apply a similar analysis to carbohydrate-binding proteins. However, because a much smaller region of each protein takes part in such interactions, the methodology and results are significantly different. A comparison of protein-carbohydrate complexes has also been made with other protein-ligand complexes. Results We have compiled statistics of amino acid compositions in binding versus non-binding regions, both overall and within each secondary structure conformation. Binding propensities of each of the 20 residue types and their structural features such as solvent accessibility, packing density and secondary structure have been calculated to assess their predisposition to carbohydrate interactions. Finally, evolutionary profiles of amino acid sequences have been used to predict binding sites using a neural network. Another set of neural networks was trained using information from single sequences, and the prediction performance from the evolutionary profiles and single sequences was compared. The best neural-network-based prediction achieved 87% sensitivity at 23% specificity for all carbohydrate-binding sites, using evolutionary information. Single sequences gave 68% sensitivity and 55% specificity for the same data set. Sensitivity and specificity for a limited galactose
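
    The two performance figures quoted in this record trade off against each other, so it helps to see how they are computed. The sketch below assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), applied to per-residue binding labels; the abstract does not confirm which definitions the authors used, and the labels are invented toy data.

```python
def sensitivity_specificity(actual: list[int], predicted: list[int]) -> tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, 1 = binding residue."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have equal length")
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Toy per-residue labels: 3 binding residues, 5 non-binding residues.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(sensitivity_specificity(actual, predicted))
```

    Under these definitions, lowering the decision threshold of a predictor raises sensitivity and lowers specificity, which is one way a profile-based network could reach high sensitivity at low specificity.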

  7. Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset.

    PubMed

    Shi, Ming-Guang; Xia, Jun-Feng; Li, Xue-Ling; Huang, De-Shuang

    2010-03-01

    Identifying protein-protein interactions (PPIs) is critical for understanding the cellular function of the proteins and the machinery of a proteome. PPI data derived from high-throughput technologies are often incomplete and noisy. Therefore, it is important to develop computational methods and high-quality interaction datasets for predicting PPIs. A sequence-based method is proposed that combines correlation coefficient (CC) transformation with a support vector machine (SVM). CC transformation not only adequately considers the neighboring effect of protein sequence but also describes the level of CC between two protein sequences. A gold standard positives (interacting) dataset, MIPS Core, and a gold standard negatives (non-interacting) dataset, GO-NEG, of the yeast Saccharomyces cerevisiae were mined to objectively evaluate the above method and attenuate the bias. The SVM model combined with CC transformation yielded the best performance, with a high accuracy of 87.94% using the gold standard positives and gold standard negatives datasets. The MATLAB source code and the datasets are available on request from smgsmg@mail.ustc.edu.cn.

  8. Large Ribosomal Protein 4 Increases Efficiency of Viral Recoding Sequences

    PubMed Central

    Green, Lisa; Houck-Loomis, Brian; Yueh, Andrew

    2012-01-01

    Expression of retroviral replication enzymes (Pol) requires a controlled translational recoding event to bypass the stop codon at the end of gag. This recoding event occurs either by direct suppression of termination via the insertion of an amino acid at the stop codon (readthrough) or by alteration of the mRNA reading frame (frameshift). Here we report the effects of a host protein, large ribosomal protein 4 (RPL4), on the efficiency of recoding. Using a dual luciferase reporter assay, we found that transfection of cells with a plasmid encoding RPL4 cDNA increases recoding efficiency in a dose-dependent manner, with a maximal enhancement of nearly twofold. Expression of RPL4 increases recoding of reporters containing retroviral readthrough and frameshift sequences, as well as the Sindbis virus leaky termination signal. RPL4-induced enhancement of recoding is cell line specific and appears to be specific to RPL4 among ribosomal proteins. Cotransfection of RPL4 cDNA with Moloney murine leukemia proviral DNA results in Gag processing defects and a reduction of viral particle formation, presumably caused by the RPL4-dependent alteration of the Gag-to-Gag-Pol ratio required for virion assembly and release. PMID:22718819

  9. Structural classification of protein sequences based on signal processing and support vector machines.

    PubMed

    Chrysostomou, Charalambos; Seker, Huseyin

    2016-08-01

    The function of any protein depends directly on its secondary and tertiary structure. Proteins can fold into a three-dimensional shape, which is primarily dependent on the arrangement of amino acids in the primary structure. In recent years, with the explosive growth in protein sequencing, it has become infeasible to perform detailed experimental studies, as these methodologies are very expensive and time consuming. This leaves the structure of the majority of currently available protein sequences unknown. In this paper, a predictive model is therefore presented for the classification of the secondary structures of protein sequences, namely alpha helix and beta sheet. The proteins used throughout this study were collected from the Structural Classification of Proteins-extended (SCOPe) database, which contains manually curated information from proteins with known structure. Two sets of proteins are used for all-alpha and all-beta protein sequences. The first set comprises sequences with less than 40% identity, and the second set comprises proteins with less than 95% identity. The analysis shows a strong connection between the amino acid indices used to convert protein sequences to numerical sequences and proteins' secondary structures. The total classification accuracies of the proposed classifier for the protein sequences with less than 40% identity for amino acid indices BIOV880101 and BIOV880102 are 78.49% and 76.40%, respectively. The classification accuracies for sets of protein sequences with less than 95% identity for amino acid indices BIOV880101 and BIOV880102 are 88.01% and 85.17%, respectively.
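
    To make the conversion from protein sequence to numerical signal concrete, the sketch below encodes residues with a stand-in index and computes a discrete Fourier power spectrum. Both choices are assumptions: the binary hydrophobic/polar index is a hypothetical placeholder rather than BIOV880101 or BIOV880102, and the abstract does not name the signal processing transform that was applied.

```python
import cmath

# Hypothetical binary index used only for illustration (1.0 = hydrophobic).
HYDROPHOBIC = set("AVILMFWC")


def encode(sequence: str) -> list[float]:
    """Map each residue to a number using the stand-in index."""
    return [1.0 if res in HYDROPHOBIC else 0.0 for res in sequence.upper()]


def power_spectrum(signal: list[float]) -> list[float]:
    """Power at DFT frequencies k = 0 .. n//2 of the mean-centered signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# Strictly alternating hydrophobic/polar residues give a single peak at k = n/2.
spectrum = power_spectrum(encode("ADADADAD"))
print(spectrum.index(max(spectrum)))
```

    Spectral features of this kind can be passed to a classifier such as an SVM; a pattern alternating with period 2, as in the toy input, concentrates its power at the highest frequency.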

  10. Phenotypic comparisons of consensus variants versus laboratory resurrections of Precambrian proteins.

    PubMed

    Risso, Valeria A; Gavira, Jose A; Gaucher, Eric A; Sanchez-Ruiz, Jose M

    2014-06-01

    Consensus-sequence engineering has generated protein variants with enhanced stability, and sometimes, with modulated biological function. Consensus mutations are often interpreted as the introduction of ancestral amino acid residues. However, the precise relationship between consensus engineering and ancestral protein resurrection is not fully understood. Here, we report the properties of proteins encoded by consensus sequences derived from a multiple sequence alignment of extant, class A β-lactamases, as compared with the properties of ancient Precambrian β-lactamases resurrected in the laboratory. These comparisons considered primary sequence, secondary, and tertiary structure, as well as stability and catalysis against different antibiotics. Out of the three consensus variants generated, one could not be expressed and purified (likely due to misfolding and/or low stability) and only one displayed substantial stability having substrate promiscuity, although to a lower extent than ancient β-lactamases. These results: (i) highlight the phenotypic differences between consensus variants and laboratory resurrections of ancestral proteins; (ii) question interpretations of consensus proteins as phenotypic proxies of ancestral proteins; and (iii) support the notion that ancient proteins provide a robust approach toward the preparation of protein variants having large numbers of mutational changes while possessing unique biomolecular properties.
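
    For context, a consensus sequence is usually taken column by column from a multiple sequence alignment. The following is a minimal sketch of that idea using the most frequent residue per column. Skipping gaps, dropping all-gap columns and breaking ties alphabetically are assumptions made here; the abstract does not describe how the authors derived their consensus variants.

```python
from collections import Counter


def consensus(alignment: list[str], gap: str = "-") -> str:
    """Most frequent non-gap residue in each column of an alignment."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            continue  # column contains only gaps
        top = max(counts.values())
        # Break ties alphabetically so the output is deterministic.
        result.append(min(res for res, n in counts.items() if n == top))
    return "".join(result)


print(consensus(["MKTA", "MRTA", "MKSA"]))
```

    A consensus built this way reflects residue frequencies among extant sequences only, whereas ancestral reconstruction also uses a phylogeny, which is the distinction the abstract draws between the two approaches.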

  11. The amino acid sequence of protein CM-3 from Dendroaspis polylepis polylepis (black mamba) venom.

    PubMed

    Joubert, F J

    1985-01-01

    Protein CM-3 from Dendroaspis polylepis polylepis venom was purified by gel filtration and ion exchange chromatography. It comprises 65 amino acids including eight half-cystines. The complete amino acid sequence of protein CM-3 has been elucidated. The sequence (residues 1-50) resembles that of the N-terminal sequence of the subunits of a synergistic type protein and residues 51-65 that of the C-terminal sequence of an angusticeps type protein. Mixtures of protein CM-3 and angusticeps type proteins showed no apparent synergistic effect, in that their toxicity in combination was no greater than the sum of their individual toxicities.

  12. Direct Chloroplast Sequencing: Comparison of Sequencing Platforms and Analysis Tools for Whole Chloroplast Barcoding

    PubMed Central

    Brozynska, Marta; Furtado, Agnelo; Henry, Robert James

    2014-01-01

    Direct sequencing of total plant DNA using next generation sequencing technologies generates a whole chloroplast genome sequence that has the potential to provide a barcode for use in plant and food identification. Advances in DNA sequencing platforms may make this an attractive approach for routine plant identification. The HiSeq (Illumina) and Ion Torrent (Life Technology) sequencing platforms were used to sequence total DNA from rice to identify polymorphisms in the whole chloroplast genome sequence of a wild rice plant relative to cultivated rice (cv. Nipponbare). Consensus chloroplast sequences were produced by mapping sequence reads to the reference rice chloroplast genome or by de novo assembly and mapping of the resulting contigs to the reference sequence. A total of 122 polymorphisms (SNPs and indels) between the wild and cultivated rice chloroplasts were predicted by these different sequencing and analysis methods. Of these, a total of 102 polymorphisms including 90 SNPs were predicted by both platforms. Indels were more variable with different sequencing methods, with almost all discrepancies found in homopolymers. The Ion Torrent platform gave no apparent false SNP but was less reliable for indels. The methods should be suitable for routine barcoding using appropriate combinations of sequencing platform and data analysis. PMID:25329378

  13. Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison.

    PubMed

    Dai, Qi; Liu, Xiaoqing; Yao, Yuhua; Zhao, Fukun

    2011-05-07

    Sequence comparison is one of the major tasks in bioinformatics, which can be used to study structural and functional conservation, as well as evolutionary relations among the sequences. Numerous dissimilarity measures achieve promising results in sequence comparison, but challenges remain. This paper studied numerical characteristics of word frequencies and proposed a novel dissimilarity measure for sequence comparison. Instead of using the word frequencies directly, the proposed measure considers both the word frequencies and overlapping structures of words. To verify the effectiveness of the proposed measure, we tested it with two experiments and further compared it with alignment-based and alignment-free measures. The results demonstrate that the proposed measure extracting more information on the overlapping structures of the words improves the efficiency of sequence comparison.
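
    As a point of reference for the word-frequency idea, the following sketch computes overlapping k-word frequencies and a plain Euclidean dissimilarity between two sequences. This is the common baseline that uses word frequencies directly; it is not the measure proposed in the paper, whose treatment of overlapping word structures is not specified in the abstract. The function names and the default k=2 are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kword_frequencies(sequence, k):
    """Relative frequencies of all overlapping words of length k."""
    total = len(sequence) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def euclidean_dissimilarity(seq_a, seq_b, k=2):
    """Euclidean distance between the k-word frequency vectors.

    Baseline only: an illustrative stand-in, not the paper's measure.
    """
    freq_a = kword_frequencies(seq_a, k)
    freq_b = kword_frequencies(seq_b, k)
    words = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2
                    for w in words))


if __name__ == "__main__":
    print(euclidean_dissimilarity("ACGTACGT", "ACGTTTGT", k=2))
```

    Because the comparison needs only word counts, it runs in time linear in the sequence lengths and requires no alignment.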

  14. Design of Protein Multi-specificity Using an Independent Sequence Search Reduces the Barrier to Low Energy Sequences

    PubMed Central

    Sevy, Alexander M.; Jacobs, Tim M.; Crowe, James E.; Meiler, Jens

    2015-01-01

    Computational protein design has found great success in engineering proteins for thermodynamic stability, binding specificity, or enzymatic activity in a ‘single state’ design (SSD) paradigm. Multi-specificity design (MSD), on the other hand, involves considering the stability of multiple protein states simultaneously. We have developed a novel MSD algorithm, which we refer to as REstrained CONvergence in multi-specificity design (RECON). The algorithm allows each state to adopt its own sequence throughout the design process rather than enforcing a single sequence on all states. Convergence to a single sequence is encouraged through an incrementally increasing convergence restraint for corresponding positions. Compared to MSD algorithms that enforce (constrain) an identical sequence on all states the energy landscape is simplified, which accelerates the search drastically. As a result, RECON can readily be used in simulations with a flexible protein backbone. We have benchmarked RECON on two design tasks. First, we designed antibodies derived from a common germline gene against their diverse targets to assess recovery of the germline, polyspecific sequence. Second, we design “promiscuous”, polyspecific proteins against all binding partners and measure recovery of the native sequence. We show that RECON is able to efficiently recover native-like, biologically relevant sequences in this diverse set of protein complexes. PMID:26147100

  15. Design of Protein Multi-specificity Using an Independent Sequence Search Reduces the Barrier to Low Energy Sequences.

    PubMed

    Sevy, Alexander M; Jacobs, Tim M; Crowe, James E; Meiler, Jens

    2015-07-01

    Computational protein design has found great success in engineering proteins for thermodynamic stability, binding specificity, or enzymatic activity in a 'single state' design (SSD) paradigm. Multi-specificity design (MSD), on the other hand, involves considering the stability of multiple protein states simultaneously. We have developed a novel MSD algorithm, which we refer to as REstrained CONvergence in multi-specificity design (RECON). The algorithm allows each state to adopt its own sequence throughout the design process rather than enforcing a single sequence on all states. Convergence to a single sequence is encouraged through an incrementally increasing convergence restraint for corresponding positions. Compared to MSD algorithms that enforce (constrain) an identical sequence on all states the energy landscape is simplified, which accelerates the search drastically. As a result, RECON can readily be used in simulations with a flexible protein backbone. We have benchmarked RECON on two design tasks. First, we designed antibodies derived from a common germline gene against their diverse targets to assess recovery of the germline, polyspecific sequence. Second, we design "promiscuous", polyspecific proteins against all binding partners and measure recovery of the native sequence. We show that RECON is able to efficiently recover native-like, biologically relevant sequences in this diverse set of protein complexes.

  16. Integration of latex protein sequence data provides comprehensive functional overview of latex proteins.

    PubMed

    Cho, Won Kyong; Jo, Yeonhwa; Chu, Hyosub; Park, Sang-Ho; Kim, Kook-Hyung

    2014-03-01

    The laticiferous system is one of the most important conduit systems in higher plants and produces a milky sap known as latex. Latex contains diverse secondary metabolites with various ecological functions. To obtain a comprehensive overview of the latex proteome, we integrated available latex protein sequences and constructed a comprehensive dataset composed of 1,208 non-redundant latex proteins from 20 latex-bearing plants. The results of functional analyses revealed that latex proteins are involved in various biological processes, including transcription, translation, protein degradation and the plant response to environmental stimuli. The results of the comparative analysis showed that the functions of the latex proteins are similar to those of phloem, suggesting the functional conservation of plant vascular proteins. The presence of latex proteins in mitochondria and plastids suggests the production of diverse secondary metabolites. Furthermore, using a BLAST search, we identified 854 homologous latex proteins in eight plant species, including three latex-bearing plants (papaya, castor bean and cassava), suggesting that latex proteins were newly evolved in vascular plants. Taken together, this study is the largest and most comprehensive in silico analysis of the latex proteome. The results obtained here provide useful resources and information for characterizing the evolution of the latex proteome.

  17. Species-specific protein sequence and fold optimizations

    PubMed Central

    Dumontier, Michel; Michalickova, Katerina; Hogue, Christopher WV

    2002-01-01

    Background An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes. Results Environmental niche was determined to be a significant factor in variability from correspondence analysis using the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes. Additionally, we found clusters of phylogenetically unrelated archaea and bacteria that share similar environments by amino acid composition clustering. Composition analyses of conservative, domain-based homology modeling suggested an enrichment of small hydrophobic residues Ala, Gly, Val and charged residues Asp, Glu, His and Arg across all genomes. However, larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 ± 8% whereas the CG detected 73 ± 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available at . Conclusion Our analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events. PMID:12487631
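
    A composition-based log-odds scoring function of the general kind mentioned above can be sketched as follows. The abstract does not give the exact form of the CG and CF functions, so this sketch assumes the simplest version: each residue contributes the log ratio of its frequency in a target set to its frequency in a background set, with a pseudocount to avoid division by zero. All names and the pseudocount value are hypothetical.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences, pseudocount=1.0):
    """Smoothed amino acid frequencies over a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(r for r in seq.upper() if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log_odds_table(target_seqs, background_seqs):
    """Per-residue log-odds of target versus background composition."""
    p = composition(target_seqs)
    q = composition(background_seqs)
    return {aa: log(p[aa] / q[aa]) for aa in AMINO_ACIDS}


def score(sequence, table):
    """Sum of per-residue log-odds; unknown letters contribute zero."""
    return sum(table.get(r, 0.0) for r in sequence.upper())


if __name__ == "__main__":
    # Toy sequences only; real tables would be built from whole genomes.
    table = log_odds_table(["AAGAVA"], ["FWYFWY"])
    print(score("AGV", table), score("FWY", table))
```

    Under this assumed form, a sequence scores positively when its composition resembles the target set more than the background, which is the sense in which such a function could flag species-specific sequences.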

  18. Initial sequence of the chimpanzee genome and comparison with the human genome.

    PubMed

    2005-09-01

    Here we present a draft genome sequence of the common chimpanzee (Pan troglodytes). Through comparison with the human genome, we have generated a largely complete catalogue of the genetic differences that have accumulated since the human and chimpanzee species diverged from our common ancestor, constituting approximately thirty-five million single-nucleotide changes, five million insertion/deletion events, and various chromosomal rearrangements. We use this catalogue to explore the magnitude and regional variation of mutational forces shaping these two genomes, and the strength of positive and negative selection acting on their genes. In particular, we find that the patterns of evolution in human and chimpanzee protein-coding genes are highly correlated and dominated by the fixation of neutral and slightly deleterious alleles. We also use the chimpanzee genome as an outgroup to investigate human population genetics and identify signatures of selective sweeps in recent human evolution.

  19. Sequence-Specific Protein Aggregation Generates Defined Protein Knockdowns in Plants

    PubMed Central

    Vuylsteke, Marnik; Aesaert, Stijn; Rombaut, Debbie; De Smet, Frederik; Xu, Jie; Van Lijsebettens, Mieke; Rousseau, Frederic

    2016-01-01

    Protein aggregation is determined by short (5–15 amino acids) aggregation-prone regions (APRs) of the polypeptide sequence that self-associate in a specific manner to form β-structured inclusions. Here, we demonstrate that the sequence specificity of APRs can be exploited to selectively knock down proteins with different localization and function in plants. Synthetic aggregation-prone peptides derived from the APRs of either the negative regulators of the brassinosteroid (BR) signaling, the glycogen synthase kinase 3/Arabidopsis SHAGGY-like kinases (GSK3/ASKs), or the starch-degrading enzyme α-glucan water dikinase were designed. Stable expression of the APRs in Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) induced aggregation of the target proteins, giving rise to plants displaying constitutive BR responses and increased starch content, respectively. Overall, we show that the sequence specificity of APRs can be harnessed to generate aggregation-associated phenotypes in a targeted manner in different subcellular compartments. This study points toward the potential application of induced targeted aggregation as a useful tool to knock down protein functions in plants and, especially, to generate beneficial traits in crops. PMID:27208282

  20. Extracting features from protein sequences to improve deep extreme learning machine for protein fold recognition.

    PubMed

    Ibrahim, Wisam; Abadeh, Mohammad Saniee

    2017-03-27

    Protein fold recognition is an important problem in bioinformatics for predicting the three-dimensional structure of a protein. One of the most challenging tasks in protein fold recognition is the extraction of efficient features from amino-acid sequences to obtain better classifiers. In this paper, we propose six descriptors to extract features from protein sequences. These descriptors are applied in the first stage of a three-stage framework, PCA-DELM-LDA, to extract feature vectors from the amino-acid sequences. Principal Component Analysis (PCA) is used to reduce the number of extracted features. The extracted feature vectors are used together with the original features to improve the performance of the Deep Extreme Learning Machine (DELM) in the second stage. Four new features are extracted from the second stage and used in the third stage by Linear Discriminant Analysis (LDA) to classify the instances into 27 folds. The proposed framework is implemented on the independent and combined feature sets in SCOP datasets. The experimental results show that the feature vectors extracted in the first stage could improve the performance of DELM in extracting new useful features in the second stage.

  1. Filling-in void and sparse regions in protein sequence space by protein-like artificial sequences enables remarkable enhancement in remote homology detection capability.

    PubMed

    Mudgal, Richa; Sowdhamini, Ramanathan; Chandra, Nagasuma; Srinivasan, Narayanaswamy; Sandhya, Sankaran

    2014-02-20

    Protein functional annotation relies on the identification of accurate relationships, sequence divergence being a key factor. This is especially evident when distant protein relationships are demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach to purposefully bridge gaps between related protein families through directed design of protein-like "linker" sequences. For this, we represented SCOP domain families, integrated with sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and another one additionally containing designed sequences. Our results showed that augmented database searches showed up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the fold established the sequence continuum to demonstrate 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be "plugged-into" routine and generic sequence database searches to empower not only remote homology detection but also fold recognition. Our richly statistically supported findings show that complementary searches in both databases will increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold.

  2. Large scale comparison of non-human sequences in human sequencing data

    PubMed Central

    Tae, Hongseok; Karunasena, Enusha; Bavarva, Jasmin H.; McIver, Lauren J.; Garner, Harold R.

    2014-01-01

    Several studies have demonstrated that unmapped reads in next generation sequencing data could be used to identify infectious agents or structural variants, but there has been no intensive effort to analyze and classify all non-human sequences found in individual large data sets. To identify commonality in non-human sequences by infectious agents and putative contamination events, we analyzed non-human sequences in 150 genomic sequencing data files from the 1000 Genomes Project and observed that 0.13% of reads on average showed similarities to non-human genomes. We compared results among different sample groups divided based on ethnicities, sequencing centers and enrichment methods (whole genome sequencing vs. exome sequencing) and found that sequencing centers had specific signatures of contaminating genomes as ‘time stamps’. We also observed many unmapped reads that falsely indicated contamination because of the high similarity of human sequences to sequences in non-human genome assemblies such as mouse and Nicotiana. PMID:25173571

  3. Sequence analysis and expression of the M1 and M2 matrix protein genes of hirame rhabdovirus (HIRRV)

    USGS Publications Warehouse

    Nishizawa, T.; Kurath, G.; Winton, J.R.

    1997-01-01

    We have cloned and sequenced a 2318 nucleotide region of the genomic RNA of hirame rhabdovirus (HIRRV), an important viral pathogen of Japanese flounder Paralichthys olivaceus. This region comprises approximately two-thirds of the 3' end of the nucleocapsid protein (N) gene and the complete matrix protein (M1 and M2) genes with the associated intergenic regions. The partial N gene sequence was 812 nucleotides in length with an open reading frame (ORF) that encoded the carboxyl-terminal 250 amino acids of the N protein. The M1 and M2 genes were 771 and 700 nucleotides in length, respectively, with ORFs encoding proteins of 227 and 193 amino acids. The M1 gene sequence contained an additional small ORF that could encode a highly basic, arginine-rich protein of 25 amino acids. Comparisons of the N, M1, and M2 gene sequences of HIRRV with the corresponding sequences of the fish rhabdoviruses, infectious hematopoietic necrosis virus (IHNV) or viral hemorrhagic septicemia virus (VHSV) indicated that HIRRV was more closely related to IHNV than to VHSV, but was clearly distinct from either. The putative consensus gene termination sequence for IHNV and VHSV, AGAYAG(A)(7), was present in the N-M1, M1-M2, and M2-G intergenic regions of HIRRV as were the putative transcription initiation sequences YGGCAC and AACA. An Escherichia coli expression system was used to produce recombinant proteins from the M1 and M2 genes of HIRRV. These were the same size as the authentic M1 and M2 proteins and reacted with anti-HIRRV rabbit serum in western blots. These reagents can be used for further study of the fish immune response and to test novel control methods.
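
    The consensus termination sequence AGAYAG(A)(7) quoted above can be searched for with a short regular expression. In IUPAC nucleotide notation Y stands for a pyrimidine (C or T), and (A)(7) is read here as a run of seven A residues. The sketch assumes the sequence is written in the DNA alphabet; the function name and the example string are hypothetical.

```python
import re

# AGAYAG followed by seven A's; IUPAC "Y" expands to C or T.
TERMINATION_MOTIF = re.compile(r"AGA[CT]AGA{7}")


def find_termination_sites(sequence):
    """Return 0-based start positions of non-overlapping motif matches."""
    return [m.start() for m in TERMINATION_MOTIF.finditer(sequence.upper())]


if __name__ == "__main__":
    # Made-up sequence, not taken from the HIRRV genome.
    example = "GG" + "AGATAG" + "A" * 7 + "CCGGCAC"
    print(find_termination_sites(example))
```

    The same approach extends to the putative initiation sequence YGGCAC by writing it as [CT]GGCAC.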

  4. Patterns of sequence conservation in the S-Layer proteins and related sequences in Clostridium difficile.

    PubMed

    Calabi, Emanuela; Fairweather, Neil

    2002-07-01

    Clostridium difficile is the etiological agent of antibiotic-associated diarrhea. Among the factors that may play a role in infection are S-layer proteins (SLPs). Previous work has shown these to consist mainly of two components, resulting from the cleavage of a precursor encoded by the slpA gene. The high-molecular-weight (MW) subunit is related both to amidases from B. subtilis and to at least another 28 gene products in C. difficile strain 630. To gain insight into the functions of the SLPs and related proteins, we have further investigated the pattern of variability both at the slpA locus and at six nearby paralogs. Sequencing of the slpA gene from an S-layer group II strain and a variant S-layer group strain confirms a high degree of divergence in the low-MW SLP, which may result from diversifying selection. A highly conserved motif, however, is found at the C terminus in all low-MW subunits and may be essential for SlpA precursor cleavage. In strain 167, a variant cleavage product is present, suggesting a secondary processing site. Southern blotting analysis shows slpA-like open reading frames (ORFs) 2 to 7 to be conserved in all nine strains tested, with one exception: ORF2, which encodes a 66-kDa polypeptide coextracted at low pH with the main SLPs in strain 630, may be partially deleted in strain 167. Polymorphism within the slpA-ORF7 cluster may be more pronounced in the region proximal to the slpA gene. Unexpectedly, a high-MW subunit probe cross hybridizes to sequences outside the slpA locus, which appear to vary in number in different strains.

  5. A parallel computing approach to genetic sequence comparison: the master-worker paradigm with interworker communication.

    PubMed

    Sittig, D F; Foulser, D; Carriero, N; McCorkle, G; Miller, P L

    1991-04-01

    We have implemented a parallel version of a dynamic programming biological sequence comparison algorithm to study the potential applicability of using parallel computers for genetic sequence comparisons. Our parallel program is built using C-Linda, a machine-independent parallel programming language, and was tested on both a 10 CPU Sequent Symmetry and a 64 CPU Intel Hypercube. C-Linda implements a shared associative memory model, "tuple space," through which multiple processes can communicate and coordinate control. In our master-worker (MW) parallel implementation, a master process creates several worker processes, extracts a test sequence and multiple library sequences from a database and stores them in tuple space. Each worker reads the test sequence and then repeatedly extracts library strings from tuple space, performs pairwise sequence comparison using a local comparison algorithm to generate a similarity score, and returns the similarity scores to tuple space. The master collects the scores from tuple space and identifies the best match over all library sequences. We also implemented a method of global interworker communication to reduce the total search time by stopping those string comparisons that had no chance of improving on the current best match. Comparisons of the total run time, speedup, and efficiency were made for parallel and sequential versions of a basic MW implementation as well as versions with the global abort threshold.
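
    The master-worker scheme with a global abort threshold can be sketched in Python. This is not the authors' C-Linda program: a thread pool stands in for the worker processes, a lock-protected object stands in for tuple space, and the local comparison is a plain Smith-Waterman score with linear gap penalties. The scoring values, class name and function names are all assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MATCH, MISMATCH, GAP = 2, -1, -1  # illustrative scoring values


class BestScore:
    """Shared current-best score, standing in for the global abort threshold."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0
        self.name = None

    def offer(self, name, score):
        with self._lock:
            if score > self.value:
                self.value, self.name = score, name


def local_score(query, target, best):
    """Smith-Waterman score (linear gaps); None if the comparison is aborted."""
    prev = [0] * (len(target) + 1)
    top = 0
    for i, q in enumerate(query, start=1):
        cur = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (MATCH if q == t else MISMATCH)
            cur.append(max(0, diag, prev[j] + GAP, cur[j - 1] + GAP))
        row_max = max(cur)
        top = max(top, row_max)
        remaining = len(query) - i
        # No later cell can exceed the row maximum plus one match per
        # remaining query row, so stop if that bound cannot beat the best.
        if max(top, row_max + MATCH * remaining) <= best.value:
            return None
        prev = cur
    return top


def search(query, library, workers=4):
    """Master: hand library sequences to workers and collect the best match."""
    best = BestScore()

    def worker(item):
        name, target = item
        score = local_score(query, target, best)
        if score is not None:
            best.offer(name, score)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, library.items()))
    return best.name, best.value


if __name__ == "__main__":
    library = {"a": "TTTTACGTACGTTTTT", "b": "GGGGGGGG", "c": "ACGT"}
    print(search("ACGTACGT", library))
```

    Workers read the shared best score without taking the lock; because the value only increases, a stale read can only make pruning less aggressive, never incorrect.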

  6. Impaired nuclear import of mammalian Dlx4 proteins as a consequence of rapid sequence divergence

    SciTech Connect

    Coubrough, Melissa L.; Bendall, Andrew J.

    2006-11-15

    Dlx genes encode a developmentally important family of transcription factors with a variety of functions and sites of action during vertebrate embryogenesis. The murine Dlx4 gene is an enigmatic member of the family; little is known about the normal developmental function(s) of Dlx4. Here, we show that Dlx4 is expressed in the murine placenta and in a trophoblast cell line where the protein localizes to both the nucleus and cytoplasm. Despite the presence of several leucine/valine-rich motifs that match known nuclear export sequences, cytoplasmic Dlx4 is not due to CRM-1-mediated nuclear export. Rather, nuclear import of Dlx4 is compromised by specific residues that flank the nuclear localization signal. One of these residues represents a novel conserved feature of the Dlx4 protein in placental mammals, and the second represents novel variation within mouse Dlx4 isoforms. Comparison of orthologous protein sequences reveals a particularly high rate of non-synonymous change in the coding regions of mammalian Dlx4 genes. Since impaired nuclear localization is unlikely to enhance the function of a nuclear transcription factor, these data point to reduced selection pressure as the basis for the rapid divergence of the Dlx4 gene within the mammalian clade.

  7. Complete Genome Sequence of the Grouper Iridovirus and Comparison of Genomic Organization with Those of Other Iridoviruses

    PubMed Central

    Tsai, Chih-Tung; Ting, Jing-Wen; Wu, Ming-Hsien; Wu, Ming-Feng; Guo, Ing-Cherng; Chang, Chi-Yao

    2005-01-01

    The complete DNA sequence of grouper iridovirus (GIV) was determined using a whole-genome shotgun approach on virion DNA. The circular-form genome was 139,793 bp in length with a 49% G+C content. It contained 120 predicted open reading frames (ORFs) with coding capacities ranging from 62 to 1,268 amino acids. A total of 21% (25 of 120) of GIV ORFs are conserved in the other five sequenced iridovirus genomes, including DNA replication, transcription, nucleotide metabolism, protein modification, viral structure, and virus-host interaction genes. The whole-genome pairwise nucleotide comparison showed that GIV was partially colinear with previously sequenced ranaviruses (ATV and TFV). In addition, sequence analysis revealed that GIV possesses several unique features that distinguish it from other completely sequenced iridovirus genomes: (i) GIV is the first completely sequenced ranavirus-like virus that infects fish rather than amphibians, (ii) GIV is the only vertebrate iridovirus without CpG sequence methylation and lacking DNA methyltransferase, (iii) GIV contains a purine nucleoside phosphorylase gene which is not found in other iridoviruses or in any other viruses, and (iv) GIV contains 17 sets of repeat sequences, with basic unit sizes ranging from 9 to 63 bp, dispersed throughout the whole genome. These distinctive features of GIV further extend our understanding of the molecular events taking place between ranaviruses and their hosts and of iridovirus evolution. PMID:15681403

  8. Generic Comparison of Protein Inference Engines

    PubMed Central

    Claassen, Manfred; Reiter, Lukas; Hengartner, Michael O.; Buhmann, Joachim M.; Aebersold, Ruedi

    2012-01-01

    Protein identifications, instead of peptide-spectrum matches, constitute the biologically relevant result of shotgun proteomics studies. How to appropriately infer and report protein identifications has triggered a still ongoing debate. This debate has so far suffered from the lack of appropriate performance measures that allow us to objectively assess protein inference approaches. This study describes an intuitive, generic and yet formal performance measure and demonstrates how it enables experimentalists to select an optimal protein inference strategy for a given collection of fragment ion spectra. We applied the performance measure to systematically explore the benefit of excluding possibly unreliable protein identifications, such as single-hit wonders. Therefore, we defined a family of protein inference engines by extending a simple inference engine by thousands of pruning variants, each excluding a different specified set of possibly unreliable identifications. We benchmarked these protein inference engines on several data sets representing different proteomes and mass spectrometry platforms. Optimally performing inference engines retained all high confidence spectral evidence, without posterior exclusion of any type of protein identifications. Despite the diversity of studied data sets consistently supporting this rule, other data sets might behave differently. In order to ensure maximal reliable proteome coverage for data sets arising in other studies we advocate abstaining from rigid protein inference rules, such as exclusion of single-hit wonders, and instead consider several protein inference approaches and assess these with respect to the presented performance measure in the specific application context. PMID:22057310

  9. Generic comparison of protein inference engines.

    PubMed

    Claassen, Manfred; Reiter, Lukas; Hengartner, Michael O; Buhmann, Joachim M; Aebersold, Ruedi

    2012-04-01

    Protein identifications, instead of peptide-spectrum matches, constitute the biologically relevant result of shotgun proteomics studies. How to appropriately infer and report protein identifications has triggered a still ongoing debate. This debate has so far suffered from the lack of appropriate performance measures that allow us to objectively assess protein inference approaches. This study describes an intuitive, generic and yet formal performance measure and demonstrates how it enables experimentalists to select an optimal protein inference strategy for a given collection of fragment ion spectra. We applied the performance measure to systematically explore the benefit of excluding possibly unreliable protein identifications, such as single-hit wonders. Therefore, we defined a family of protein inference engines by extending a simple inference engine by thousands of pruning variants, each excluding a different specified set of possibly unreliable identifications. We benchmarked these protein inference engines on several data sets representing different proteomes and mass spectrometry platforms. Optimally performing inference engines retained all high confidence spectral evidence, without posterior exclusion of any type of protein identifications. Despite the diversity of studied data sets consistently supporting this rule, other data sets might behave differently. In order to ensure maximal reliable proteome coverage for data sets arising in other studies we advocate abstaining from rigid protein inference rules, such as exclusion of single-hit wonders, and instead consider several protein inference approaches and assess these with respect to the presented performance measure in the specific application context.

  10. The Bioinformatics Report of Mutation Outcome on NADPH Flavin Oxidoreductase Protein Sequence in Clinical Isolates of H. pylori.

    PubMed

    Mirzaei, Nasrin; Poursina, Farkhondeh; Moghim, Sharareh; Ghaempanah, Abdol Majid; Safaei, Hajieh Ghasemian

    2016-05-01

    The frxA gene has been implicated in metronidazole nitroreduction by H. pylori. Alternatively, frxA is expected to contribute to the protection of urease and to the in vivo survival of H. pylori. The aim of the present study is to report the effects of mutations on the FrxA protein sequence in clinical isolates of H. pylori in our community. Metronidazole resistance was confirmed in 27 of 48 isolates. The glmM and frxA genes were used for molecular confirmation of H. pylori isolates. A primer set covering the whole frxA gene sequence was used to assess the effect of mutations on the protein sequence. DNA and protein sequences were evaluated and analyzed with the BLAST, Clustal Omega, and T-COFFEE programs. Then, FrxA protein sequences from six metronidazole-resistant clinical isolates were analyzed with web-based bioinformatics tools. Comparison of the six metronidazole-resistant clinical isolates with strain 26695 showed ten missense mutations. Analysis with the STRING program revealed no change after the alterations in these sequences. According to consensus data from four methods, residue substitutions at 40, 13, and 141 increase the stability of the protein after mutation, while the other alterations decrease it. Residue substitutions at 40, 43, 141, 138, 169, and 179 are deleterious, while the V7I, Q10R, V34I, and V96I alterations are neutral. Because FrxA contributes to the survival of the bacterium, and given the effect of mutations on protein function, these mutations might affect bacterial survival and phenotype, and this needs further study. Also, none of the stability prediction tools is perfect; iStable was the best predictor among the methods used.

  11. Snake venom. The amino acid sequence of protein A from Dendroaspis polylepis polylepis (black mamba) venom.

    PubMed

    Joubert, F J; Strydom, D J

    1980-12-01

    Protein A from Dendroaspis polylepis polylepis venom comprises 81 amino acids, including ten half-cystine residues. The complete primary structures of protein A and its variant A' were elucidated. The sequences of proteins A and A', which differ in a single position, show no homology with various neurotoxins and non-neurotoxic proteins and represent a new type of elapid venom protein.

  12. X-ray sequence and crystal structure of luffaculin 1, a novel type 1 ribosome-inactivating protein

    PubMed Central

    Hou, Xiaomin; Chen, Minghuang; Chen, Liqing; Meehan, Edward J; Xie, Jieming; Huang, Mingdong

    2007-01-01

    Background Protein sequence can be obtained through Edman degradation, mass spectrometry, or cDNA sequencing. High resolution X-ray crystallography can also be used to derive protein sequence information, but faces difficulty in distinguishing the Asp/Asn, Glu/Gln, and Val/Thr pairs. Luffaculin 1 is a new type 1 ribosome-inactivating protein (RIP) isolated from the seeds of Luffa acutangula. Besides rRNA N-glycosidase activity, luffaculin 1 also demonstrates activities including inhibiting tumor cells' proliferation and inducing tumor cells' differentiation. Results The crystal structure of luffaculin 1 was determined at 1.4 Å resolution. Its amino-acid sequence was derived from this high resolution structure using the following criteria: 1) high resolution electron density; 2) comparison of electron density between two molecules that exist in the same crystal; 3) evaluation of the chemical environment of residues to break down the sequence assignment ambiguity in residue pairs Glu/Gln, Asp/Asn, and Val/Thr; 4) comparison with sequences of the homologous proteins. Using criteria 1 and 2, 66% of the residues can be assigned. By incorporating criterion 3, 86% of the residues were assigned, suggesting the effectiveness of chemical environment evaluation in breaking down residue ambiguity. In total, 94% of the luffaculin 1 sequence was assigned with high confidence using this improved X-ray sequencing strategy. Two N-acetylglucosamine moieties, linked respectively to the residues Asn77 and Asn84, can be identified in the structure. Residues Tyr70, Tyr110, Glu159 and Arg162 define the active site of luffaculin 1 as an RNA N-glycosidase. Conclusion The X-ray sequencing method can be effective for deriving protein sequence information. The evaluation of the chemical environment of residues is a useful method to break down the assignment ambiguity in Glu/Gln, Asp/Asn, and Val/Thr pairs. The sequence and the crystal structure confirm that luffaculin 1 is a new

  13. KISSa: a strategy to build multiple sequence alignments from pairwise comparisons of very closely related sequences.

    PubMed

    Marass, Francesco; Upton, Chris

    2009-05-20

    The volume of viral genomic sequence data continues to increase rapidly. This is especially true for the smaller RNA viruses, which are relatively easy to sequence in large numbers. The data volumes cause a number of significant problems for research applications that require large multiple alignments of essentially complete genomes, which are of the order of 10 kb. We present a simple strategy to enable the creation of large quasi-multiple sequence alignments from pairwise alignment data. This process is suitable for large, closely related sequences such as the polyproteins of dengue viruses, which need the insertion of very few indels. The quasi-multiple sequence alignments generated by KISSa are sufficiently accurate to support tree-based genome selection for interactive bioinformatics analysis tools. The speed of this process is critical to providing an interactive experience for the user.

  14. The SBASE protein domain library, release 2.0: a collection of annotated protein sequence segments.

    PubMed Central

    Pongor, S; Skerl, V; Cserzö, M; Hátsági, Z; Simon, G; Bevilacqua, V

    1993-01-01

    SBASE 2.0 is the second release of SBASE, a collection of annotated protein domain sequences. SBASE entries represent various structural, functional, ligand-binding and topogenic segments of proteins [Pongor, S. et al. (1993) Prot. Eng., in press]. This release contains 34,518 entries provided with standardized names and it is cross-referenced to the major protein and nucleic acid databanks as well as to the PROSITE catalog of protein sequence patterns [Bairoch, A. (1992) Nucl. Acids Res., 20 suppl, 2013-2018]. SBASE can be used for establishing domain homologies using different database-search tools such as FASTA [Lipman and Pearson (1985) Science, 227, 1436-1441], FASTDB [Brutlag et al. (1990) Comp. Appl. Biosci., 6, 237-245] or BLAST3 [Altschul and Lipman (1990) Proc. Natl. Acad. Sci. USA, 87, 5509-5513], which is especially useful in the case of loosely defined domain types for which efficient consensus patterns cannot be established. SBASE 2.0 and a set of search and retrieval tools are freely available on request to the authors or by anonymous 'ftp' file transfer from ftp.icgeb.trieste.it. PMID:8332532

  15. Comparison of simple sequence repeats in 19 Archaea.

    PubMed

    Trivedi, S

    2006-12-05

    All organisms that have been studied until now have been found to have differential distribution of simple sequence repeats (SSRs), with more SSRs in intergenic than in coding sequences. SSR distribution was investigated in Archaea genomes where complete chromosome sequences of 19 Archaea were analyzed with the program SPUTNIK to find di- to penta-nucleotide repeats. The number of repeats was determined for the complete chromosome sequences and for the coding and non-coding sequences. Different from what has been found for other groups of organisms, there is an abundance of SSRs in coding regions of the genome of some Archaea. Dinucleotide repeats were rare and CG repeats were found in only two Archaea. In general, trinucleotide repeats are the most abundant SSR motifs; however, pentanucleotide repeats are abundant in some Archaea. Some of the tetranucleotide and pentanucleotide repeat motifs are organism specific. In general, repeats are short and CG-rich repeats are present in Archaea having a CG-rich genome. Among the 19 Archaea, SSR density was not correlated with genome size or with optimum growth temperature. Pentanucleotide density had an inverse correlation with the CG content of the genome.
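
    The kind of scan described above (di- to penta-nucleotide repeats) can be sketched in a few lines. This is an illustrative sketch only: it finds perfect tandem repeats with a regular expression, whereas SPUTNIK uses its own scoring scheme, and the function name and the `min_repeats` cutoff are assumptions made here.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=5, min_repeats=3):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    start is 0-based; min_repeats is a hypothetical cutoff, not SPUTNIK's.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_unit, max_unit + 1):
        # A k-mer followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are repeats of a shorter motif (e.g. "ATAT" or "AA").
            if any(unit == unit[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)
```

    Applying the function separately to coding and non-coding regions, then dividing hit counts by region length, would give the kind of density comparison the abstract reports.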

  16. Sequence analysis and phylogenetic study of some toxin proteins of snakes and related non-toxin proteins of chordates.

    PubMed

    Panda, Subhamay; Chandra, Goutam

    2013-01-01

    Snakes are equipped with a venom armory to tackle different prey and predators in an adverse natural world. Snake venom is a mix of biologically active proteins and polypeptides. Among its components, snake venom cytotoxins and short neurotoxins are non-enzymatic polypeptides within the venom. These two components structurally resemble the scaffold specific to the three-finger protein superfamily. Different non-toxin members of the three-finger protein superfamily are involved in different biological roles. In the present study we analyzed the snake venom cytotoxins, short neurotoxins and related non-toxin proteins of different chordates in terms of amino acid sequence level diversification profile, polarity profile of amino acid sequences, conserved pattern of amino acids and phylogenetic relationship of these toxin and non-toxin protein sequences. Sequence alignment analysis demonstrates the polarity specific molecular enrichment strategy for better system adaptivity. Amino acid substitutions are frequent in toxin sequences and less frequent in non-toxin body proteins. With the help of conserved residues these proteins maintain the three-finger protein scaffold. Due to system specific adaptation, toxin and non-toxin proteins exhibit varied amino acid residue distributions along the sequence. Understanding of the natural invention scheme (recruitment of venom proteins from normal body proteins) may help us to develop futuristic engineered bio-molecules with remedial properties.

  17. Poliovirus replication proteins: RNA sequence encoding P3-1b and the sites of proteolytic processing

    SciTech Connect

    Semler, B.L.; Anderson, C.W.; Kitamura, N.; Rothberg, P.G.; Wishart, W.L.; Wimmer, E.

    1981-06-01

    A partial amino-terminal amino acid sequence of each of the major proteins encoded by the replicase region of the poliovirus genome has been determined. A comparison of this sequence information with the amino acid sequence predicted from the RNA sequence that has been determined for the 3' region of the poliovirus genome has allowed us to locate precisely the proteolytic cleavage sites at which the initial polyprotein is processed to create the poliovirus products P3-1b (NCVP1b), P3-2 (NCVP2), P3-4b (NCVP4b), and P3-7c (NCVP7c). For each of these products, as well as for the small genome-linked protein VPg, proteolytic cleavage occurs between a glutamine and a glycine residue to create the amino terminus of each protein. This result suggests that a single proteinase may be responsible for all of these cleavages. The sequence data also allow the precise positioning of the genome-linked protein VPg within the precursor P3-1b just proximal to the amino terminus of polypeptide P3-2.
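
    The cleavage rule reported here (processing between a glutamine and a glycine) suggests a simple scan for candidate sites. The sketch below is an illustration under stated assumptions: it lists every Q-G pair in a one-letter protein string, although not every Q-G pair in a real polyprotein is necessarily cleaved, and the function names and the toy sequence used to exercise them are hypothetical.

```python
def qg_sites(protein):
    """Return 1-based positions of each glutamine (Q) directly followed by glycine (G)."""
    protein = protein.upper()
    return [i + 1 for i in range(len(protein) - 1) if protein[i:i + 2] == "QG"]

def cleave(protein, sites):
    """Split a protein after each listed 1-based position, so each product starts with G."""
    pieces, start = [], 0
    for pos in sorted(sites):
        pieces.append(protein[start:pos])
        start = pos
    pieces.append(protein[start:])
    return pieces
```

    Comparing the predicted amino termini of the pieces with experimentally determined amino-terminal sequences is the step that, in the study, pins down which candidate sites are actually used.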

  18. Rapid identification of sequences for orphan enzymes to power accurate protein annotation.

    PubMed

    Ramkissoon, Kevin R; Miller, Jennifer K; Ojha, Sunil; Watson, Douglas S; Bomar, Martha G; Galande, Amit K; Shearer, Alexander G

    2013-01-01

    The power of genome sequencing depends on the ability to understand what those genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.

  19. Rapid Identification of Sequences for Orphan Enzymes to Power Accurate Protein Annotation

    PubMed Central

    Ojha, Sunil; Watson, Douglas S.; Bomar, Martha G.; Galande, Amit K.; Shearer, Alexander G.

    2013-01-01

    The power of genome sequencing depends on the ability to understand what those genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the “back catalog” of enzymology – “orphan enzymes,” those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme “back catalog” is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology’s “back catalog” another powerful tool to drive accurate genome annotation. PMID:24386392

  20. Relationship between sequence conservation and three-dimensional structure in a large family of esterases, lipases, and related proteins.

    PubMed Central

    Cygler, M.; Schrag, J. D.; Sussman, J. L.; Harel, M.; Silman, I.; Gentry, M. K.; Doctor, B. P.

    1993-01-01

    Based on the recently determined X-ray structures of Torpedo californica acetylcholinesterase and Geotrichum candidum lipase and on their three-dimensional superposition, an improved alignment of a collection of 32 related amino acid sequences of other esterases, lipases, and related proteins was obtained. On the basis of this alignment, 24 residues are found to be invariant in 29 sequences of hydrolytic enzymes, and an additional 49 are well conserved. The conservation in the three remaining sequences is somewhat lower. The conserved residues include the active site, disulfide bridges, salt bridges, and residues in the core of the proteins. Most invariant residues are located at the edges of secondary structural elements. A clear structural basis for the preservation of many of these residues can be determined from comparison of the two X-ray structures. PMID:8453375

  1. MIPS: a database for protein sequences, homology data and yeast genome information.

    PubMed Central

    Mewes, H W; Albermann, K; Heumann, K; Liebl, S; Pfeiffer, F

    1997-01-01

    The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database. MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/) MIPS permits internet access to sequence databases, homology data and to yeast genome information. (i) Sequence similarity results from the FASTA program are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome, the functional classification of yeast genes (FunCat) and its graphical display, the 'Genome Browser'. A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request. PMID:9016498

  2. Malakite: an automatic tool for characterisation of structure of reliable blocks in multiple alignments of protein sequences.

    PubMed

    Burkov, Boris; Nagaev, Boris; Spirin, Sergei; Alexeevski, Andrei

    2010-06-01

    It makes sense to speak of alignment of protein sequences only within the regions where the sequences are related to each other. This simple consideration is often disregarded by programs for multiple alignment construction. A package for alignment analysis, MAlAKiTE (Multiple Alignment Automatic Kinship Tiling Engine), is introduced. It aims to find the blocks of reliable alignment, which contain related regions only, within the whole alignment and allows for dealing with them. The validity of the detection of reliable blocks was verified by comparison with structural data.

  3. Revision of Begomovirus taxonomy based on pairwise sequence comparisons.

    PubMed

    Brown, Judith K; Zerbini, F Murilo; Navas-Castillo, Jesús; Moriones, Enrique; Ramos-Sobrinho, Roberto; Silva, José C F; Fiallo-Olivé, Elvira; Briddon, Rob W; Hernández-Zepeda, Cecilia; Idris, Ali; Malathi, V G; Martin, Darren P; Rivera-Bustamante, Rafael; Ueda, Shigenori; Varsani, Arvind

    2015-06-01

    Viruses of the genus Begomovirus (family Geminiviridae) are emergent pathogens of crops throughout the tropical and subtropical regions of the world. By virtue of having a small DNA genome that is easily cloned, and due to the recent innovations in cloning and low-cost sequencing, there has been a dramatic increase in the number of available begomovirus genome sequences. Even so, most of the available sequences have been obtained from cultivated plants and are likely a small and phylogenetically unrepresentative sample of begomovirus diversity, a factor constraining taxonomic decisions such as the establishment of operationally useful species demarcation criteria. In addition, problems in assigning new viruses to established species have highlighted shortcomings in the previously recommended mechanism of species demarcation. Based on the analysis of 3,123 full-length begomovirus genome (or DNA-A component) sequences available in public databases as of December 2012, a set of revised guidelines for the classification and nomenclature of begomoviruses are proposed. The guidelines primarily consider a) genus-level biological characteristics and b) results obtained using a standardized classification tool, Sequence Demarcation Tool, which performs pairwise sequence alignments and identity calculations. These guidelines are consistent with the recently published recommendations for the genera Mastrevirus and Curtovirus of the family Geminiviridae. Genome-wide pairwise identities of 91 % and 94 % are proposed as the demarcation threshold for begomoviruses belonging to different species and strains, respectively. Procedures and guidelines are outlined for resolving conflicts that may arise when assigning species and strains to categories wherever the pairwise identity falls on or very near the demarcation threshold value.
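
    The two thresholds proposed above lend themselves to a small worked example. The sketch below is a minimal illustration, not the Sequence Demarcation Tool itself: it assumes identity is counted over alignment columns where neither sequence has a gap, and the function names and the `margin` used to flag borderline values are hypothetical.

```python
def pairwise_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither aligned sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped columns are ignored (an assumption of this sketch)
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared

def classify(identity, species_cutoff=91.0, strain_cutoff=94.0, margin=0.5):
    """Return (label, borderline) using the 91 % and 94 % demarcation thresholds."""
    if identity < species_cutoff:
        label = "different species"
    elif identity < strain_cutoff:
        label = "same species, different strain"
    else:
        label = "same strain"
    # Flag values on or very near a threshold; margin is a hypothetical tolerance.
    distance = min(abs(identity - species_cutoff), abs(identity - strain_cutoff))
    return label, distance <= margin
```

    A borderline flag of `True` marks the cases for which the guidelines describe separate conflict-resolution procedures; the sketch does not attempt to reproduce those.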

  4. Accuracy Estimation and Parameter Advising for Protein Multiple Sequence Alignment

    PubMed Central

    DeBlasio, Dan

    2013-01-01

    Abstract We develop a novel and general approach to estimating the accuracy of multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new task that we call parameter advising: the problem of choosing values for alignment scoring function parameters from a given set of choices to maximize the accuracy of a computed alignment. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. Compared to prior approaches for estimating accuracy, our new approach (a) introduces novel feature functions that measure nonlocal properties of an alignment yet are fast to evaluate, (b) considers more general classes of estimators beyond linear combinations of features, and (c) develops new regression formulations for learning an estimator from examples; in addition, for parameter advising, we (d) determine the optimal parameter set of a given cardinality, which specifies the best parameter values from which to choose. Our estimator, which we call Facet (for “feature-based accuracy estimator”), yields a parameter advisor that on the hardest benchmarks provides more than a 27% improvement in accuracy over the best default parameter choice, and for parameter advising significantly outperforms the best prior approaches to assessing alignment quality. PMID:23489379
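
    Parameter advising as described above reduces to scoring each candidate alignment with an estimator and keeping the best-scoring parameter choice. The sketch below assumes the simplest polynomial estimator, a linear combination of feature values; the function names, feature values and coefficients are hypothetical and are not Facet's actual features or learned weights.

```python
from typing import Dict, Sequence, Tuple

def estimate_accuracy(features: Sequence[float],
                      coefficients: Sequence[float],
                      intercept: float = 0.0) -> float:
    """Linear (degree-1 polynomial) accuracy estimate from alignment feature values."""
    if len(features) != len(coefficients):
        raise ValueError("one coefficient is required per feature")
    return intercept + sum(c * f for c, f in zip(coefficients, features))

def advise(candidates: Dict[str, Sequence[float]],
           coefficients: Sequence[float],
           intercept: float = 0.0) -> Tuple[str, float]:
    """Pick the parameter choice whose alignment has the highest estimated accuracy.

    candidates maps a parameter-choice name to the feature vector of the
    alignment computed under that choice.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    scored = {name: estimate_accuracy(feats, coefficients, intercept)
              for name, feats in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]
```

    In this framing the advisor never sees a reference alignment: its quality depends entirely on how well the estimator ranks candidates, which is why the abstract emphasizes how the coefficients are learned.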

  5. Phylogenetic relationships of Cryptosporidium determined by ribosomal RNA sequence comparison.

    PubMed

    Johnson, A M; Fielke, R; Lumb, R; Baverstock, P R

    1990-04-01

    Reverse transcription of total cellular RNA was used to obtain a partial sequence of the small subunit ribosomal RNA of Cryptosporidium, a protist currently placed in the phylum Apicomplexa. The semi-conserved regions were aligned with homologous sequences in a range of other eukaryotes, and the evolutionary relationships of Cryptosporidium were determined by two different methods of phylogenetic analysis. The prokaryotes Escherichia coli and Halobacterium cuti were included as outgroups. The results do not show an especially close relationship of Cryptosporidium to other members of the phylum Apicomplexa.

  6. Evolution of EF-hand calcium-modulated proteins. III. Exon sequences confirm most dendrograms based on protein sequences: calmodulin dendrograms show significant lack of parallelism

    NASA Technical Reports Server (NTRS)

    Nakayama, S.; Kretsinger, R. H.

    1993-01-01

    In the first report in this series we presented dendrograms based on 152 individual proteins of the EF-hand family. In the second we used sequences from 228 proteins, containing 835 domains, and showed that eight of the 29 subfamilies are congruent and that the EF-hand domains of the remaining 21 subfamilies have diverse evolutionary histories. In this study we have computed dendrograms within and among the EF-hand subfamilies using the encoding DNA sequences. In most instances the dendrograms based on protein and on DNA sequences are very similar. Significant differences between protein and DNA trees for calmodulin remain unexplained. In our fourth report we evaluate the sequences and the distribution of introns within the EF-hand family and conclude that exon shuffling did not play a significant role in its evolution.

  8. The evolution of proteins from random amino acid sequences: II. Evidence from the statistical distributions of the lengths of modern protein sequences.

    PubMed

    White, S H

    1994-04-01

    This paper continues an examination of the hypothesis that modern proteins evolved from random heteropeptide sequences. In support of the hypothesis, White and Jacobs (1993, J Mol Evol 36:79-95) have shown that any sequence chosen randomly from a large collection of nonhomologous proteins has a 90% or better chance of having a lengthwise distribution of amino acids that is indistinguishable from the random expectation regardless of amino acid type. The goal of the present study was to investigate the possibility that the random-origin hypothesis could explain the lengths of modern protein sequences without invoking specific mechanisms such as gene duplication or exon splicing. The sets of sequences examined were taken from the 1989 PIR database and consisted of 1,792 "super-family" proteins selected to have little sequence identity, 623 E. coli sequences, and 398 human sequences. The length distributions of the proteins could be described with high significance by either of two closely related probability density functions: The gamma distribution with parameter 2 or the distribution for the sum of two exponential random independent variables. A simple theory for the distributions was developed which assumes that (1) protoprotein sequences had exponentially distributed random independent lengths, (2) the length dependence of protein stability determined which of these protoproteins could fold into compact primitive proteins and thereby attain the potential for biochemical activity, (3) the useful protein sequences were preserved by the primitive genome, and (4) the resulting distribution of sequence lengths is reflected by modern proteins. The theory successfully predicts the two observed distributions which can be distinguished by the functional form of the dependence of protein stability on length. The theory leads to three interesting conclusions. First, it predicts that a tetra-nucleotide was the signal for primitive translation termination. 
This prediction is

  9. A COMPARISON OF FIXED SEQUENCE AND OPTIONAL BRANCHING AUTOINSTRUCTIONAL METHODS.

    ERIC Educational Resources Information Center

    MELARAGNO, RALPH J.; AND OTHERS

    HYPOTHESES RELATED TO PROCEDURES PERMITTING STUDENTS TO BRANCH AT THEIR OWN OPTION WERE TESTED. THE FIRST HYPOTHESIS WAS THAT A FIXED-SEQUENCE PROGRAM WOULD BE LESS EFFECTIVE THAN THE SAME ITEMS CAST AS STATEMENTS IN TEXTBOOK FORMAT THROUGH WHICH THE STUDENT COULD SKIP AT HIS OWN OPTION. THE SECOND HYPOTHESIS WAS THAT PERFORMANCE ON A PROGRAM…

  10. A Guaranteed Similarity Metric Learning Framework for Biological Sequence Comparison.

    PubMed

    Hua, Keru; Yu, Qin; Zhang, Ruiming

    2016-01-01

    Similarity of sequences is a key mathematical notion for classification and phylogenetic studies in biology. The distance and similarity between two sequences are very important and widely studied. During the last decades, similarity (distance) metric learning has been one of the hottest topics in machine learning and data mining, as well as in their applications in the bioinformatics field. It is feasible to introduce machine learning technology to learn a similarity metric from biological data. In this paper, we propose a novel framework of guaranteed similarity metric learning (GMSL) to perform alignment of biological sequences in any feature vector space. It introduces the (ϵ, γ, τ)-goodness similarity theory to Mahalanobis metric learning. As a theoretically guaranteed similarity metric learning approach, GMSL guarantees that the learned similarity function performs well in classification and clustering. Our experiments on the most used datasets demonstrate that our approach outperforms the state-of-the-art biological sequence alignment methods and other similarity metric learning algorithms in both accuracy and stability.

  11. Weighting in sequence space: A comparison of methods in terms of generalized sequences

    SciTech Connect

    Vingron, M.; Sibbald, P.R.

    1993-10-01

    Four methods for weighting aligned biological sequences have recently appeared that differ mathematically, philosophically, and in their results. Thus, while there is consensus about the need to weight sequences, the method to use is contentious. A geometric analysis based on a continuous sequence space is presented that provides a common framework in which to compare the methods. It is concluded that there are two 'best' methods. When the sequences are known to be phylogenetically related and a tree can be generated without introducing excessive stress into the data, the method of Altschul et al. [Altschul, S.F., Carroll, R.J. & Lipman, D.J. (1989) J. Mol. Biol. 207, 647-653] is appropriate. When the sequences are not known to be phylogenetically related or a tree cannot be produced without unduly distorting the distances between the sequences, a modification of the method of Sibbald and Argos [Sibbald, P.R. & Argos, P. (1990) J. Mol. Biol. 216, 813-818] is preferable. 29 refs., 3 figs., 2 tabs.

  12. β-Glucosidase coding sequences and protein from Orpinomyces PC-2

    DOEpatents

    Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong; Ximenes, Eduardo A.

    2001-02-06

    Provided is a novel β-glucosidase from Orpinomyces sp. PC2, nucleotide sequences encoding the mature protein and the precursor protein, and methods for recombinant production of this β-glucosidase.

  13. Optimization of Mutation Pressure in Relation to Properties of Protein-Coding Sequences in Bacterial Genomes

    PubMed Central

    Błażej, Paweł; Miasojedow, Błażej; Grabińska, Małgorzata; Mackiewicz, Paweł

    2015-01-01

    Most mutations are deleterious and require energetically costly repairs. Therefore, it seems that any minimization of mutation rate is beneficial. On the other hand, mutations generate genetic diversity indispensable for evolution and adaptation of organisms to changing environmental conditions. Thus, it is expected that a spontaneous mutational pressure should be an optimal compromise between these two extremes. In order to study the optimization of the pressure, we compared mutational transition probability matrices from bacterial genomes with artificial matrices fulfilling the same general features as the real ones, e.g., the stationary distribution and the speed of convergence to the stationarity. The artificial matrices were optimized on real protein-coding sequences based on Evolutionary Strategies approach to minimize or maximize the probability of non-synonymous substitutions and costs of amino acid replacements depending on their physicochemical properties. The results show that the empirical matrices have a tendency to minimize the effects of mutations rather than maximize their costs on the amino acid level. They were also similar to the optimized artificial matrices in the nucleotide substitution pattern, especially the high transitions/transversions ratio. We observed no substantial differences between the effects of mutational matrices on protein-coding sequences in genomes under study in respect of differently replicated DNA strands, mutational cost types and properties of the referenced artificial matrices. The findings indicate that the empirical mutational matrices are rather adapted to minimize mutational costs in the studied organisms in comparison to other matrices with similar mathematical constraints. PMID:26121655
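
    The transitions/transversions ratio mentioned above can be computed directly from a nucleotide substitution matrix. The sketch below is an unweighted illustration: it sums raw off-diagonal rates and ignores the stationary nucleotide distribution, and the dictionary-based input format and function name are assumptions made here rather than the representation used in the study.

```python
PURINES = {"A", "G"}

def ts_tv_ratio(rates):
    """Ratio of summed transition rates to summed transversion rates.

    rates: dict mapping (source, target) nucleotide pairs to substitution rates
    (hypothetical input format); diagonal entries are ignored.
    """
    transitions = transversions = 0.0
    for (src, dst), rate in rates.items():
        if src == dst:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (src in PURINES) == (dst in PURINES):
            transitions += rate
        else:
            transversions += rate
    if transversions == 0:
        raise ValueError("transversion rate sum is zero")
    return transitions / transversions
```

    With equal rates for all twelve substitutions the ratio is 0.5, since there are four possible transitions and eight possible transversions; values well above that indicate a transition bias.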

  14. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins

    PubMed Central

    Pruitt, Kim D.; Tatusova, Tatiana; Maglott, Donna R.

    2007-01-01

    NCBI's reference sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) is a curated non-redundant collection of sequences representing genomes, transcripts and proteins. The database includes 3774 organisms spanning prokaryotes, eukaryotes and viruses, and has records for 2 879 860 proteins (RefSeq release 19). RefSeq records integrate information from multiple sources, when additional data are available from those sources and therefore represent a current description of the sequence and its features. Annotations include coding regions, conserved domains, tRNAs, sequence tagged sites (STS), variation, references, gene and protein product names, and database cross-references. Sequence is reviewed and features are added using a combined approach of collaboration and other input from the scientific community, prediction, propagation from GenBank and curation by NCBI staff. The format of all RefSeq records is validated, and an increasing number of tests are being applied to evaluate the quality of sequence and annotation, especially in the context of complete genomic sequence. PMID:17130148

  15. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

    PubMed

    Pruitt, Kim D; Tatusova, Tatiana; Maglott, Donna R

    2007-01-01

    NCBI's reference sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) is a curated non-redundant collection of sequences representing genomes, transcripts and proteins. The database includes 3774 organisms spanning prokaryotes, eukaryotes and viruses, and has records for 2,879,860 proteins (RefSeq release 19). RefSeq records integrate information from multiple sources, when additional data are available from those sources and therefore represent a current description of the sequence and its features. Annotations include coding regions, conserved domains, tRNAs, sequence tagged sites (STS), variation, references, gene and protein product names, and database cross-references. Sequence is reviewed and features are added using a combined approach of collaboration and other input from the scientific community, prediction, propagation from GenBank and curation by NCBI staff. The format of all RefSeq records is validated, and an increasing number of tests are being applied to evaluate the quality of sequence and annotation, especially in the context of complete genomic sequence.

  16. The Chlamydophila abortus genome sequence reveals an array of variable proteins that contribute to interspecies variation

    PubMed Central

    Thomson, Nicholas R.; Yeats, Corin; Bell, Kenneth; Holden, Matthew T.G.; Bentley, Stephen D.; Livingstone, Morag; Cerdeño-Tárraga, Ana M.; Harris, Barbara; Doggett, Jon; Ormond, Doug; Mungall, Karen; Clarke, Kay; Feltwell, Theresa; Hance, Zahra; Sanders, Mandy; Quail, Michael A.; Price, Claire; Barrell, Bart G.; Parkhill, Julian; Longbottom, David

    2005-01-01

    The obligate intracellular bacterial pathogen Chlamydophila abortus strain S26/3 (formerly the abortion subtype of Chlamydia psittaci) is an important cause of late gestation abortions in ruminants and pigs. Furthermore, although relatively rare, zoonotic infection can result in acute illness and miscarriage in pregnant women. The complete genome sequence was determined and shows a high level of conservation in both sequence and overall gene content in comparison to other Chlamydiaceae. The 1,144,377-bp genome contains 961 predicted coding sequences, 842 of which are conserved with those of Chlamydophila caviae and Chlamydophila pneumoniae. Within this conserved Cp. abortus core genome we have identified the major regions of variation and have focused our analysis on these loci, several of which were found to encode highly variable protein families, such as TMH/Inc and Pmp families, which are strong candidates for the source of diversity in host tropism and disease causation in this group of organisms. Significantly, Cp. abortus lacks any toxin genes, and also lacks genes involved in tryptophan metabolism and nucleotide salvaging (guaB is present as a pseudogene), suggesting that the genetic basis of niche adaptation of this species is distinct from those previously proposed for other chlamydial species. PMID:15837807

  17. Molecular cloning and sequencing of the gene encoding the fimbrial subunit protein of Bacteroides gingivalis.

    PubMed Central

    Dickinson, D P; Kubiniec, M A; Yoshimura, F; Genco, R J

    1988-01-01

    The gene encoding the fimbrial subunit protein of Bacteroides gingivalis 381, fimbrilin, has been cloned and sequenced. The gene was present as a single copy on the bacterial chromosome, and the codon usage in the gene conformed closely to that expected for an abundant protein. The predicted size of the mature protein was 35,924 daltons, and the secretory form may have had a 10-amino-acid, hydrophilic leader sequence similar to the leader sequences of the MePhe fimbriae family. The protein sequence had no marked similarity to known fimbrial sequences, and no homologous sequences could be found in other black-pigmented Bacteroides species, suggesting that fimbrilin represents a class of fimbrial subunit protein of limited distribution. PMID:2895100

  18. Sequence-specific binding of simian virus 40 A protein to nonorigin and cellular DNA.

    PubMed Central

    Wright, P J; DeLucia, A L; Tegtmeyer, P

    1984-01-01

    The simian virus 40 A protein (T antigen) recognized and bound to the consensus sequence 5'-GAGGC-3' in DNA from many sources. Sequence-specific binding to single pentanucleotides in randomly chosen DNA predominated over binding to nonspecific sequences. The asymmetric orientation of protein bound to nonorigin recognition sequences also resembled that of protein bound to the origin region of simian virus 40 DNA. Sequence variations in the DNA adjacent to single pentanucleotides influenced binding affinities even though methylation interference and protection studies did not reveal specific interactions outside of pentanucleotides. Thus, potential locations of A protein bound to any DNA can be predicted although the determinants of binding affinity are not yet understood. Sequence-specific binding of A protein to cellular DNA would provide a mechanism for specific alterations of host gene expression that facilitate viral function. PMID:6570189
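    The statement that potential binding locations can be predicted from the pentanucleotide alone can be sketched as a strand-aware scan for 5'-GAGGC-3'. This is an illustrative sketch only: the function names and the test sequence are hypothetical, and the scan says nothing about binding affinity, which the abstract notes is not yet understood.

```python
# Translation table for complementing an uppercase DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, motif="GAGGC"):
    """List (position, strand) for each occurrence of the motif on either strand.

    Positions are 0-based offsets on the given (top) strand; '-' means the
    motif reads 5'->3' on the opposite strand, so orientation is preserved.
    """
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            hits.append((i, "-"))
    return hits


# Hypothetical sequence with one site on each strand.
sites = find_sites("TTGAGGCAAGCCTCTT")
```

    Reporting the strand alongside the position keeps the orientation information that the abstract describes as asymmetric.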

  19. Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein.

    PubMed

    Oliphant, A R; Brandl, C J; Struhl, K

    1989-07-01

    We describe a new method for accurately defining the sequence recognition properties of DNA-binding proteins by selecting high-affinity binding sites from random-sequence DNA. The yeast transcriptional activator protein GCN4 was coupled to a Sepharose column, and binding sites were isolated by passing short, random-sequence oligonucleotides over the column and eluting them with increasing salt concentrations. Of 43 specifically bound oligonucleotides, 40 contained the symmetric sequence TGA(C/G)TCA, whereas the other 3 contained sequences matching six of these seven bases. The extreme preference for this 7-base-pair sequence suggests that each position directly contacts GCN4. The three nucleotide positions on each side of this core heptanucleotide also showed sequence preferences, indicating their effect on GCN4 binding. Interestingly, deviations in the core and a stronger sequence preference in the flanking region were found on one side of the central C·G base pair. Although GCN4 binds as a dimer, this asymmetry supports a model in which interactions on each side of the binding site are not equivalent. The random selection method should prove generally useful for defining the specificities of other DNA-binding proteins and for identifying putative target sequences from genomic DNA.
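    The classification of selected oligonucleotides into exact matches to TGA(C/G)TCA and six-of-seven matches can be sketched as follows. The oligonucleotide strings and function names are hypothetical. Because TGACTCA and TGAGTCA are reverse complements of each other, scanning one strand against both forms covers both orientations.

```python
import re

# The core heptanucleotide from the abstract, with C or G at the central position.
CORE = re.compile(r"TGA[CG]TCA")
CONSENSUS_FORMS = ("TGACTCA", "TGAGTCA")


def has_core(seq):
    """True if the sequence contains an exact match to TGA(C/G)TCA."""
    return CORE.search(seq) is not None


def best_match(seq):
    """Largest number of matching positions (out of 7) over all 7-base windows."""
    best = 0
    for i in range(len(seq) - 6):
        window = seq[i:i + 7]
        for form in CONSENSUS_FORMS:
            matches = sum(a == b for a, b in zip(window, form))
            best = max(best, matches)
    return best


# Hypothetical oligonucleotides: one exact site, one with a single mismatch.
exact_oligo = "AATGACTCATT"
near_oligo = "AATGACTCTTT"
```

    With these inputs, `best_match(exact_oligo)` is 7 and `best_match(near_oligo)` is 6, mirroring the two groups described in the abstract.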

  20. Sequence-based prediction of protein protein interaction using a deep-learning algorithm.

    PubMed

    Sun, Tanlin; Zhou, Bo; Lai, Luhua; Pei, Jianfeng

    2017-05-25

    Protein-protein interactions (PPIs) are critical for many biological processes. It is therefore important to develop accurate high-throughput methods for identifying PPI to better understand protein function, disease occurrence, and therapy design. Though various computational methods for predicting PPI have been developed, their robustness for prediction with external datasets is unknown. Deep-learning algorithms have achieved successful results in diverse areas, but their effectiveness for PPI prediction has not been tested. We used a stacked autoencoder, a type of deep-learning algorithm, to study the sequence-based PPI prediction. The best model achieved an average accuracy of 97.19% with 10-fold cross-validation. The prediction accuracies for various external datasets ranged from 87.99% to 99.21%, which are superior to those achieved with previous methods. To our knowledge, this research is the first to apply a deep-learning algorithm to sequence-based PPI prediction, and the results demonstrate its potential in this field.

  1. Determination of the sequences of protein-derived peptides and peptide mixtures by mass spectrometry

    PubMed Central

    Morris, Howard R.; Williams, Dudley H.; Ambler, Richard P.

    1971-01-01

    Micro-quantities of protein-derived peptides have been converted into N-acetylated permethyl derivatives, and their sequences determined by low-resolution mass spectrometry without prior knowledge of their amino acid compositions or lengths. A new strategy is suggested for the mass spectrometric sequencing of oligopeptides or proteins, involving gel filtration of protein hydrolysates and subsequent sequence analysis of peptide mixtures. Finally, results are given that demonstrate for the first time the use of mass spectrometry for the analysis of a protein-derived peptide mixture, again without prior knowledge of the protein or components within the mixture. PMID:5158904

  2. UNIT 11.10 N-Terminal Sequence Analysis of Proteins and Peptides

    PubMed Central

    Speicher, Kaye D.; Gorman, Nicole; Speicher, David W.

    2009-01-01

    Automated N-terminal sequence analysis involves a series of chemical reactions that derivatize and remove one amino acid at a time from the N-terminal of purified peptides or intact proteins. At least several pmoles of a purified protein or 10 to 20 pmoles of a purified peptide with an unmodified N-terminal is required in order to obtain useful sequence information. In recent years the demand for N-terminal sequencing has decreased substantially as some applications for protein identification and characterization can now be more effectively performed using mass spectrometry. However, N-terminal sequencing remains the method of choice for verifying the N-terminal boundary of recombinant proteins, determining the N-terminal of protease-resistant domains, identifying proteins isolated from species where most of the genome has not yet been sequenced, and mapping modified or crosslinked sites in proteins that prove to be refractory to analysis by mass spectrometry. PMID:18429102

  3. Shark myelin basic protein: amino acid sequence, secondary structure, and self-association.

    PubMed

    Milne, T J; Atkins, A R; Warren, J A; Auton, W P; Smith, R

    1990-09-01

    Myelin basic protein (MBP) from the Whaler shark (Carcharhinus obscurus) has been purified from acid extracts of a chloroform/methanol pellet from whole brains. The amino acid sequence of the majority of the protein has been determined and compared with the sequences of other MBPs. The shark protein has only 44% homology with the bovine protein, but, in common with other MBPs, it has basic residues distributed throughout the sequence and no extensive segments that are predicted to have an ordered secondary structure in solution. Shark MBP lacks the triproline sequence previously postulated to form a hairpin bend in the molecule. The region containing the putative consensus sequence for encephalitogenicity in the guinea pig contains several substitutions, thus accounting for the lack of activity of the shark protein. Studies of the secondary structure and self-association have shown that shark MBP possesses solution properties similar to those of the bovine protein, despite the extensive differences in primary structure.

  4. 3D reconstruction software comparison for short sequences

    NASA Astrophysics Data System (ADS)

    Strupczewski, Adam; Czupryński, Błażej

    2014-11-01

    Large-scale multiview reconstruction has recently become a very popular area of research. There are many open source tools that can be downloaded and run on a personal computer. However, there are few, if any, comparisons between all the available software in terms of accuracy on small datasets that a single user can create. The typical datasets for testing the software are archeological sites or cities, comprising thousands of images. This paper presents a comparison of currently available open source multiview reconstruction software for small datasets. It also compares the open source solutions with a simple structure from motion pipeline developed by the authors from scratch with the use of OpenCV and Eigen libraries.

  5. Overlapping Genes Produce Proteins with Unusual Sequence Properties and Offer Insight into De Novo Protein Creation

    PubMed Central

    Rancurel, Corinne; Khosravi, Mahvash; Dunker, A. Keith; Romero, Pedro R.; Karlin, David

    2009-01-01

    It is widely assumed that new proteins are created by duplication, fusion, or fission of existing coding sequences. Another mechanism of protein birth is provided by overlapping genes. They are created de novo by mutations within a coding sequence that lead to the expression of a novel protein in another reading frame, a process called “overprinting.” To investigate this mechanism, we have analyzed the sequences of the protein products of manually curated overlapping genes from 43 genera of unspliced RNA viruses infecting eukaryotes. Overlapping proteins have a sequence composition globally biased toward disorder-promoting amino acids and are predicted to contain significantly more structural disorder than nonoverlapping proteins. By analyzing the phylogenetic distribution of overlapping proteins, we were able to confirm that 17 of these had been created de novo and to study them individually. Most proteins created de novo are orphans (i.e., restricted to one species or genus). Almost all are accessory proteins that play a role in viral pathogenicity or spread, rather than proteins central to viral replication or structure. Most proteins created de novo are predicted to be fully disordered and have a highly unusual sequence composition. This suggests that some viral overlapping reading frames encoding hypothetical proteins with highly biased composition, often discarded as noncoding, might in fact encode proteins. Some proteins created de novo are predicted to be ordered, however, and whenever a three-dimensional structure of such a protein has been solved, it corresponds to a fold previously unobserved, suggesting that the study of these proteins could enhance our knowledge of protein space. PMID:19640978

  6. New Powerful Statistics for Alignment-free Sequence Comparison Under a Pattern Transfer Model

    PubMed Central

    Liu, Xuemei; Wan, Lin; Li, Jing; Reinert, Gesine; Waterman, Michael S.; Sun, Fengzhu

    2011-01-01

    Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic D2 and its variants D2∗ and D2s showed that their power approximates a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on D2, D2∗ and D2s by comparing local sequence pairs and then summing over all the local sequence pairs of certain length. We show that the new statistics are much more powerful than the corresponding statistics and the power tends to 1 as the sequence length tends to infinity under the pattern transfer model. PMID:21723298
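    For orientation, the basic D2 statistic that these variants build on is commonly defined as the sum, over all k-words, of the product of the word's counts in the two sequences. The sketch below implements only that plain D2; the normalized variants and the new local statistics from this paper are not reproduced, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-word in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Plain D2: sum over k-words w of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for words absent from seq_b, so they contribute nothing.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

    For example, `d2("ACGT", "ACGT", 2)` is 3 because the words AC, CG and GT each occur once in both sequences, while `d2("ACGT", "TTTT", 2)` is 0 because no 2-word is shared.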

  7. Hydrophobic-cluster analysis of plant protein sequences. A domain homology between storage and lipid-transfer proteins.

    PubMed Central

    Henrissat, B; Popineau, Y; Kader, J C

    1988-01-01

    Hydrophobic-cluster analysis was used to characterize a conserved domain located near the C-terminal amino acid sequence of wheat (Triticum aestivum) storage proteins. This domain was transformed into a linear template for a global search for similarities in over 5200 protein sequences. In addition to proteins that had already been found to exhibit homology to wheat storage proteins, a previously unreported homology was found with non-specific lipid-transfer proteins from castor bean (Ricinus communis) and from spinach (Spinacia oleracea) leaf. Hydrophobic-cluster analysis of various members of the present protein group clearly shows a typical domain structure where (i) variable and conserved domains are located along the sequence at precise positions, (ii) the conserved domains probably reflect a common ancestor, and (iii) the unique properties of a given protein (chain cut into subunits, repetitive domains, trypsin-inhibitor active site) are associated with the variable domains. PMID:3214430

  8. Protein science by DNA sequencing: how advances in molecular biology are accelerating biochemistry.

    PubMed

    Higgins, Sean Andrew; Savage, David F

    2017-10-09

    A fundamental goal of protein biochemistry is to determine the sequence-function relationship, but the vastness of sequence space makes comprehensive evaluation of this landscape difficult. However, advances in DNA synthesis and sequencing now allow researchers to assess the functional impact of every single mutation in many proteins, but challenges remain in library construction and the development of general assays applicable to a diverse range of protein functions. This perspective briefly outlines the technical innovations in DNA manipulation which allow massively parallel protein biochemistry, then summarizes the methods currently available for library construction and the functional assays of protein variants. Areas in need of future innovation are highlighted with a particular focus on assay development and the use of computational analysis with machine learning to effectively traverse the sequence-function landscape. Finally, applications in the fundamentals of protein biochemistry, disease prediction, and protein engineering are presented.

  9. Operational definition of intrinsically unstructured protein sequences based on susceptibility to the 20S proteasome.

    PubMed

    Tsvetkov, Peter; Asher, Gad; Paz, Aviv; Reuven, Nina; Sussman, Joel L; Silman, Israel; Shaul, Yosef

    2008-03-01

    Intrinsically unstructured proteins (IUPs), also known as natively unfolded proteins, lack well-defined secondary and tertiary structure under physiological conditions. In recent years, growing experimental and theoretical evidence has accumulated, indicating that many entire proteins and protein sequences are unstructured under physiological conditions, and that they play significant roles in diverse cellular processes. Bioinformatic algorithms have been developed to identify such sequences in proteins for which structural data are lacking, but still generate substantial numbers of false positives and negatives. We describe here a simple and reliable in vitro assay for identifying IUP sequences based on their susceptibility to 20S proteasomal degradation. We show that 20S proteasomes digest IUP sequences, under conditions in which native, and even molten globule states, are resistant. Furthermore, we show that protein-protein interactions can protect IUPs against 20S proteasomal action. Taken together, our results thus suggest that the 20S proteasome degradation assay provides a powerful system for operational definition of IUPs.

  10. Conservation of Shannon's redundancy for proteins. [information theory applied to amino acid sequences]

    NASA Technical Reports Server (NTRS)

    Gatlin, L. L.

    1974-01-01

    Concepts of information theory are applied to examine various proteins in terms of their redundancy in natural originators such as animals and plants. The Monte Carlo method is used to derive information parameters for random protein sequences. Real protein sequence parameters are compared with the standard parameters of protein sequences having a specific length. The tendency of a chain to contain some amino acids more frequently than others and the tendency of a chain to contain certain amino acid pairs more frequently than other pairs are used as randomness measures of individual protein sequences. Non-periodic proteins are generally found to have random Shannon redundancies except in cases of constraints due to short chain length and genetic codes. Redundant characteristics of highly periodic proteins are discussed. A degree of periodicity parameter is derived.
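    The first randomness measure mentioned here, the tendency of a chain to contain some amino acids more often than others, can be illustrated with the Shannon entropy of the residue composition. This is a simplified sketch: it covers single-residue frequencies only, ignores the pair-frequency term, and the redundancy formula used (one minus entropy over the maximum entropy for a 20-letter alphabet) is an assumption for illustration rather than the report's exact parameterization.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy in bits per residue of the single-residue composition."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def composition_redundancy(seq, alphabet_size=20):
    """1 - H / log2(alphabet_size): 0 for uniform composition, 1 for a single residue."""
    return 1.0 - shannon_entropy(seq) / math.log2(alphabet_size)


# A toy sequence using each of the 20 standard amino acids exactly once.
uniform_chain = "ACDEFGHIKLMNPQRSTVWY"
```

    A chain made of one repeated residue gives a redundancy of 1, and `uniform_chain` gives a value of essentially 0; short chains cannot sample all residues evenly, which is consistent with the chain-length constraint the abstract mentions.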

  11. Microwave-assisted acid and base hydrolysis of intact proteins containing disulfide bonds for protein sequence analysis by mass spectrometry.

    PubMed

    Reiz, Bela; Li, Liang

    2010-09-01

    Controlled hydrolysis of proteins to generate peptide ladders combined with mass spectrometric analysis of the resultant peptides can be used for protein sequencing. In this paper, two methods of improving the microwave-assisted protein hydrolysis process are described to enable rapid sequencing of proteins containing disulfide bonds and increase sequence coverage, respectively. It was demonstrated that proteins containing disulfide bonds could be sequenced by MS analysis by first performing hydrolysis for less than 2 min, followed by 1 h of reduction to release the peptides originally linked by disulfide bonds. It was shown that a strong base could be used as a catalyst for microwave-assisted protein hydrolysis, producing complementary sequence information to that generated by microwave-assisted acid hydrolysis. However, using either acid or base hydrolysis, amide bond breakages in small regions of the polypeptide chains of the model proteins (e.g., cytochrome c and lysozyme) were not detected. Dynamic light scattering measurement of the proteins solubilized in an acid or base indicated that protein-protein interaction or aggregation was not the cause of the failure to hydrolyze certain amide bonds. It was speculated that there were some unknown local structures that might play a role in preventing an acid or base from reacting with the peptide bonds therein.

  12. Protein identities from 'Graphocephala atropunctata' expressed sequence tags: Expanding leafhopper vector biology

    USDA-ARS's Scientific Manuscript database

    Heat shock proteins and 44 protein sequences from the blue-green sharpshooter, BGSS, were produced and identified. The sequences were submitted and published under accession numbers: DQ445499-DQ445542, in the National Center for Biotechnology Information, NCBI, Public Database. The blue-green sharps...

  13. Quantitative Assessment of RNA-Protein Interactions with High Throughput Sequencing - RNA Affinity Profiling (HiTS-RAP)

    PubMed Central

    Ozer, Abdullah; Tome, Jacob M.; Friedman, Robin C.; Gheba, Dan; Schroth, Gary P.; Lis, John T.

    2016-01-01

    Because RNA-protein interactions play a central role in a wide array of biological processes, methods that enable a quantitative assessment of these interactions in a high-throughput manner are in great demand. Recently, we developed the High Throughput Sequencing-RNA Affinity Profiling (HiTS-RAP) assay, which couples sequencing on an Illumina GAIIx with the quantitative assessment of one or several proteins’ interactions with millions of different RNAs in a single experiment. We have successfully used HiTS-RAP to analyze interactions of EGFP and NELF-E proteins with their corresponding canonical and mutant RNA aptamers. Here, we provide a detailed protocol for HiTS-RAP, which can be completed in about a month (8 days hands-on time) including the preparation and testing of recombinant proteins and DNA templates, clustering DNA templates on a flowcell, high-throughput sequencing and protein binding with GAIIx, and finally data analysis. We also highlight aspects of HiTS-RAP that can be further improved and points of comparison between HiTS-RAP and two other recently developed methods, RNA-MaP and RBNS. A successful HiTS-RAP experiment provides the sequence and binding curves for approximately 200 million RNAs in a single experiment. PMID:26182240

  14. Exhaustive comparison and classification of ligand-binding surfaces in proteins

    PubMed Central

    Murakami, Yoichi; Kinoshita, Kengo; Kinjo, Akira R; Nakamura, Haruki

    2013-01-01

    Many proteins function by interacting with other small molecules (ligands). Identification of ligand-binding sites (LBS) in proteins can therefore help to infer their molecular functions. A comprehensive comparison among local structures of LBSs was previously performed, in order to understand their relationships and to classify their structural motifs. However, similar exhaustive comparison among local surfaces of LBSs (patches) has never been performed, due to computational complexity. To enhance our understanding of LBSs, it is worth performing such comparisons among patches and classifying them based on similarities of their surface configurations and electrostatic potentials. In this study, we first developed a rapid method to compare two patches. We then clustered patches corresponding to the same PDB chemical component identifier for a ligand, and selected a representative patch from each cluster. We subsequently compared the representative patches exhaustively and clustered them using a similarity score, PatSim. Finally, the resultant PatSim scores were compared with similarities of atomic structures of the LBSs and those of the ligand-binding protein sequences and functions. Consequently, we classified the patches into ∼2000 well-characterized clusters. We found that about 63% of these clusters are used in identical protein folds, although about 25% of the clusters are conserved in distantly related proteins and even in proteins with cross-fold similarity. Furthermore, we showed that patches with higher PatSim scores have the potential to be involved in similar biological processes. PMID:23934772

  15. Comparison of immunoturbidimetric and immunonephelometric assays for specific proteins.

    PubMed

    Mali, Bahera; Armbruster, David; Serediak, Ernie; Ottenbreit, Tammy

    2009-10-01

    Immunoturbidimetric assays for specific proteins are available on "open system" clinical chemistry analyzers. The analytical performance of nine immunoturbidimetric specific protein assays (C3, C4, CRP, Haptoglobin, IgA, IgG, IgM, RF, and Transferrin) was compared to immunonephelometry. Testing was performed on the Abbott ARCHITECT ci8200 and the Dade Behring BNII nephelometer and evaluated for precision, linearity, limit of detection, prozone phenomenon, method comparison, workflow, and proficiency testing survey comparison. Immunoturbidimetric assay performance was satisfactory for total precision, linearity, and limit of detection, and the prozone effect was not observed. Method comparison was acceptable for the immunoglobulins, CRP and transferrin but less favorable for the other assays, likely due to methodology and antibody specificity differences. Immunoturbidimetric specific protein assays allow for efficient test consolidation on a general purpose clinical chemistry analyzer.

  16. Protein sequences from mastodon and Tyrannosaurus rex revealed by mass spectrometry.

    PubMed

    Asara, John M; Schweitzer, Mary H; Freimark, Lisa M; Phillips, Matthew; Cantley, Lewis C

    2007-04-13

    Fossilized bones from extinct taxa harbor the potential for obtaining protein or DNA sequences that could reveal evolutionary links to extant species. We used mass spectrometry to obtain protein sequences from bones of a 160,000- to 600,000-year-old extinct mastodon (Mammut americanum) and a 68-million-year-old dinosaur (Tyrannosaurus rex). The presence of T. rex sequences indicates that their peptide bonds were remarkably stable. Mass spectrometry can thus be used to determine unique sequences from ancient organisms from peptide fragmentation patterns, a valuable tool to study the evolution and adaptation of ancient taxa from which genomic sequences are unlikely to be obtained.

  17. Close Sequence Comparisons are Sufficient to Identify Human cis-Regulatory Elements

    SciTech Connect

    Prabhakar, Shyam; Poulin, Francis; Shoukry, Malak; Afzal, Veena; Rubin, Edward M.; Couronne, Olivier; Pennacchio, Len A.

    2005-12-01

    Cross-species DNA sequence comparison is the primary method used to identify functional noncoding elements in human and other large genomes. However, little is known about the relative merits of evolutionarily close and distant sequence comparisons, due to the lack of a universal metric for sequence conservation, and also the paucity of empirically defined benchmark sets of cis-regulatory elements. To address this problem, we developed a general-purpose algorithm (Gumby) that detects slowly-evolving regions in primate, mammalian and more distant comparisons without requiring adjustment of parameters, and ranks conserved elements by P-value using Karlin-Altschul statistics. We benchmarked Gumby predictions against previously identified cis-regulatory elements at diverse genomic loci, and also tested numerous extremely conserved human-rodent sequences for transcriptional enhancer activity using reporter-gene assays in transgenic mice. Human regulatory elements were identified with acceptable sensitivity and specificity by comparison with 1-5 other eutherian mammals or 6 other simian primates. More distant comparisons (marsupial, avian, amphibian and fish) failed to identify many of the empirically defined functional noncoding elements. We derived an intuitive relationship between ancient and recent noncoding sequence conservation from whole genome comparative analysis, which explains some of these findings. Lastly, we determined that, in addition to strength of conservation, genomic location and/or density of surrounding conserved elements must also be considered in selecting candidate enhancers for testing at embryonic time points.
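    Ranking conserved elements by P-value with Karlin-Altschul statistics rests on the standard relation E = K * m * n * exp(-lambda * S), with P = 1 - exp(-E). The sketch below shows that relation only. The abstract does not give Gumby's scoring scheme or its K and lambda values, so the parameter values in the usage lines are placeholders chosen for illustration, and the function names are hypothetical.

```python
import math


def karlin_altschul_evalue(score, m, n, lam, k):
    """Expected number of chance local matches scoring at least `score`.

    m and n are the lengths of the two compared sequences; lam and k are the
    Karlin-Altschul parameters of the scoring system.
    """
    return k * m * n * math.exp(-lam * score)


def karlin_altschul_pvalue(score, m, n, lam, k):
    """Probability of at least one chance match scoring at least `score`."""
    e_value = karlin_altschul_evalue(score, m, n, lam, k)
    # expm1 keeps precision when the E-value is very small.
    return -math.expm1(-e_value)


# Placeholder parameters, not values from the paper.
p_weak = karlin_altschul_pvalue(score=20, m=1000, n=1000, lam=0.3, k=0.1)
p_strong = karlin_altschul_pvalue(score=60, m=1000, n=1000, lam=0.3, k=0.1)
```

    Because the E-value decays exponentially with score, `p_strong` is far smaller than `p_weak`, which is what allows elements to be ranked on a common scale across comparisons of different evolutionary depth.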

  18. Protein identification with N and C-terminal sequence tags in proteome projects.

    PubMed

    Wilkins, M R; Gasteiger, E; Tonella, L; Ou, K; Tyler, M; Sanchez, J C; Gooley, A A; Walsh, B J; Bairoch, A; Appel, R D; Williams, K L; Hochstrasser, D F

    1998-05-08

    Genome sequences are available for increasing numbers of organisms. The proteomes (protein complement expressed by the genome) of many such organisms are being studied with two-dimensional (2D) gel electrophoresis. Here we have investigated the application of short N-terminal and C-terminal sequence tags to the identification of proteins separated on 2D gels. The theoretical N and C termini of 15,519 proteins, representing all SWISS-PROT entries for the organisms Mycoplasma genitalium, Bacillus subtilis, Escherichia coli, Saccharomyces cerevisiae and human, were analysed. Sequence tags were found to be surprisingly specific, with N-terminal tags of four amino acid residues found to be unique for between 43% and 83% of proteins, and C-terminal tags of four amino acid residues unique for between 74% and 97% of proteins, depending on the species studied. Sequence tags of five amino acid residues were found to be even more specific. To utilise this specificity of sequence tags for protein identification, we created a world-wide web-accessible protein identification program, TagIdent (http://www.expasy.ch/www/tools.html), which matches sequence tags of up to six amino acid residues as well as estimated protein pI and mass against proteins in the SWISS-PROT database. We demonstrate the utility of this identification approach with sequence tags generated from 91 different E. coli proteins purified by 2D gel electrophoresis. Fifty-one proteins were unambiguously identified by virtue of their sequence tags and estimated pI and mass, and a further 11 proteins identified when sequence tags were combined with protein amino acid composition data. We conclude that the TagIdent identification approach is best suited to the identification of proteins from prokaryotes whose complete genome sequences are available. The approach is less well suited to proteins from eukaryotes, as many eukaryotic proteins are not amenable to sequencing via Edman degradation, and tag protein
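    The tag-specificity figures quoted in this abstract (the fraction of proteins whose four-residue terminal tag is unique within a proteome) can be computed along the following lines. This is an illustrative sketch, not the TagIdent implementation: the function name is hypothetical and the three toy sequences are invented.

```python
from collections import Counter


def unique_tag_fraction(proteins, tag_len=4, end="N"):
    """Fraction of proteins whose terminal tag occurs in no other protein.

    end="N" takes the first tag_len residues, end="C" the last tag_len.
    Proteins shorter than tag_len are skipped.
    """
    tags = [
        p[:tag_len] if end == "N" else p[-tag_len:]
        for p in proteins
        if len(p) >= tag_len
    ]
    if not tags:
        return 0.0
    counts = Counter(tags)
    return sum(1 for t in tags if counts[t] == 1) / len(tags)


# Invented toy proteome: two proteins share the N-terminal tag MKTA.
toy_proteome = ["MKTAYIAK", "MKTALLLV", "MSDQEAKP"]
n_fraction = unique_tag_fraction(toy_proteome, end="N")
c_fraction = unique_tag_fraction(toy_proteome, end="C")
```

    In the toy proteome only one of three N-terminal tags is unique, while all three C-terminal tags are, which echoes the abstract's observation that C-terminal tags tend to be more specific.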

  19. Basal Murphy belt and Chilhowee Group -- Sequence stratigraphic comparison

    SciTech Connect

    Aylor, J.G. Jr. (Dept. of Geology)

    1994-03-01

    The lower Murphy belt in the central western Blue Ridge is interpreted to be correlative to the Early Cambrian Chilhowee Group of the westernmost Blue Ridge and Appalachian fold and thrust belt. Basal Murphy belt depositional sequence stratigraphy represents a second-order, type-2 transgressive systems tract initiated with deposition of lowstand turbidites of the Dean Formation. These transgressive deposits of the Nantahala and Brasstown Formations are interpreted as middle to outer continental shelf deposits. Cyclic and stacked third-order regressive, coarsening-upward sequences of the Nantahala Formation display an overall increase in feldspar content stratigraphically upsection. These transgressive siliciclastic deposits are interpreted to be conformably overlain by a carbonate highstand systems tract of the Murphy Marble. Palinspastic reconstruction indicates that the Nantahala and Brasstown Formations possibly represent a basinward extension of a siliciclastic wedge up to 3 km thick. The wedge tapers to the southwest along the strike of the Murphy belt at 10° and thins northwestward to 2 km in the Tennessee depocenter, where it is represented by the Chilhowee Group. The Murphy belt basin is believed to represent a transitional rift-to-drift facies deposited on the lower plate of the southern Blue Ridge rift zone.

  20. Molecular Evolution of the Escherichia Coli Chromosome. IV. Sequence Comparisons

    PubMed Central

    Milkman, R.; Bridges, M. M.

    1993-01-01

    DNA sequences have been compared in a 4,400-bp region for Escherichia coli K12 and 36 ECOR strains. Discontinuities in degree of similarity, previously inferred, are confirmed in detail. Three clonal frames are described on the basis of the present local high-resolution data, as well as previous analyses of restriction fragment length polymorphism (RFLP) and of multilocus enzyme electrophoresis (MLEE) covering small regions more widely dispersed on the chromosome. These three approaches show important consistency. The data illustrate the fact that, in the limited context of intraspecific genomic sequence variation, clonality and homology are synonymous. Two estimable quantitative properties are defined: recency of common ancestry (the reciprocal of the log(10) of the number of generations since the most recent common ancestor), and the number of nucleotide pairs over which a given recency of common ancestry applies. In principle, these parameters are measures of the degree and physical extent of homology. The small size of apparent recombinational replacements, together with the observation that they occasionally occur in discontinuous series, raises the question of whether they result from the superimposition of replacements of much larger size (as expected from an elementary interpretation of conjugation and transduction in experimental E. coli systems) or via an alternative mechanism. Length polymorphisms of several sorts are described. PMID:8095913

  1. Application of 2D graphic representation of protein sequence based on Huffman tree method.

    PubMed

    Qi, Zhao-Hui; Feng, Jun; Qi, Xiao-Qin; Li, Ling

    2012-05-01

    Based on the Huffman tree method, we propose a new 2D graphic representation of protein sequences. This representation completely avoids loss of information in the transfer of data from a protein sequence to its graphic representation. The method consists of two parts. The first assigns 0-1 codes to the 20 amino acids using a Huffman tree built from amino acid frequencies, where the frequency of an amino acid is defined as its count in the analyzed protein sequences. The second constructs the 2D graphic representation of a protein sequence from these 0-1 codes. The application of the method to ten ND5 genes and seven Escherichia coli strains is then presented in detail. The results show that the proposed model may provide some new insights into the evolution patterns determined from protein sequences and complete genomes. Copyright © 2012 Elsevier Ltd. All rights reserved.
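
    The first part of the method described above (0-1 codes from a Huffman tree over amino acid counts) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the tie-breaking rule and the toy sequences are assumptions, and the 2D plotting step is not reproduced because the abstract does not specify it. The round trip through encode and decode illustrates why a prefix code loses no sequence information.

```python
import heapq
from collections import Counter


def huffman_codes(sequences):
    """Build 0-1 codes for residues from their counts in `sequences`."""
    freq = Counter("".join(sequences))
    # Heap items: (count, unique tie-breaker, {residue: partial code}).
    # The tie-breaker is an assumption; it only keeps the result deterministic.
    heap = [(n, i, {aa: ""}) for i, (aa, n) in enumerate(sorted(freq.items()))]
    if not heap:
        return {}
    if len(heap) == 1:
        # A single distinct residue still needs a one-bit code.
        return {aa: "0" for aa in heap[0][2]}
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing 0 and 1.
        n0, _, left = heapq.heappop(heap)
        n1, _, right = heapq.heappop(heap)
        merged = {aa: "0" + code for aa, code in left.items()}
        merged.update({aa: "1" + code for aa, code in right.items()})
        heapq.heappush(heap, (n0 + n1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(seq, codes):
    """Concatenate the 0-1 code of every residue."""
    return "".join(codes[aa] for aa in seq)


def decode(bits, codes):
    """Invert `encode`; works because Huffman codes are prefix-free."""
    inverse = {code: aa for aa, code in codes.items()}
    residues, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            residues.append(inverse[buffer])
            buffer = ""
    return "".join(residues)


if __name__ == "__main__":
    toy = ["MKVLAAGLLK", "MKLLAGGVVK"]  # hypothetical sequences
    codes = huffman_codes(toy)
    for seq in toy:
        print(seq, encode(seq, codes))
```

    Frequent residues receive shorter codes, so the code table depends on the set of sequences being analyzed, as the abstract states.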

  2. A local average distance descriptor for flexible protein structure comparison

    PubMed Central

    2014-01-01

    Background Protein structures are flexible and often show conformational changes upon binding to other molecules to exert biological functions. As protein structures correlate with characteristic functions, structure comparison allows classification and prediction of proteins of undefined functions. However, most comparison methods treat proteins as rigid bodies and cannot retrieve similarities of proteins with large conformational changes effectively. Results In this paper, we propose a novel descriptor, local average distance (LAD), based on either the geodesic distances (GDs) or Euclidean distances (EDs) for pairwise flexible protein structure comparison. The proposed method was compared with 7 structural alignment methods and 7 shape descriptors on two datasets comprising hinge bending motions from the MolMovDB, and the results have shown that our method outperformed all other methods regarding retrieving similar structures in terms of precision-recall curve, retrieval success rate, R-precision, mean average precision and F1-measure. Conclusions Both ED- and GD-based LAD descriptors are effective to search deformed structures and overcome the problems of self-connection caused by a large bending motion. We have also demonstrated that the ED-based LAD is more robust than the GD-based descriptor. The proposed algorithm provides an alternative approach for blasting structure database, discovering previously unknown conformational relationships, and reorganizing protein structure classification. PMID:24694083

  3. Reconstruction of an ancestral Yersinia pestis genome and comparison with an ancient sequence

    PubMed Central

    2015-01-01

    Background We propose the computational reconstruction of a whole bacterial ancestral genome at the nucleotide scale, and its validation by a sequence of ancient DNA. This rare possibility is offered by an ancient sequence of the late middle ages plague agent. It has been hypothesized to be ancestral to extant Yersinia pestis strains based on the pattern of nucleotide substitutions. But the dynamics of indels, duplications, insertion sequences and rearrangements has impacted all genomes much more than the substitution process, which makes the ancestral reconstruction task challenging. Results We use a set of gene families from 13 Yersinia species, construct reconciled phylogenies for all of them, and determine gene orders in ancestral species. Gene trees integrate information from the sequence, the species tree and gene order. We reconstruct ancestral sequences for ancestral genic and intergenic regions, providing nearly a complete genome sequence for the ancestor, containing a chromosome and three plasmids. Conclusion The comparison of the ancestral and ancient sequences provides a unique opportunity to assess the quality of ancestral genome reconstruction methods. But the quality of the sequencing and assembly of the ancient sequence can also be questioned by this comparison. PMID:26450112

  4. Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains

    PubMed Central

    Liao, Weinan; Ren, Jie; Wang, Kun; Wang, Shun; Zeng, Feng; Wang, Ying; Sun, Fengzhu

    2016-01-01

    The comparison between microbial sequencing data is critical to understand the dynamics of microbial communities. The alignment-based tools analyzing metagenomic datasets require reference sequences and read alignments. The available alignment-free dissimilarity approaches model the background sequences with Fixed Order Markov Chain (FOMC) yielding promising results for the comparison of microbial communities. However, in FOMC, the number of parameters grows exponentially with the increase of the order of Markov Chain (MC). Under a fixed high order of MC, the parameters might not be accurately estimated owing to the limitation of sequencing depth. In our study, we investigate an alternative to FOMC to model background sequences with the data-driven Variable Length Markov Chain (VLMC) in metatranscriptomic data. The VLMC originally designed for long sequences was extended to apply to high-throughput sequencing reads and the strategies to estimate the corresponding parameters were developed. The flexible number of parameters in VLMC avoids estimating the vast number of parameters of high-order MC under limited sequencing depth. Different from the manual selection in FOMC, VLMC determines the MC order adaptively. Several beta diversity measures based on VLMC were applied to compare the bacterial RNA-Seq and metatranscriptomic datasets. Experiments show that VLMC outperforms FOMC to model the background sequences in transcriptomic and metatranscriptomic samples. A software pipeline is available at https://d2vlmc.codeplex.com. PMID:27876823
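
    The exponential growth in FOMC parameters mentioned above can be made concrete with a short calculation. The sketch assumes the standard parameterization of a fixed-order Markov chain, in which each of the A^k contexts carries A - 1 free transition probabilities; the function name is hypothetical and the code is not taken from the authors' pipeline.

```python
def fomc_free_parameters(order, alphabet_size=4):
    """Free transition probabilities of a fixed-order Markov chain.

    Each of the alphabet_size**order contexts has a distribution over
    alphabet_size symbols, one of which is fixed by normalization.
    """
    if order < 0 or alphabet_size < 2:
        raise ValueError("order must be >= 0 and alphabet_size >= 2")
    return alphabet_size ** order * (alphabet_size - 1)


if __name__ == "__main__":
    # Nucleotide alphabet (A, C, G, T): the count quadruples with each order.
    for k in range(0, 11, 2):
        print(k, fomc_free_parameters(k))
```

    Under limited sequencing depth, many high-order contexts are rarely or never observed, which is the estimation problem the variable-length model is intended to avoid.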

  5. Sequence-similar, structure-dissimilar protein pairs in the PDB

    PubMed Central

    Kosloff, Mickey; Kolodny, Rachel

    2008-01-01

    It is often assumed that in the Protein Data Bank (PDB), two proteins with similar sequences will also have similar structures. Accordingly, it has proved useful to develop subsets of the PDB from which “redundant” structures have been removed, based on a sequence-based criterion for similarity. Similarly, when predicting protein structure using homology modeling, if a template structure for modeling a target sequence is selected by sequence alone, this implicitly assumes that all sequence-similar templates are equivalent. Here, we show that this assumption is often not correct and that standard approaches to create subsets of the PDB can lead to the loss of structurally and functionally important information. We have carried out sequence-based structural superpositions and geometry-based structural alignments of a large number of protein pairs to determine the extent to which sequence similarity ensures structural similarity. We find many examples where two proteins that are similar in sequence have structures that differ significantly from one another. The source of the structural differences usually has a functional basis. The number of such protein pairs that are identified and the magnitude of the dissimilarity depend on the approach that is used to calculate the differences; in particular, sequence-based structure superpositioning will identify a larger number of structurally dissimilar pairs than geometry-based structural alignments. When two sequences can be aligned in a statistically meaningful way, sequence-based structural superpositioning provides a meaningful measure of structural differences. This approach and geometry-based structure alignments reveal somewhat different information and one or the other might be preferable in a given application. Our results suggest that in some cases, notably homology modeling, the common use of nonredundant datasets, culled from the PDB based on sequence, may mask important structural and functional information

  6. Reprint of "Identification of staphylococcal species based on variations in protein sequences (mass spectrometry) and DNA sequence (sodA microarray)".

    PubMed

    Kooken, Jennifer; Fox, Karen; Fox, Alvin; Altomare, Diego; Creek, Kim; Wunschel, David; Pajares-Merino, Sara; Martínez-Ballesteros, Ilargi; Garaizar, Javier; Oyarzabal, Omar; Samadpour, Mansour

    2014-01-01

    This report is among the first using sequence variation in newly discovered protein markers for staphylococcal (or indeed any other bacterial) speciation. Variation, at the DNA sequence level, in the sodA gene (commonly used for staphylococcal speciation) provided excellent correlation. Relatedness among strains was also assessed by protein profiling using microcapillary electrophoresis and pulsed field electrophoresis. A total of 64 strains were analyzed, including reference strains representing the 11 staphylococcal species most commonly isolated from man (Staphylococcus aureus and 10 coagulase negative species [CoNS]). Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI TOF MS) and liquid chromatography-electrospray ionization tandem mass spectrometry (LC ESI MS/MS) were used for peptide analysis of proteins isolated from gel bands. Comparison of experimental spectra of unknowns versus spectra of peptides derived from reference strains allowed bacterial identification after MALDI TOF MS analysis. After LC-MS/MS analysis of gel bands, bacterial speciation was performed by comparing experimental spectra versus virtual spectra using the software X!Tandem. Finally, LC-MS/MS was performed on whole proteomes, with data analysis also employing X!Tandem. Aconitate hydratase and oxoglutarate dehydrogenase served as marker proteins on focused analysis after gel separation. Alternatively, on full proteomics analysis, elongation factor Tu generally provided the highest confidence in staphylococcal speciation.

  7. Hydrophobic Blocks Facilitate Lipid Compatibility and Translocon Recognition of Transmembrane Protein Sequences

    PubMed Central

    2016-01-01

    Biophysical hydrophobicity scales suggest that partitioning of a protein segment from an aqueous phase into a membrane is governed by its perceived segmental hydrophobicity but do not establish specifically (i) how the segment is identified in vivo for translocon-mediated insertion or (ii) whether the destination lipid bilayer is biochemically receptive to the inserted sequence. To examine the congruence between these dual requirements, we designed and synthesized a library of Lys-tagged peptides of a core length sufficient to span a bilayer but with varying patterns of sequence, each composed of nine Leu residues, nine Ser residues, and one (central) Trp residue. We found that peptides containing contiguous Leu residues (Leu-block peptides, e.g., LLLLLLLLLWSSSSSSSSS), in comparison to those containing discontinuous stretches of Leu residues (non-Leu-block peptides, e.g., SLSLLSLSSWSLLSLSLLS), displayed greater helicity (circular dichroism spectroscopy), traveled slower during sodium dodecyl sulfate–polyacrylamide gel electrophoresis, had longer reverse phase high-performance liquid chromatography retention times on a C-18 column, and were helical when reconstituted into 1-palmitoyl-2-oleoylglycero-3-phosphocholine liposomes, each observation indicating superior lipid compatibility when a Leu-block is present. These parameters were largely paralleled in a biological membrane insertion assay using microsomal membranes from dog pancreas endoplasmic reticulum, where we found only the Leu-block sequences successfully inserted; intriguingly, an amphipathic peptide (SLLSSLLSSWLLSSLLSSL; Leu face, Ser face) with biophysical properties similar to those of Leu-block peptides failed to insert. Our overall results identify local sequence lipid compatibility rather than average hydrophobicity as a principal determinant of transmembrane segment potential, while demonstrating that further subtleties of hydrophobic and helical patterning, such as circumferential hydrophobicity
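
    The three example core sequences quoted above have the same residue composition and differ only in how the Leu residues are arranged. The sketch below makes that explicit by comparing composition with the longest contiguous Leu run. It is a descriptive illustration only: the peptide labels and function names are assumptions, the Lys tags are omitted because the abstract does not give them, and run length is not proposed here as a predictor of insertion.

```python
from collections import Counter
from itertools import groupby

# Core sequences exactly as quoted in the abstract; labels are hypothetical.
PEPTIDES = {
    "leu_block": "LLLLLLLLLWSSSSSSSSS",
    "non_leu_block": "SLSLLSLSSWSLLSLSLLS",
    "amphipathic": "SLLSSLLSSWLLSSLLSSL",
}


def longest_run(seq, residue="L"):
    """Length of the longest contiguous stretch of `residue` in `seq`."""
    return max(
        (len(list(group)) for aa, group in groupby(seq) if aa == residue),
        default=0,
    )


def composition(seq):
    """Residue counts, which fix any average (per-residue) hydrophobicity."""
    return dict(Counter(seq))


if __name__ == "__main__":
    for name, seq in PEPTIDES.items():
        print(name, composition(seq), longest_run(seq))
```

    Because the compositions are identical, any scale that averages per-residue hydrophobicity over the whole segment scores the three peptides equally. Contiguity separates the Leu-block peptide from the other two but does not distinguish the amphipathic peptide from the non-Leu-block one, consistent with the abstract's remark that further patterning effects matter.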

  8. Hydrophobic blocks facilitate lipid compatibility and translocon recognition of transmembrane protein sequences.

    PubMed

    Stone, Tracy A; Schiller, Nina; von Heijne, Gunnar; Deber, Charles M

    2015-02-24

    Biophysical hydrophobicity scales suggest that partitioning of a protein segment from an aqueous phase into a membrane is governed by its perceived segmental hydrophobicity but do not establish specifically (i) how the segment is identified in vivo for translocon-mediated insertion or (ii) whether the destination lipid bilayer is biochemically receptive to the inserted sequence. To examine the congruence between these dual requirements, we designed and synthesized a library of Lys-tagged peptides of a core length sufficient to span a bilayer but with varying patterns of sequence, each composed of nine Leu residues, nine Ser residues, and one (central) Trp residue. We found that peptides containing contiguous Leu residues (Leu-block peptides, e.g., LLLLLLLLLWSSSSSSSSS), in comparison to those containing discontinuous stretches of Leu residues (non-Leu-block peptides, e.g., SLSLLSLSSWSLLSLSLLS), displayed greater helicity (circular dichroism spectroscopy), traveled slower during sodium dodecyl sulfate-polyacrylamide gel electrophoresis, had longer reverse phase high-performance liquid chromatography retention times on a C-18 column, and were helical when reconstituted into 1-palmitoyl-2-oleoylglycero-3-phosphocholine liposomes, each observation indicating superior lipid compatibility when a Leu-block is present. These parameters were largely paralleled in a biological membrane insertion assay using microsomal membranes from dog pancreas endoplasmic reticulum, where we found only the Leu-block sequences successfully inserted; intriguingly, an amphipathic peptide (SLLSSLLSSWLLSSLLSSL; Leu face, Ser face) with biophysical properties similar to those of Leu-block peptides failed to insert. Our overall results identify local sequence lipid compatibility rather than average hydrophobicity as a principal determinant of transmembrane segment potential, while demonstrating that further subtleties of hydrophobic and helical patterning, such as circumferential hydrophobicity in

  9. Differential extraction and protein sequencing reveals major differences in patterns of primary cell wall proteins from plants.

    PubMed

    Robertson, D; Mitchell, G P; Gilroy, J S; Gerrish, C; Bolwell, G P; Slabas, A R

    1997-06-20

    The proteins of the primary cell walls of suspension cultured cells of five plant species, Arabidopsis, carrot, French bean, tomato, and tobacco, have been compared. The approach that has been adopted is differential extraction followed by SDS-polyacrylamide gel electrophoresis (PAGE), rather than two-dimensional gel analysis, to facilitate protein sequencing. Whole cells were washed sequentially with the following aqueous solutions: CaCl2, CDTA (cyclohexane diaminotetraacetic acid), DTT (dithiothreitol), NaCl, and borate. SDS-PAGE analysis showed consistent differences between species. From the 233 proteins that were selected for sequencing, 63% gave N-terminal data. This analysis shows that (i) patterns of proteins revealed by SDS-PAGE are strikingly different for all five species, (ii) a large number of these proteins cannot be identified by database searches, indicating that a significant proportion of wall proteins have not been previously described, (iii) the major proteins that can be identified belong to very different classes of proteins, (iv) the majority of proteins found in the extracellular growth media are absent from their respective cell wall extracts, and (v) the results of the extraction process are indicative of higher order structure. It appears that aspects of speciation reside in the complement of extracellular wall proteins. The data represent a protein resource for cell wall studies complementary to EST (expressed sequence tag) and DNA sequencing strategies.

  10. Proteomic Analysis of Lyme Disease: Global Protein Comparison of Three Strains of Borrelia burgdorferi

    SciTech Connect

    Jacobs, Jon M.; Yang, Xiaohua; Luft, Benjamin J.; Dunn, John J.; Camp, David G.; Smith, Richard D.

    2005-04-01

    The Borrelia burgdorferi spirochete is the causative agent of Lyme disease, the most common tick-borne disease in the United States. It has been studied extensively to help understand its pathogenicity of infection and how it can persist in different mammalian hosts. We report the proteomic analysis of the archetype B. burgdorferi B31 strain and two other strains (ND40 and JD-1) having different Borrelia pathotypes using strong cation exchange fractionation of proteolytic peptides followed by high-resolution, reversed phase capillary liquid chromatography coupled with ion trap tandem mass spectrometric (LC-MS/MS) analysis. Protein identification was facilitated by the availability of the complete B31 genome sequence. A total of 665 Borrelia proteins were identified, representing ~38% coverage of the theoretical B31 proteome. A significant overlap was observed between the identified proteins in direct comparisons between any two strains (>72%), but distinct differences were observed among identified hypothetical and outer membrane proteins of the three strains. Such a concurrent proteomic overview of three Borrelia strains based upon only the B31 genome sequence is shown to provide significant insights into the presence or absence of specific proteins and a broad overall comparison among strains.

  11. An optimistic protein assembly from sequence reads salvaged an uncharacterized segment of mouse picobirnavirus

    PubMed Central

    Gonzalez, Gabriel; Sasaki, Michihito; Burkitt-Gray, Lucy; Kamiya, Tomonori; Tsuji, Noriko M.; Sawa, Hirofumi; Ito, Kimihito

    2017-01-01

    Advances in Next Generation Sequencing technologies have enabled the generation of millions of sequences from microorganisms. However, distinguishing the sequence of a novel species from sequencing errors remains a technical challenge when the novel species is highly divergent from the closest known species. To solve such a problem, we developed a new method called Optimistic Protein Assembly from Reads (OPAR). This method is based on the assumption that protein sequences could be more conserved than the nucleotide sequences encoding them. By taking advantage of metagenomics, bioinformatics and conventional Sanger sequencing, our method successfully identified all coding regions of the mouse picobirnavirus for the first time. The salvaged sequences indicated that segment 1 of this virus was more divergent from its homologues in other Picobirnaviridae species than segment 2. For this reason, only segment 2 of mouse picobirnavirus has been detected in previous studies. OPAR web tool is available at http://bioinformatics.czc.hokudai.ac.jp/opar/. PMID:28071766

  12. Nucleotide sequence variation of the envelope protein gene identifies two distinct genotypes of yellow fever virus.

    PubMed

    Chang, G J; Cropp, B C; Kinney, R M; Trent, D W; Gubler, D J

    1995-09-01

    The evolution of yellow fever virus over 67 years was investigated by comparing the nucleotide sequences of the envelope (E) protein genes of 20 viruses isolated in Africa, the Caribbean, and South America. Uniformly weighted parsimony algorithm analysis defined two major evolutionary yellow fever virus lineages designated E genotypes I and II. E genotype I contained viruses isolated from East and Central Africa. E genotype II viruses were divided into two sublineages: IIA viruses from West Africa and IIB viruses from America, except for a 1979 virus isolated from Trinidad (TRINID79A). Unique signature patterns were identified at 111 nucleotide and 12 amino acid positions within the yellow fever virus E gene by signature pattern analysis. Yellow fever viruses from East and Central Africa contained unique signatures at 60 nucleotide and five amino acid positions, those from West Africa contained unique signatures at 25 nucleotide and two amino acid positions, and viruses from America contained such signatures at 30 nucleotide and five amino acid positions in the E gene. The dissemination of yellow fever viruses from Africa to the Americas is supported by the close genetic relatedness of genotype IIA and IIB viruses and genetic evidence of a possible second introduction of yellow fever virus from West Africa, as illustrated by the TRINID79A virus isolate. The E protein genes of American IIB yellow fever viruses had higher frequencies of amino acid substitutions than did genes of yellow fever viruses of genotypes I and IIA on the basis of comparisons with a consensus amino acid sequence for the yellow fever E gene. The great variation in the E proteins of American yellow fever virus probably results from positive selection imposed by virus interaction with different species of mosquitoes or nonhuman primates in the Americas.

  13. Nucleotide sequence variation of the envelope protein gene identifies two distinct genotypes of yellow fever virus.

    PubMed Central

    Chang, G J; Cropp, B C; Kinney, R M; Trent, D W; Gubler, D J

    1995-01-01

    The evolution of yellow fever virus over 67 years was investigated by comparing the nucleotide sequences of the envelope (E) protein genes of 20 viruses isolated in Africa, the Caribbean, and South America. Uniformly weighted parsimony algorithm analysis defined two major evolutionary yellow fever virus lineages designated E genotypes I and II. E genotype I contained viruses isolated from East and Central Africa. E genotype II viruses were divided into two sublineages: IIA viruses from West Africa and IIB viruses from America, except for a 1979 virus isolated from Trinidad (TRINID79A). Unique signature patterns were identified at 111 nucleotide and 12 amino acid positions within the yellow fever virus E gene by signature pattern analysis. Yellow fever viruses from East and Central Africa contained unique signatures at 60 nucleotide and five amino acid positions, those from West Africa contained unique signatures at 25 nucleotide and two amino acid positions, and viruses from America contained such signatures at 30 nucleotide and five amino acid positions in the E gene. The dissemination of yellow fever viruses from Africa to the Americas is supported by the close genetic relatedness of genotype IIA and IIB viruses and genetic evidence of a possible second introduction of yellow fever virus from West Africa, as illustrated by the TRINID79A virus isolate. The E protein genes of American IIB yellow fever viruses had higher frequencies of amino acid substitutions than did genes of yellow fever viruses of genotypes I and IIA on the basis of comparisons with a consensus amino acid sequence for the yellow fever E gene. The great variation in the E proteins of American yellow fever virus probably results from positive selection imposed by virus interaction with different species of mosquitoes or nonhuman primates in the Americas. PMID:7637022

  14. Unravelling the relationship between protein sequence and low-complexity regions entropies: Interactome implications.

    PubMed

    Martins, F; Gonçalves, R; Oliveira, J; Cruz-Monteagudo, M; Nieto-Villar, J M; Paz-y-Miño, C; Rebelo, I; Tejera, E

    2015-10-07

    Low-complexity regions are sub-sequences of biased composition in a protein sequence. The influence of these regions on protein evolution, specific functions and highly interactive capacities is well known. Although protein sequence entropy has been widely studied, its relationship with low-complexity regions and the subsequent effects on protein function remain unclear. In this work we propose a theoretical and empirical model integrating the sequence entropy with local complexity parameters. Our results indicate that the protein sequence entropy is related to the protein length, the entropies inside and outside the low-complexity regions, and the number and average size of those regions. We found a small but significant increment in the sequence entropy of hub proteins. In agreement with our theoretical model, this increment is highly dependent on the balance between the increment of protein length and the average size of the low-complexity regions. Finally, our models and protein analysis provide evidence that modifications in the average size of low-complexity regions are more relevant in hub proteins than changes in their number.
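
    As a point of reference for the quantities discussed above, the sketch below computes the Shannon entropy of residue composition for a whole sequence and for a segment within it. It assumes the common definition H = -sum(p_i * log2(p_i)) over residue frequencies; the abstract does not state the authors' exact entropy or low-complexity definitions, and the function name, toy sequence and segment boundaries are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per residue) of the residue composition."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical sequence: a mixed region followed by a Gln-rich stretch.
    mixed = "MKVLAEGTRDSPIW"
    biased = "QQQQQQQQQQPQQQ"
    protein = mixed + biased
    print("whole:", shannon_entropy(protein))
    print("outside biased segment:", shannon_entropy(mixed))
    print("inside biased segment:", shannon_entropy(biased))
```

    A compositionally biased segment has lower entropy than a mixed one, so the whole-sequence value depends on how much of the length such segments occupy, which is the kind of dependence the abstract describes.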

  15. Nucleic acid (cDNA) and amino acid sequences of the maize endosperm protein glutelin-2.

    PubMed Central

    Prat, S; Cortadas, J; Puigdomènech, P; Palau, J

    1985-01-01

    The cDNA coding for a glutelin-2 protein from maize endosperm has been cloned and the complete amino acid sequence of the protein derived for the first time. An immature maize endosperm cDNA bank was screened for the expression of a beta-lactamase:glutelin-2 (G2) fusion polypeptide by using antibodies against the purified 28 kd G2 protein. A clone corresponding to the 28 kd G2 protein was sequenced and the primary structure of this protein was derived. Five regions can be defined in the protein sequence: an 11 residue N-terminal part, a repeated region formed by eight units of the sequence Pro-Pro-Pro-Val-His-Leu, an alternating Pro-X stretch 21 residues long, a Cys rich domain and a C-terminal part rich in Gln. The protein sequence is preceded by 19 residues which have the characteristics of the signal peptide found in secreted proteins. Unlike zeins, the main maize storage proteins, 28 kd glutelin-2 has several homologous sequences in common with other cereal storage proteins. PMID:3839076

  16. A comparison of protein quantitation assays for biopharmaceutical applications.

    PubMed

    Noble, J E; Knight, A E; Reason, A J; Di Matola, A; Bailey, M J A

    2007-10-01

    Dye-based protein determination assays are widely used to estimate protein concentration; however, various reports suggest that the response is dependent on the composition and sequence of the protein, limiting confidence in the resulting concentration estimates. In this study a diverse set of model proteins representing various sizes of protein and covalent modifications, some typical of biopharmaceuticals, has been used to assess the utility of dye-based protein concentration assays. The protein concentration assays (Bicinchoninic acid (BCA), Bradford, 3-(4-carboxybenzoyl)quinoline-2-carboxaldehyde (CBQCA), DC, Fluorescamine and Quant-i) were compared to the 'gold standard' assay, quantitative amino acid analysis (AAA). The assays that displayed the lowest variability between proteins, BCA and DC, also generated improved estimates when BSA was used as a standard, when compared to AAA derived concentrations. Assays read out by absorbance tended to display enhanced robustness and repeatability, whereas the fluorescence based assays had wider quantitation ranges and lower limits of detection. Protein modification, in the form of glycosylation and PEGylation, and the addition of excipients were found to affect the estimation of protein concentration for some of the assays when compared to the unmodified protein. We discuss the suitability and limitations of the selected assays for the estimation of protein concentration in biopharmaceutical applications.

  17. The matrix protein gene sequence analysis reveals close relationship between peste des petits ruminants virus (PPRV) and dolphin morbillivirus.

    PubMed

    Haffar, A; Libeau, G; Moussa, A; Cécile, M; Diallo, A

    1999-10-01

    The gene encoding the matrix protein of peste des petits ruminants virus (PPRV) has been cloned and its nucleotide sequence determined. This gene is 1466 nucleotides long and contains an open reading frame (ORF) capable of encoding a basic protein of 335 amino acid residues with a predicted molecular weight of 38,057 Da. This ORF starts at position 33-35 and ends with the codon TAA at position 1038-1040 thus leaving a long untranslated region (426 nucleotides) at the 3' end of the messenger RNA. This fragment is very G/C rich (68.5%) and in contrast to the ORF region appears to be least conserved in the M gene sequence of the morbilliviruses. A comparison of the PPRV M protein with those of other viruses in the group confirms the previously noted high degree of conservation for this protein sequence. The percent of identity within the group ranges from 76.7 to 86.9%, the highest being with the dolphin morbillivirus matrix protein.
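
    The coordinates quoted above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch uses only the numbers stated in the abstract (gene length 1466, start codon at 33-35, stop codon TAA at 1038-1040) and assumes 1-based inclusive positions; the function and variable names are hypothetical.

```python
def orf_summary(gene_length, start_codon_first, stop_codon_first, stop_codon_last):
    """Derive codon count and untranslated lengths from 1-based coordinates."""
    # Coding bases run from the first base of the start codon up to the
    # base immediately before the stop codon.
    coding_nt = stop_codon_first - start_codon_first
    if coding_nt % 3 != 0:
        raise ValueError("start and stop codons are not in the same frame")
    return {
        "codons": coding_nt // 3,                    # amino acid residues
        "utr5_nt": start_codon_first - 1,            # bases before the start codon
        "utr3_nt": gene_length - stop_codon_last,    # bases after the stop codon
    }


if __name__ == "__main__":
    summary = orf_summary(
        gene_length=1466,
        start_codon_first=33,
        stop_codon_first=1038,
        stop_codon_last=1040,
    )
    print(summary)  # {'codons': 335, 'utr5_nt': 32, 'utr3_nt': 426}
```

    The result reproduces the 335 residues and the 426-nucleotide 3' untranslated region reported in the abstract.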

  18. A simple and fast approach to prediction of protein secondary structure from multiply aligned sequences with accuracy above 70%.

    PubMed Central

    Mehta, P. K.; Heringa, J.; Argos, P.

    1995-01-01

    To improve secondary structure predictions in protein sequences, the information residing in multiple sequence alignments of substituted but structurally related proteins is exploited. A database comprised of 70 protein families and a total of 2,500 sequences, some of which were aligned by tertiary structural superpositions, was used to calculate residue exchange weight matrices within alpha-helical, beta-strand, and coil substructures, respectively. Secondary structure predictions were made based on the observed residue substitutions in local regions of the multiple alignments and the largest possible associated exchange weights in each of the three matrix types. Comparison of the observed and predicted secondary structure on a per-residue basis yielded a mean accuracy of 72.2%. Individual alpha-helix, beta-strand, and coil states were respectively predicted at 66.7, and 75.8% correctness, representing a well-balanced three-state prediction. The accuracy level, verified by cross-validation through jack-knife tests on all protein families, dropped, on average, to only 70.9%, indicating the rigor of the prediction procedure. On the basis of robustness, conceptual clarity, accuracy, and executable efficiency, the method has considerable advantage, especially with its sole reliance on amino acid substitutions within structurally related proteins. PMID:8580842
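
    The per-residue accuracy figures quoted here are of the kind usually computed as the fraction of residues whose predicted state matches the observed state. The sketch below shows that calculation under assumed conventions: one-letter state codes (H for helix, E for strand, C for coil) and toy strings invented for illustration. It does not reproduce the authors' prediction method.

```python
def per_residue_accuracy(observed, predicted):
    """Percentage of positions where the predicted state equals the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


def per_state_accuracy(observed, predicted, state):
    """Accuracy restricted to residues whose observed state is `state`."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if o == state]
    if not pairs:
        return None  # state never observed, so accuracy is undefined
    return 100.0 * sum(o == p for o, p in pairs) / len(pairs)


# Toy example (hypothetical data, not from the paper).
observed = "HHHHEEEECC"
predicted = "HHHCEEECCC"
print(per_residue_accuracy(observed, predicted))     # 80.0
print(per_state_accuracy(observed, predicted, "H"))  # 75.0
```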

  19. Camps 2.0: exploring the sequence and structure space of prokaryotic, eukaryotic, and viral membrane proteins.

    PubMed

    Neumann, Sindy; Hartmann, Holger; Martin-Galiano, Antonio J; Fuchs, Angelika; Frishman, Dmitrij

    2012-03-01

    Structural bioinformatics of membrane proteins is still in its infancy, and the picture of their fold space is only beginning to emerge. Because only a handful of three-dimensional structures are available, sequence comparison and structure prediction remain the main tools for investigating sequence-structure relationships in membrane protein families. Here we present a comprehensive analysis of the structural families corresponding to α-helical membrane proteins with at least three transmembrane helices. The new version of our CAMPS database (CAMPS 2.0) covers nearly 1300 eukaryotic, prokaryotic, and viral genomes. Using an advanced classification procedure, which is based on high-order hidden Markov models and considers both sequence similarity as well as the number of transmembrane helices and loop lengths, we identified 1353 structurally homogeneous clusters roughly corresponding to membrane protein folds. Only 53 clusters are associated with experimentally determined three-dimensional structures, and for these clusters CAMPS is in reasonable agreement with structure-based classification approaches such as SCOP and CATH. We therefore estimate that ∼1300 structures would need to be determined to provide a sufficient structural coverage of polytopic membrane proteins. CAMPS 2.0 is available at http://webclu.bio.wzw.tum.de/CAMPS2.0/. Copyright © 2011 Wiley Periodicals, Inc.

  20. Development of a protein microarray using sequence-specific DNA binding domain on DNA chip surface

    SciTech Connect

    Choi, Yoo Seong; Pack, Seung Pil; Yoo, Young Je

    2005-04-22

    A protein microarray based on a DNA microarray platform was developed to identify protein-protein interactions in vitro. A conventional DNA chip surface prepared with a 156-bp PCR product served as the substrate for the protein microarray. A high-affinity, sequence-specific DNA binding domain, the GAL4 DNA binding domain, was introduced to the protein microarray as the fusion partner of a model target protein, enhanced green fluorescent protein. The target protein was immobilized directly on the DNA chip surface in an oriented manner. Finally, a monoclonal antibody against the target protein was used to identify the immobilized protein on the surface. This study shows that a conventional DNA chip can be used to make a protein microarray directly, and that this novel protein microarray is applicable as a tool for identifying protein-protein interactions.

  1. The amino-acid sequence of the 2S sulphur-rich proteins from seeds of Brazil nut (Bertholletia excelsa H.B.K.).

    PubMed

    Ampe, C; Van Damme, J; de Castro, L A; Sampaio, M J; Van Montagu, M; Vandekerckhove, J

    1986-09-15

    Storage proteins of the albumin solubility fraction from seeds of Bertholletia excelsa H.B.K. were separated by reversed-phase high-performance liquid chromatography and their primary structures were determined by gas-phase sequencing on intact polypeptides and on the overlapping tryptic and thermolysin peptides. The 2S storage proteins consist of two subunits linked by disulphide bridges. The large subunit (8.5 kDa) is expressed in at least six different isoforms while the small subunit (3.6 kDa) consists of only one form. These proteins are extremely rich in glutamine, glutamic acid, arginine and the sulphur-containing amino acids cysteine and methionine. One of the variants even contains a sequence of six methionine residues in a row. Comparison with known sequences of 2S proteins of other dicotyledonous plants shows limited but distinct sequence homology. In particular, the positions of the cysteine residues relative to each other appear to be completely conserved, suggesting that tertiary structure constraints imposed by disulphide bridges dominate sequence conservation. It has been proposed that the two subunits of a related protein (the Brassica napus storage protein) are cleaved from a precursor polypeptide [Crouch, M. L., Tenbarge, K. M., Simon, A. E. & Ferl, R. (1983) J. Mol. Appl. Genet. 2, 273-283]. The amino acid sequence homology of the Brazil nut protein with the former suggests that a similar protein processing event could occur.

  2. Comparison of Metalloproteinase Protein and Activity Profiling

    PubMed Central

    Giricz, Orsi; Lauer, Janelle L.; Fields, Gregg B.

    2010-01-01

    Proteolytic enzymes play fundamental roles in many biological processes. Members of the matrix metalloproteinase (MMP) family have been shown to take part in processes crucial in disease progression. The present study used the ExcelArray Human MMP/TIMP Array to quantify MMP and tissue inhibitor of metalloproteinase (TIMP) production in the lysates and media of 14 cancer and one normal cell line. The overall patterns were very similar in terms of which MMPs and TIMPs were secreted in the media versus associated with the cells in the individual samples. However, more MMP was found in the media, both in amount and in variety. TIMP-1 was produced in all cell lines. MMP activity assays with three different FRET substrates were then utilized to determine if protein production correlated with function for the WM-266-4 and BJ cell lines. Metalloproteinase activity was observed for both cell lines with a general MMP substrate (Knight SSP), consistent with protein production data. However, although both cell lines promoted the hydrolysis of a more selective MMP substrate (NFF-3), metalloproteinase activity was only confirmed in the BJ cell line. The use of inhibitors to confirm metalloproteinase activities pointed to the strengths and weaknesses of in situ FRET substrate assays. PMID:20920458

  3. Full validation of therapeutic antibody sequences by middle-up mass measurements and middle-down protein sequencing.

    PubMed

    Resemann, Anja; Jabs, Wolfgang; Wiechmann, Anja; Wagner, Elsa; Colas, Olivier; Evers, Waltraud; Belau, Eckhard; Vorwerg, Lars; Evans, Catherine; Beck, Alain; Suckau, Detlev

    2016-01-01

    The regulatory bodies request full sequence data assessment both for innovator and biosimilar monoclonal antibodies (mAbs). Full sequence coverage is typically used to verify the integrity of the analytical data obtained following the combination of multiple LC-MS/MS datasets from orthogonal protease digests (so called "bottom-up" approaches). Top-down or middle-down mass spectrometric approaches have the potential to minimize artifacts, reduce overall analysis time and provide orthogonality to this traditional approach. In this work we report a new combined approach involving middle-up LC-QTOF and middle-down LC-MALDI in-source decay (ISD) mass spectrometry. This was applied to cetuximab, panitumumab and natalizumab, selected as representative US Food and Drug Administration- and European Medicines Agency-approved mAbs. The goal was to unambiguously confirm their reference sequences and examine the general applicability of this approach. Furthermore, a new measure for assessing the integrity and validity of results from middle-down approaches is introduced - the "Sequence Validation Percentage." Full sequence data assessment of the 3 antibodies was achieved enabling all 3 sequences to be fully validated by a combination of middle-up molecular weight determination and middle-down protein sequencing. Three errors in the reference amino acid sequence of natalizumab, causing a cumulative mass shift of only -2 Da in the natalizumab Fd domain, were corrected as a result of this work.

  4. Full validation of therapeutic antibody sequences by middle-up mass measurements and middle-down protein sequencing

    PubMed Central

    Resemann, Anja; Jabs, Wolfgang; Wiechmann, Anja; Wagner, Elsa; Colas, Olivier; Evers, Waltraud; Belau, Eckhard; Vorwerg, Lars; Evans, Catherine; Beck, Alain; Suckau, Detlev

    2016-01-01

    The regulatory bodies request full sequence data assessment both for innovator and biosimilar monoclonal antibodies (mAbs). Full sequence coverage is typically used to verify the integrity of the analytical data obtained following the combination of multiple LC-MS/MS datasets from orthogonal protease digests (so called “bottom-up” approaches). Top-down or middle-down mass spectrometric approaches have the potential to minimize artifacts, reduce overall analysis time and provide orthogonality to this traditional approach. In this work we report a new combined approach involving middle-up LC-QTOF and middle-down LC-MALDI in-source decay (ISD) mass spectrometry. This was applied to cetuximab, panitumumab and natalizumab, selected as representative US Food and Drug Administration- and European Medicines Agency-approved mAbs. The goal was to unambiguously confirm their reference sequences and examine the general applicability of this approach. Furthermore, a new measure for assessing the integrity and validity of results from middle-down approaches is introduced – the “Sequence Validation Percentage.” Full sequence data assessment of the 3 antibodies was achieved enabling all 3 sequences to be fully validated by a combination of middle-up molecular weight determination and middle-down protein sequencing. Three errors in the reference amino acid sequence of natalizumab, causing a cumulative mass shift of only −2 Da in the natalizumab Fd domain, were corrected as a result of this work. PMID:26760197

  5. Sequence signatures and the probabilistic identification of proteins in the Myc-Max-Mad network

    PubMed Central

    Atchley, William R.; Fernandes, Andrew D.

    2005-01-01

    Accurate identification of specific groups of proteins by their amino acid sequence is an important goal in genome research. Here we combine information theory with fuzzy logic search procedures to identify sequence signatures or predictive motifs for members of the Myc-Max-Mad transcription factor network. Myc is a well known oncoprotein, and this family is involved in cell proliferation, apoptosis, and differentiation. We describe a small set of amino acid sites from the N-terminal portion of the basic helix-loop-helix (bHLH) domain that provide very accurate sequence signatures for the Myc-Max-Mad transcription factor network and three of its member proteins. A predictive motif involving 28 contiguous bHLH sequence elements found 337 network proteins in the GenBank NR database with no mismatches or misidentifications. This motif also identifies at least one previously unknown fungal protein with strong affinity to the Myc-Max-Mad network. Another motif found 96% of known Myc protein sequences with only a single mismatch, including sequences from genomes previously not thought to contain Myc proteins. The predictive motif for Myc is very similar to the ancestral sequence for the Myc group estimated from phylogenetic analyses. Based on available crystal structure studies, this motif is discussed in terms of its functional consequences. Our results provide insight into evolutionary diversification of DNA binding and dimerization in a well characterized family of regulatory proteins and provide a method of identifying signature motifs in protein families. PMID:15851686

  6. eVolver: an optimization engine for evolving protein sequences to stabilize the respective structures.

    PubMed

    Brylinski, Michal

    2013-07-31

    Many structural bioinformatics approaches employ sequence profile-based threading techniques. To improve fold recognition rates, homology searching may include artificially evolved amino acid sequences, which were demonstrated to enhance the sensitivity of protein threading in targeting midnight zone templates. We describe implementation details of eVolver, an optimization algorithm that evolves protein sequences to stabilize the respective structures by a variety of potentials, which are compatible with those commonly used in protein threading. In a case study focusing on LARG PDZ domain, we show that artificially evolved sequences have quite high capabilities to recognize the correct protein structures using standard sequence profile-based fold recognition. Computationally designed protein sequences can be incorporated in existing sequence profile-based threading approaches to increase their sensitivity. They also provide a desired linkage between protein structure and function in in silico experiments that relate to e.g. the completeness of protein structure space, the origin of folds and protein universe. eVolver is freely available as a user-friendly webserver and a well-documented stand-alone software distribution at http://www.brylinski.org/evolver.

  7. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment.

    PubMed

    Remmert, Michael; Biegert, Andreas; Hauser, Andreas; Söding, Johannes

    2011-12-25

    Sequence-based protein function and structure prediction depends crucially on sequence-search sensitivity and accuracy of the resulting sequence alignments. We present an open-source, general-purpose tool that represents both query and database sequences by profile hidden Markov models (HMMs): 'HMM-HMM-based lightning-fast iterative sequence search' (HHblits; http://toolkit.genzentrum.lmu.de/hhblits/). Compared to the sequence-search tool PSI-BLAST, HHblits is faster owing to its discretized-profile prefilter, has 50-100% higher sensitivity and generates more accurate alignments.

  8. Improvements to pairwise sequence comparison (PASC): a genome-based web tool for virus classification.

    PubMed

    Bao, Yiming; Chetvernin, Vyacheslav; Tatusova, Tatiana

    2014-12-01

    The number of viral genome sequences in the public databases is increasing dramatically, and these sequences are playing an important role in virus classification. Pairwise sequence comparison is a sequence-based virus classification method. A program using this method calculates the pairwise identities of virus sequences within a virus family and displays their distribution, and visual analysis helps to determine demarcations at different taxonomic levels such as strain, species, genus and subfamily. Subsequent comparison of new sequences against existing ones allows viruses from which the new sequences were derived to be classified. Although this method cannot be used as the only criterion for virus classification in some cases, it is a quantitative method and has many advantages over conventional virus classification methods. It has been applied to several virus families, and there is an increasing interest in using this method for other virus families/groups. The Pairwise Sequence Comparison (PASC) classification tool was created at the National Center for Biotechnology Information. The tool's database stores pairwise identities for complete genomes/segments of 56 virus families/groups. Data in the system are updated every day to reflect changes in virus taxonomy and additions of new virus sequences to the public database. The web interface of the tool ( http://www.ncbi.nlm.nih.gov/sutils/pasc/ ) makes it easy to navigate and perform analyses. Multiple new viral genome sequences can be tested simultaneously with this system to suggest the taxonomic position of virus isolates in a specific family. PASC eliminates potential discrepancies in the results caused by different algorithms and/or different data used by researchers.
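
    The core quantity described here, a pairwise identity for every pair of sequences in a family, can be sketched as follows. This is not the PASC implementation: it assumes the sequences are already aligned to equal length with "-" as the gap character, and it assumes one particular denominator (all columns in which at least one sequence has a residue). The function names and the toy isolates are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where both sequences have a gap are ignored; a gap opposite a
    residue counts as a mismatch (an assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def pairwise_identities(aligned):
    """Map each unordered pair of sequence names to its percent identity."""
    return {
        (n1, n2): percent_identity(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


# Toy alignment (hypothetical data).
aligned = {
    "isolate_a": "ATGC-TA",
    "isolate_b": "ATGCATA",
    "isolate_c": "ATCC-TT",
}
identities = pairwise_identities(aligned)
for pair, value in identities.items():
    print(pair, round(value, 1))
```

    Plotting a histogram of these values for a whole family is what would reveal the peaks and gaps used to suggest demarcations between taxonomic levels.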

  9. Investigation of the protein osteocalcin of Camelops hesternus: Sequence, structure and phylogenetic implications

    NASA Astrophysics Data System (ADS)

    Humpula, James F.; Ostrom, Peggy H.; Gandhi, Hasand; Strahler, John R.; Walker, Angela K.; Stafford, Thomas W.; Smith, James J.; Voorhies, Michael R.; George Corner, R.; Andrews, Phillip C.

    2007-12-01

    Ancient DNA sequences offer an extraordinary opportunity to unravel the evolutionary history of ancient organisms. Protein sequences offer another reservoir of genetic information that has recently become tractable through the application of mass spectrometric techniques. The extent to which ancient protein sequences resolve phylogenetic relationships, however, has not been explored. We determined the osteocalcin amino acid sequence from the bone of an extinct Camelid (21 ka, Camelops hesternus) excavated from Isleta Cave, New Mexico and three bones of extant camelids: bactrian camel (Camelus bactrianus); dromedary camel (Camelus dromedarius) and guanaco (Llama guanacoe) for a diagenetic and phylogenetic assessment. There was no difference in sequence among the four taxa. Structural attributes observed in both modern and ancient osteocalcin include a post-translation modification, Hyp 9, deamidation of Gln 35 and Gln 39, and oxidation of Met 36. Carbamylation of the N-terminus in ancient osteocalcin may result in blockage and explain previous difficulties in sequencing ancient proteins via Edman degradation. A phylogenetic analysis using osteocalcin sequences of 25 vertebrate taxa was conducted to explore osteocalcin protein evolution and the utility of osteocalcin sequences for delineating phylogenetic relationships. The maximum likelihood tree closely reflected generally recognized taxonomic relationships. For example, maximum likelihood analysis recovered rodents, birds and, within hominins, the Homo-Pan-Gorilla trichotomy. Within Artiodactyla, character state analysis showed that a substitution of Pro 4 for His 4 defines the Capra-Ovis clade within Artiodactyla. Homoplasy in our analysis indicated that osteocalcin evolution is not a perfect indicator of species evolution. Limited sequence availability prevented assigning functional significance to sequence changes. Our preliminary analysis of osteocalcin evolution represents an initial step towards a

  10. An efficient binomial model-based measure for sequence comparison and its application.

    PubMed

    Liu, Xiaoqing; Dai, Qi; Li, Lihua; He, Zerong

    2011-04-01

    Sequence comparison is one of the major tasks in bioinformatics, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations. There are several similarity/dissimilarity measures for sequence comparison, but challenges remain. This paper presented a binomial model-based measure to analyze biological sequences. With the help of a random indicator, the occurrence of a word at any position of a sequence can be regarded as a random Bernoulli variable, and the distribution of the sum of the word occurrences is well known to be binomial. By using a recursive formula, we computed the binomial probability of the word count and proposed a binomial model-based measure based on the relative entropy. The proposed measure was tested by extensive experiments including classification of HEV genotypes and phylogenetic analysis, and further compared with alignment-based and alignment-free measures. The results demonstrate that the proposed measure based on the binomial model is more efficient.
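
    The binomial step described in this abstract can be illustrated with the standard recurrence P(k+1) = P(k) · (n − k)/(k + 1) · p/(1 − p). The sketch below makes several assumptions that the abstract does not confirm: the number of trials is taken as the number of overlapping word positions, the per-position probability is the product of single-letter frequencies, and all names are hypothetical. The relative-entropy measure itself is not reproduced.

```python
from collections import Counter


def binomial_pmf(n, p):
    """Return [P(X = 0), ..., P(X = n)] for X ~ Binomial(n, p) via recurrence."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    pmf = [(1.0 - p) ** n]            # P(X = 0)
    ratio = p / (1.0 - p)
    for k in range(n):
        # P(k + 1) = P(k) * (n - k) / (k + 1) * p / (1 - p)
        pmf.append(pmf[-1] * (n - k) / (k + 1) * ratio)
    return pmf


def word_count_probability(seq, word):
    """Observed count of `word` in `seq` and its binomial probability.

    Assumption: p is the product of the letter frequencies of `word` in `seq`.
    """
    n = len(seq) - len(word) + 1      # number of overlapping positions
    freqs = Counter(seq)
    p = 1.0
    for letter in word:
        p *= freqs[letter] / len(seq)
    count = sum(seq[i:i + len(word)] == word for i in range(n))
    return count, binomial_pmf(n, p)[count]


count, prob = word_count_probability("ACGTACGTAC", "AC")
print(count, prob)
```

    For very long sequences the starting term (1 − p)^n can underflow, so a production version would more likely work with logarithms.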

  11. Secure distributed genome analysis for GWAS and sequence comparison computation

    PubMed Central

    2015-01-01

    Background The rapid increase in the availability and volume of genomic data makes significant advances in biomedical research possible, but sharing of genomic data poses challenges due to the highly sensitive nature of such data. To address the challenges, a competition for secure distributed processing of genomic data was organized by the iDASH research center. Methods In this work we propose techniques for securing computation with real-life genomic data for minor allele frequency and chi-squared statistics computation, as well as distance computation between two genomic sequences, as specified by the iDASH competition tasks. We put forward novel optimizations, including a generalization of a version of mergesort, which might be of independent interest. Results We provide implementation results of our techniques based on secret sharing that demonstrate practicality of the suggested protocols and also report on performance improvements due to our optimization techniques. Conclusions This work describes our techniques, findings, and experimental results developed and obtained as part of iDASH 2015 research competition to secure real-life genomic computations and shows feasibility of securely computing with genomic data in practice. PMID:26733307

  12. Chaos game representation of functional protein sequences, and simulation and multifractal analysis of induced measures

    NASA Astrophysics Data System (ADS)

    Yu, Zu-Guo; Xiao, Qian-Jun; Shi, Long; Yu, Jun-Wu; Vo, Anh

    2010-06-01

    Investigating the biological function of proteins is a key aspect of protein studies. Bioinformatic methods have become important for studying the biological function of proteins. In this paper, we first give the chaos game representation (CGR) of randomly-linked functional protein sequences, then propose the use of the recurrent iterated function systems (RIFS) in fractal theory to simulate the measure based on their chaos game representations. This method helps to extract some features of functional protein sequences, and furthermore the biological functions of these proteins. Then multifractal analysis of the measures based on the CGRs of randomly-linked functional protein sequences is performed. We find that the CGRs have clear fractal patterns. The numerical results show that the RIFS can simulate the measure based on the CGR very well. The relative standard error and the estimated probability matrix in the RIFS do not depend on the order to link the functional protein sequences. The estimated probability matrices in the RIFS with different biological functions are evidently different. Hence the estimated probability matrices in the RIFS can be used to characterise the difference among linked functional protein sequences with different biological functions. From the values of the Dq curves, one sees that these functional protein sequences are not completely random. The Dq of all linked functional proteins studied are multifractal-like and sufficiently smooth for the Cq (analogous to specific heat) curves to be meaningful. Furthermore, the Dq curves of the measure μ based on their CGRs for different orders to link the functional protein sequences are almost identical if q >= 0. Finally, the Cq curves of all linked functional proteins resemble a classical phase transition at a critical point.

  13. Sensitive protein comparisons with profiles and hidden Markov models.

    PubMed

    Hofmann, K

    2000-05-01

    Sequence database searches have become an important tool for the life sciences in general and for gene discovery-driven biotechnology in particular. Both the functional assignment of newly found proteins and the mining of genome databases for functional candidates are equally important tasks typically addressed by database searches. Sensitivity and reliability of the search methods are of crucial importance. The overall performance of sequence alignments and database searches can be enhanced considerably when profiles or hidden Markov models (HMMs) derived from protein families are used as query objects instead of single sequences. This review discusses the concept of profiles, generalised profiles and profile-HMMs, the methods by which they are constructed and the scope of possible applications in gene discovery and gene functional assignment.

  14. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    SciTech Connect

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-12-28

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences.

  15. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    PubMed Central

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-01-01

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences. PMID:26723608

  16. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    NASA Astrophysics Data System (ADS)

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-12-01

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences.

  17. In Silico Characterization of Pectate Lyase Protein Sequences from Different Source Organisms

    PubMed Central

    Dubey, Amit Kumar; Yadav, Sangeeta; Kumar, Manish; Singh, Vinay Kumar; Sarangi, Bijaya Ketan; Yadav, Dinesh

    2010-01-01

    A total of 121 protein sequences of pectate lyases were subjected to homology search, multiple sequence alignment, phylogenetic tree construction, and motif analysis. The phylogenetic tree constructed revealed different clusters based on different source organisms representing bacterial, fungal, plant, and nematode pectate lyases. The multiple accessions of bacterial, fungal, nematode, and plant pectate lyase protein sequences were placed closely revealing a sequence level similarity. The multiple sequence alignment of these pectate lyase protein sequences from different source organisms showed conserved regions at different stretches with maximum homology from amino acid residues 439–467, 715–816, and 829–910 which could be used for designing degenerate primers or probes specific for pectate lyases. The motif analysis revealed a conserved Pec_Lyase_C domain uniformly observed in all pectate lyases irrespective of variable sources suggesting its possible role in structural and enzymatic functions. PMID:21048874

  18. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    PubMed

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
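
The reported figures can be checked with the usual definition of parallel efficiency, speedup divided by the number of cores. The short sketch below assumes that definition (the abstract does not state the formula) and uses only the numbers quoted above.

```python
def parallel_efficiency(speedup, cores):
    """Parallel efficiency under the standard definition E = S / p."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# Figures quoted in the abstract: a speedup of 42 on the 48-core SCC.
efficiency = parallel_efficiency(42, 48)
print(f"{efficiency:.3f}")
```

Under this definition the value is 0.875, consistent with the rounded efficiency of 0.9 quoted in the abstract.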

  19. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    PubMed Central

    Sharma, Anuj; Manolakos, Elias S.

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  20. Nucleotide sequence of murine PCNA: interspecies comparison of the cDNA and the 5' flanking region of the gene.

    PubMed

    Shipman-Appasamy, P M; Cohen, K S; Prystowsky, M B

    1991-01-01

    Proliferating cell nuclear antigen (PCNA) RNA levels are regulated by transcription as well as by changes in stability in growing cells. We have cloned the murine PCNA cDNA and a fragment of the murine PCNA gene flanking the transcription initiation site. Comparison of the murine deduced amino acid sequence with the PCNA sequence from rat, human, Drosophila, Saccharomyces cerevisiae, and higher plants reveals extensive homology between species. The homology is likely to be related to the fundamental role of PCNA as an auxiliary protein for DNA replication. Consensus sequences for transcriptional regulatory factors identified within 520 bp 5' of the cap site of the murine PCNA gene include: an inverted CCAAT site, an enhancer core element (EBP-1), three cAMP-response elements (CRE-BP), one AP-2 site, three Sp1 sites, and two octamer sequences. The first 20 bp of the transcriptional unit are homologous to an initiator element, which may direct transcription from RNA polymerase II in the absence of a TATAA box. The consensus elements in the murine PCNA gene are similar in sequence and/or location to elements identified in the genes for human, Drosophila, and yeast PCNA.
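
An "inverted" CCAAT site is one that appears as the reverse complement (ATTGG) on the strand being read. The sketch below shows one simple way to scan a sequence for a motif in both orientations. The promoter string is made up for illustration and is not the murine PCNA sequence, and exact string matching is a simplification of how consensus elements are identified in practice.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_motif(seq, motif):
    """Return (0-based start, orientation) for exact matches on either strand.

    "+" marks the motif as written; "-" marks its reverse complement,
    i.e. an inverted site.
    """
    seq, motif = seq.upper(), motif.upper()
    inverted = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == inverted:
            hits.append((i, "-"))
    return hits


# Hypothetical sequence containing one inverted and one direct CCAAT site.
promoter = "GGGATTGGCTGACCAATCA"
hits = find_motif(promoter, "CCAAT")
print(hits)
```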

  1. MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts.

    PubMed

    Deng, Xin; Cheng, Jianlin

    2011-12-14

    Multiple Sequence Alignment (MSA) is a basic tool for bioinformatics research and analysis. It is used in almost all bioinformatics tasks, such as protein structure modeling, gene and protein function prediction, DNA motif recognition, and phylogenetic analysis. Therefore, improving the accuracy of multiple sequence alignment is important for advancing many bioinformatics fields. We designed and developed a new method, MSACompro, to synergistically incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. The method differs from multiple sequence alignment methods (e.g. 3D-Coffee) that use the tertiary structure information of some sequences, since the structural information in our method is predicted entirely from sequences. To the best of our knowledge, applying predicted relative solvent accessibility and contact maps to multiple sequence alignment is novel. Rigorous benchmarking of our method on the standard benchmarks (i.e. BAliBASE, SABmark and OXBENCH) clearly demonstrated that incorporating predicted protein structural information improves multiple sequence alignment accuracy over the leading multiple protein sequence alignment tools that do not use this information, such as MSAProbs, ProbCons, Probalign, T-coffee, MAFFT and MUSCLE. The performance of the method is comparable to that of the state-of-the-art method PROMALS, which uses structural features and additional homologous sequences, with only slightly lower scores. MSACompro is an efficient and reliable multiple protein sequence alignment tool that can effectively incorporate predicted protein structural information into multiple sequence alignment. The software is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.

  2. Characterization of DNA-protein interactions using high-throughput sequencing data from pulldown experiments

    NASA Astrophysics Data System (ADS)

    Moreland, Blythe; Oman, Kenji; Curfman, John; Yan, Pearlly; Bundschuh, Ralf

    Methyl-binding domain (MBD) protein pulldown experiments have been a valuable tool in measuring the levels of methylated CpG dinucleotides. Due to the frequent use of this technique, high-throughput sequencing data sets are available that allow a detailed quantitative characterization of the underlying interaction between methylated DNA and MBD proteins. Analyzing such data sets, we first found that two such proteins cannot bind closer to each other than 2 bp, consistent with structural models of the DNA-protein interaction. Second, the large amount of sequencing data allowed us to find rather wea