Science.gov

Sample records for protein sequence comparison

  1. Protein sequence comparison and protein evolution

    SciTech Connect

    Pearson, W.R.

    1995-12-31

    This tutorial was one of eight tutorials selected to be presented at the Third International Conference on Intelligent Systems for Molecular Biology which was held in the United Kingdom from July 16 to 19, 1995. This tutorial examines how the information conserved during the evolution of a protein molecule can be used to reliably infer homology, and thus a shared protein fold and possibly a shared active site or function. The authors start by reviewing a geological/evolutionary time scale. Next they look at the evolution of several protein families. During the tutorial, these families will be used to demonstrate that homologous protein ancestry can be inferred with confidence. They also examine different modes of protein evolution and consider some hypotheses that have been presented to explain the very earliest events in protein evolution. The next part of the tutorial will examine the technical aspects of protein sequence comparison. Both optimal and heuristic algorithms and their associated parameters that are used to characterize protein sequence similarities are discussed. Perhaps more importantly, they survey the statistics of local similarity scores, and how these statistics can be used both to improve the selectivity of a search and to evaluate the significance of a match. They then examine distantly related members of three protein families, the serine proteases, the glutathione transferases, and the G-protein-coupled receptors (GCRs). Finally, they discuss how sequence similarity can be used to examine internal repeated or mosaic structures in proteins.
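
    The abstract above mentions the statistics of local similarity scores without giving formulas. As a minimal sketch, assuming the widely used extreme-value (Karlin-Altschul) model for local alignment scores, the expected number of chance matches can be computed as shown here. The function names and the values of `lam` and `k` are illustrative placeholders, not taken from the tutorial.

```python
import math

def expected_hits(score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring >= score (the E-value),
    assuming the extreme-value form E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)

def p_value(e_value):
    """Probability of at least one chance match, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

def bit_score(score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * score - math.log(k)) / math.log(2)

# Placeholder parameters (assumptions for illustration only).
LAM, K = 0.3176, 0.134
e_low = expected_hits(60, query_len=300, db_len=1_000_000, lam=LAM, k=K)
e_high = expected_hits(100, query_len=300, db_len=1_000_000, lam=LAM, k=K)
```

    Under this model a higher raw score always yields a smaller E-value, which is how a score threshold can trade sensitivity against selectivity in a database search.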

  2. nWayComp: A Tool for Universal Comparison of DNA and Protein Sequences

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The increasing number of whole genomic sequences of microorganisms has increased the complexity of genome-wide annotation and gene sequence comparison among multiple microorganisms. To address this problem, we developed nWayComp software that compares DNA and protein sequences of phylogenetically-r...

  3. Identification of the bacteriophage T5 dUTPase by protein sequence comparisons.

    PubMed

    Kaliman, A V

    1996-01-01

    It is shown by protein sequence comparisons that a 148 amino acid open reading frame (ORF 148) located at 67% of the bacteriophage T5 genome encodes a protein with strong similarity to known dUTPases. This protein contains five characteristic amino acid sequence motifs that are common to the dUTPase gene family. A similarity in size and high degree of sequence identity strongly suggest that the protein encoded by the ORF 148 of bacteriophage T5 is dUTPase. PMID:8988373

  4. Progressive structure-based alignment of homologous proteins: Adopting sequence comparison strategies.

    PubMed

    Joseph, Agnel Praveen; Srinivasan, Narayanaswamy; de Brevern, Alexandre G

    2012-09-01

    Comparison of multiple protein structures has a broad range of applications in the analysis of protein structure, function and evolution. Multiple structure alignment tools (MSTAs) are necessary to obtain a simultaneous comparison of a family of related folds. In this study, we have developed a method for multiple structure comparison largely based on sequence alignment techniques. A widely used Structural Alphabet named Protein Blocks (PBs) was used to transform the information on 3D protein backbone conformation into a 1D sequence string. A progressive alignment strategy similar to CLUSTALW was adopted for multiple PB sequence alignment (mulPBA). Highly similar stretches identified by the pairwise alignments are given higher weights during the alignment. The residue equivalences from PB based alignments are used to obtain a three dimensional fit of the structures followed by an iterative refinement of the structural superposition. Systematic comparisons using benchmark datasets of MSTAs underline that the alignment quality is better than MULTIPROT, MUSTANG and the alignments in HOMSTRAD, in more than 85% of the cases. Comparison with other rigid-body and flexible MSTAs also indicates that mulPBA alignments are superior to most of the rigid-body MSTAs and highly comparable to the flexible alignment methods. PMID:22676903

  5. 3D representations of amino acids—applications to protein sequence comparison and classification

    PubMed Central

    Li, Jie; Koehl, Patrice

    2014-01-01

    The amino acid sequence of a protein is the key to understanding its structure and ultimately its function in the cell. This paper addresses the fundamental issue of encoding amino acids in ways that the representation of such a protein sequence facilitates the decoding of its information content. We show that a feature-based representation in a three-dimensional (3D) space derived from amino acid substitution matrices provides an adequate representation that can be used for direct comparison of protein sequences based on geometry. We measure the performance of such a representation in the context of the protein structural fold prediction problem. We compare the results of classifying different sets of proteins belonging to distinct structural folds against classifications of the same proteins obtained from sequence alone or directly from structural information. We find that sequence alone performs poorly as a structure classifier. We show in contrast that the use of the three dimensional representation of the sequences significantly improves the classification accuracy. We conclude with a discussion of the current limitations of such a representation and with a description of potential improvements. PMID:25379143

  6. Thermal adaptation analyzed by comparison of protein sequences from mesophilic and extremely thermophilic Methanococcus species

    NASA Technical Reports Server (NTRS)

    Haney, P. J.; Badger, J. H.; Buldak, G. L.; Reich, C. I.; Woese, C. R.; Olsen, G. J.

    1999-01-01

    The genome sequence of the extremely thermophilic archaeon Methanococcus jannaschii provides a wealth of data on proteins from a thermophile. In this paper, sequences of 115 proteins from M. jannaschii are compared with their homologs from mesophilic Methanococcus species. Although the growth temperatures of the mesophiles are about 50 degrees C below that of M. jannaschii, their genomic G+C contents are nearly identical. The properties most correlated with the proteins of the thermophile include higher residue volume, higher residue hydrophobicity, more charged amino acids (especially Glu, Arg, and Lys), and fewer uncharged polar residues (Ser, Thr, Asn, and Gln). These are recurring themes, with all trends applying to 83-92% of the proteins for which complete sequences were available. Nearly all of the amino acid replacements most significantly correlated with the temperature change are the same relatively conservative changes observed in all proteins, but in the case of the mesophile/thermophile comparison there is a directional bias. We identify 26 specific pairs of amino acids with a statistically significant (P < 0.01) preferred direction of replacement.

  7. A statistical physics perspective on alignment-independent protein sequence comparison

    PubMed Central

    Chattopadhyay, Amit K.; Nasiev, Diar; Flower, Darren R.

    2015-01-01

    Motivation: Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-based similarity has performed poorly. Results: Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from ‘first passage probability distribution’ to summarize statistics of ensemble averaged amino acid propensity values. In this article, we introduce and elaborate this approach. Contact: d.r.flower@aston.ac.uk PMID:25810434

  8. Shotgun protein sequencing.

    SciTech Connect

    Faulon, Jean-Loup Michel; Heffelfinger, Grant S.

    2009-06-01

    A novel experimental and computational technique based on multiple enzymatic digestion of a protein or protein mixture that reconstructs protein sequences from sequences of overlapping peptides is described in this SAND report. This approach, analogous to shotgun sequencing of DNA, is to be used to sequence alternative spliced proteins, to identify post-translational modifications, and to sequence genetically engineered proteins.

  9. Establishing homologies in protein sequences

    NASA Technical Reports Server (NTRS)

    Dayhoff, M. O.; Barker, W. C.; Hunt, L. T.

    1983-01-01

    Computer-based statistical techniques used to determine homologies between proteins occurring in different species are reviewed. The technique is based on comparison of two protein sequences, either by relating all segments of a given length in one sequence to all segments of the second or by finding the best alignment of the two sequences. Approaches discussed include selection using printed tabulations, identification of very similar sequences, and computer searches of a database. The use of the SEARCH, RELATE, and ALIGN programs (Dayhoff, 1979) is explained; sample data are presented in graphs, diagrams, and tables and the construction of scoring matrices is considered.
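
    The first strategy described above, relating all segments of a given length in one sequence to all segments of the other, can be sketched as follows. This is an illustration only, not the RELATE program itself: it scores segment pairs by counting identical residues, whereas the original programs use scoring matrices, and the function names and example sequences are hypothetical.

```python
def segment_scores(seq_a, seq_b, length):
    """Score every segment of `length` residues in seq_a against every
    such segment in seq_b, using the identity count as a simple score.
    Returns a dict mapping (start_a, start_b) -> score."""
    scores = {}
    for i in range(len(seq_a) - length + 1):
        seg_a = seq_a[i:i + length]
        for j in range(len(seq_b) - length + 1):
            seg_b = seq_b[j:j + length]
            scores[(i, j)] = sum(x == y for x, y in zip(seg_a, seg_b))
    return scores

def best_segment_pair(seq_a, seq_b, length):
    """Return ((start_a, start_b), score) for a highest-scoring segment pair."""
    scores = segment_scores(seq_a, seq_b, length)
    return max(scores.items(), key=lambda item: item[1])

# Made-up example sequences sharing the stretch "TAYIAK".
SEQ_A = "MKTAYIAKQR"
SEQ_B = "GGTAYIAKGG"
ALL_SCORES = segment_scores(SEQ_A, SEQ_B, 5)
BEST = best_segment_pair(SEQ_A, SEQ_B, 5)
```

    The number of segment pairs grows with the product of the two sequence lengths, which is one reason an exhaustive comparison of this kind is usually reserved for pairs of sequences rather than whole-database searches.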

  10. Zucchini yellow mosaic virus: biological properties, detection procedures and comparison of coat protein gene sequences.

    PubMed

    Coutts, B A; Kehoe, M A; Webster, C G; Wylie, S J; Jones, R A C

    2011-12-01

    Between 2006 and 2010, 5324 samples from at least 34 weed, two cultivated legume and 11 native species were collected from three cucurbit-growing areas in tropical or subtropical Western Australia. Two new alternative hosts of zucchini yellow mosaic virus (ZYMV) were identified, the Australian native cucurbit Cucumis maderaspatanus, and the naturalised legume species Rhyncosia minima. Low-level (0.7%) seed transmission of ZYMV was found in seedlings grown from seed collected from zucchini (Cucurbita pepo) fruit infected with isolate Cvn-1. Seed transmission was absent in >9500 pumpkin (C. maxima and C. moschata) seedlings from fruit infected with isolate Knx-1. Leaf samples from symptomatic cucurbit plants collected from fields in five cucurbit-growing areas in four Australian states were tested for the presence of ZYMV. When 42 complete coat protein (CP) nucleotide (nt) sequences from the new ZYMV isolates obtained were compared to those of 101 complete CP nt sequences from five other continents, phylogenetic analysis of the 143 ZYMV sequences revealed three distinct groups (A, B and C), with four subgroups in A (I-IV) and two in B (I-II). The new Australian sequences grouped according to collection location, fitting within A-I, A-II and B-II. The 16 new sequences from one isolated location in tropical northern Western Australia all grouped into subgroup B-II, which contained no other isolates. In contrast, the three sequences from the Northern Territory fitted into A-II with 94.6-99.0% nt identities with isolates from the United States, Iran, China and Japan. The 23 new sequences from the central west coast and two east coast locations all fitted into A-I, with 95.9-98.9% nt identities to sequences from Europe and Japan. These findings suggest that (i) there have been at least three separate ZYMV introductions into Australia and (ii) there are few changes to local isolate CP sequences following their establishment in remote growing areas. 
Isolates from A-I and B

  11. Comparison of the rotavirus nonstructural protein NSP1 (NS53) from different species by sequence analysis and northern blot hybridization.

    PubMed

    Dunn, S J; Cross, T L; Greenberg, H B

    1994-08-15

    The nucleotide sequence of gene 5 encoding the rotavirus nonstructural protein NSP1 (NS53) of 6 strains (EW, EHP, RRV, I321, OSU, and Gottfried) was determined and compared to 6 previously reported strains (SA11, UK, RF, Hu803, DS-1, and Wa). The 12 rotavirus strains were derived from a total of five separate species (murine, bovine, simian, porcine, and human). Gene sizes ranged from 1564 to 1611 nucleotides in length and the deduced protein sequences were found to be 486 to 495 amino acids in length. Comparisons of NSP1 amino acid sequences showed identities ranging from 36 to 92%. This diversity was most evident between strains from different species. Phylogenetic analysis revealed a clustering of NSP1 sequences according to species origin with the exception that the human and porcine strains were included in a single grouping. Northern blot hybridizations using additional rotavirus strains from the five species confirmed the grouping found by sequence analysis. The species specificity of NSP1 is consistent with the hypothesis that NSP1 plays a role in host range restriction. PMID:8030275

  12. Indigenous and introduced potyviruses of legumes and Passiflora spp. from Australia: biological properties and comparison of coat protein sequences

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Coat protein sequences of 33 Potyvirus isolates from legume and Passiflora spp. were sequenced to determine the identity of infecting viruses. Phylogenetic analysis of the sequences revealed the presence of seven distinct virus species....

  13. Protein sequence databases.

    PubMed

    Apweiler, Rolf; Bairoch, Amos; Wu, Cathy H

    2004-02-01

    A variety of protein sequence databases exist, ranging from simple sequence repositories, which store data with little or no manual intervention in the creation of the records, to expertly curated universal databases that cover all species and in which the original sequence data are enhanced by the manual addition of further information in each sequence record. As the focus of researchers moves from the genome to the proteins encoded by it, these databases will play an even more important role as central comprehensive resources of protein information. Several of the leading protein sequence databases are discussed here, with special emphasis on the databases now provided by the Universal Protein Knowledgebase (UniProt) consortium. PMID:15036160

  14. Trading accuracy for speed: A quantitative comparison of search algorithms in protein sequence design.

    PubMed

    Voigt, C A; Gordon, D B; Mayo, S L

    2000-06-01

    Finding the minimum energy amino acid side-chain conformation is a fundamental problem in both homology modeling and protein design. To address this issue, numerous computational algorithms have been proposed. However, there have been few quantitative comparisons between methods and there is very little general understanding of the types of problems that are appropriate for each algorithm. Here, we study four common search techniques: Monte Carlo (MC) and Monte Carlo plus quench (MCQ); genetic algorithms (GA); self-consistent mean field (SCMF); and dead-end elimination (DEE). Both SCMF and DEE are deterministic, and if DEE converges, it is guaranteed that its solution is the global minimum energy conformation (GMEC). This provides a means to compare the accuracy of SCMF and the stochastic methods. For the side-chain placement calculations, we find that DEE rapidly converges to the GMEC in all the test cases. The other algorithms converge on significantly incorrect solutions; the average fraction of incorrect rotamers for SCMF is 0.12, GA 0.09, and MCQ 0.05. For the protein design calculations, design positions are progressively added to the side-chain placement calculation until the time required for DEE diverges sharply. As the complexity of the problem increases, the accuracy of each method is determined so that the results can be extrapolated into the region where DEE is no longer tractable. We find that both SCMF and MCQ perform reasonably well on core calculations (fraction amino acids incorrect is SCMF 0.07, MCQ 0.04), but fail considerably on the boundary (SCMF 0.28, MCQ 0.32) and surface calculations (SCMF 0.37, MCQ 0.44). PMID:10835284
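
    The abstract does not say which dead-end elimination variant was used. As a minimal sketch, assuming the original DEE criterion (a rotamer r at position i is eliminated when its best-case energy is still higher than the worst-case energy of some alternative rotamer t at the same position), the pruning step can be written as shown here. The energy tables are made-up toy values and all names are hypothetical.

```python
from itertools import product

def pair_energy(pair_e, i, r, j, s):
    """Pairwise energy between rotamer r at position i and rotamer s at position j."""
    if i < j:
        return pair_e[(i, j)][r][s]
    return pair_e[(j, i)][s][r]

def total_energy(self_e, pair_e, conf):
    """Total energy of a conformation given as one rotamer index per position."""
    n = len(conf)
    energy = sum(self_e[i][conf[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_energy(pair_e, i, conf[i], j, conf[j])
    return energy

def dee_prune(self_e, pair_e):
    """Repeatedly apply the original DEE criterion until nothing more is eliminated.
    Returns one set of surviving rotamer indices per position."""
    n = len(self_e)
    alive = [set(range(len(rots))) for rots in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in sorted(alive[i]):
                # Best case for r: most favorable surviving partner at every other position.
                best_r = self_e[i][r] + sum(
                    min(pair_energy(pair_e, i, r, j, s) for s in alive[j]) for j in others)
                for t in sorted(alive[i] - {r}):
                    # Worst case for t: least favorable surviving partner everywhere.
                    worst_t = self_e[i][t] + sum(
                        max(pair_energy(pair_e, i, t, j, s) for s in alive[j]) for j in others)
                    if best_r > worst_t:
                        alive[i].discard(r)  # r cannot be part of the GMEC
                        changed = True
                        break
    return alive

def brute_force_gmec(self_e, pair_e, candidates):
    """Exhaustively find the lowest-energy conformation among candidate rotamers."""
    return min(product(*[sorted(c) for c in candidates]),
               key=lambda conf: total_energy(self_e, pair_e, conf))

# Toy energies: 3 positions, 2 rotamers each (illustrative values only).
SELF_E = [[0.0, 2.0], [1.0, 0.5], [0.0, 3.0]]
PAIR_E = {
    (0, 1): [[-1.0, 0.5], [0.2, 0.0]],
    (0, 2): [[0.0, 0.3], [0.1, -0.2]],
    (1, 2): [[-0.5, 0.0], [0.4, 0.1]],
}
ALIVE = dee_prune(SELF_E, PAIR_E)
GMEC = brute_force_gmec(SELF_E, PAIR_E, [set(range(len(r))) for r in SELF_E])
```

    Because the criterion only removes rotamers that provably cannot belong to the global minimum, an exhaustive search over the survivors finds the same minimum energy as a search over all rotamers, just over a smaller space.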

  15. Sequence Comparison and Phylogeny of Nucleotide Sequence of Coat Protein and Nucleic Acid Binding Protein of a Distinct Isolate of Shallot virus X from India.

    PubMed

    Majumder, S; Baranwal, V K

    2011-06-01

    Shallot virus X (ShVX), a type species in the genus Allexivirus of the family Alfaflexiviridae, has been associated with shallot plants in India and other shallot growing countries like Russia, Germany, the Netherlands, and New Zealand. Coat protein (CP) and nucleic acid binding protein (NB) regions of the virus were obtained by reverse transcriptase polymerase chain reaction from scale leaves of shallot bulbs. The partial cDNA contained two open reading frames encoding proteins of molecular weights of 28.66 and 14.18 kDa belonging to Flexi_CP super-family and viral NB super-family, respectively. The percent identity and phylogenetic analysis of amino acid sequences of CP and NB regions of the virus associated with shallot indicated that it was a distinct isolate of ShVX. PMID:23637504

  16. Protein Structure Comparison and Classification

    NASA Astrophysics Data System (ADS)

    Çamoǧlu, Orhan; Singh, Ambuj K.

    The success of genome projects has generated an enormous amount of sequence data. In order to realize the full value of the data, we need to understand its functional role and its evolutionary origin. Sequence comparison methods are incredibly valuable for this task. However, for sequences falling in the twilight zone (usually between 20 and 35% sequence similarity), we need to resort to structural alignment and comparison for a meaningful analysis. Such a structural approach can be used for classification of proteins, isolation of structural motifs, and discovery of drug targets.

  17. Indigenous and introduced potyviruses of legumes and Passiflora spp. from Australia: biological properties and comparison of coat protein nucleotide sequences.

    PubMed

    Coutts, Brenda A; Kehoe, Monica A; Webster, Craig G; Wylie, Stephen J; Jones, Roger A C

    2011-10-01

    Five Australian potyviruses, passion fruit woodiness virus (PWV), passiflora mosaic virus (PaMV), passiflora virus Y (PaVY), clitoria chlorosis virus (ClCV) and hardenbergia mosaic virus (HarMV), and two introduced potyviruses, bean common mosaic virus (BCMV) and cowpea aphid-borne mosaic virus (CAbMV), were detected in nine wild or cultivated Passiflora and legume species growing in tropical, subtropical or Mediterranean climatic regions of Western Australia. When ClCV (1), PaMV (1), PaVY (8) and PWV (5) isolates were inoculated to 15 plant species, PWV and two PaVY P. foetida isolates infected P. edulis and P. caerulea readily but legumes only occasionally. Another PaVY P. foetida isolate resembled five PaVY legume isolates in infecting legumes readily but not infecting P. edulis. PaMV resembled PaVY legume isolates in legumes but also infected P. edulis. ClCV did not infect P. edulis or P. caerulea and behaved differently from PaVY legume isolates and PaMV when inoculated to two legume species. When complete coat protein (CP) nucleotide (nt) sequences of 33 new isolates were compared with 41 others, PWV (8), HarMV (4), PaMV (1) and ClCV (1) were within a large group of Australian isolates, while PaVY (14), CAbMV (1) and BCMV (3) isolates were in three other groups. Variation among PWV and PaVY isolates was sufficient for division into four clades each (I-IV). A variable block of 56 amino acid residues at the N-terminal region of the CPs of PaMV and ClCV distinguished them from PWV. Comparison of PWV, PaMV and ClCV CP sequences showed that nt identities were both above and below the 76-77% potyvirus species threshold level. This research gives insights into invasion of new hosts by potyviruses at the natural vegetation and cultivated area interface, and illustrates the potential of indigenous viruses to emerge to infect introduced plants. PMID:21744001

  18. Protein sequence comparisons show that the 'pseudoproteases' encoded by poxviruses and certain retroviruses belong to the deoxyuridine triphosphatase family.

    PubMed Central

    McGeoch, D J

    1990-01-01

    Amino acid sequence comparisons show extensive similarities among the deoxyuridine triphosphatases (dUTPases) of Escherichia coli and of herpesviruses, and the 'protease-like' or 'pseudoprotease' sequences encoded by certain retroviruses in the oncovirus and lentivirus families and by poxviruses. These relationships suggest strongly that the 'pseudoproteases' actually are dUTPases, and have not arisen by duplication of an oncovirus protease gene as had been suggested. The herpesvirus dUTPase sequences differ from the others in that they are longer (about 370 residues, against around 140) and one conserved element ('Motif 3') is displaced relative to its position in the other sequences; a model involving internal duplication of the herpesvirus gene can account effectively for these observations. Sequences closely similar to Motif 3 are also found in phosphofructokinases, where they form part of the active site and fructose phosphate binding structure; thus these sequences may represent a class of structural element generally involved in phosphate transfer to and from glycosides. PMID:2165588

  19. Discrimination of Burkholderia mallei/pseudomallei from Burkholderia thailandensis by sequence comparison of a fragment of the ribosomal protein S21 (rpsU) gene

    PubMed Central

    Frickmann, H.; Chantratita, N.; Gauthier, Y. P.; Neubauer, H.; Hagen, R. M.

    2012-01-01

    Discrimination of Burkholderia (B.) pseudomallei and B. mallei from environmental B. thailandensis is challenging. We describe a discrimination method based on sequence comparison of the ribosomal protein S21 (rpsU) gene. The rpsU gene was sequenced in ten B. pseudomallei, six B. mallei, one B. thailandensis reference strains, six isolates of B. pseudomallei, and 37 of B. thailandensis. Further rpsU sequences of six B. pseudomallei, three B. mallei, and one B. thailandensis were identified via NCBI GenBank. Three to four variable base-positions were identified within a 120-base-pair fragment, allowing discrimination of the B. pseudomallei/mallei-cluster from B. thailandensis, whose sequences clustered identically. All B. mallei and three B. pseudomallei sequences were identical, while 17/22 B. pseudomallei strains differed in one nucleotide (78A>C). Sequences of the rpsU fragment of ‘out-stander’ reference strains of B. cepacia, B. gladioli, B. plantarii, and B. vietnamensis clustered differently. Sequence comparison of the described rpsU gene fragment can be used as a supplementary diagnostic procedure for the discrimination of B. mallei/pseudomallei from B. thailandensis as well as from other species of the genus Burkholderia, keeping in mind that it does not allow for a differentiation between B. mallei and B. pseudomallei. PMID:23227305

  20. Molecular cloning and sequence analysis of the Sta58 major antigen gene of Rickettsia tsutsugamushi: sequence homology and antigenic comparison of Sta58 to the 60-kilodalton family of stress proteins.

    PubMed Central

    Stover, C K; Marana, D P; Dasch, G A; Oaks, E V

    1990-01-01

    The scrub typhus 58-kilodalton (kDa) antigen (Sta58) of Rickettsia tsutsugamushi is a major protein antigen often recognized by humans infected with scrub typhus rickettsiae. A 2.9-kilobase HindIII fragment containing a complete sta58 gene was cloned in Escherichia coli and found to express the entire Sta58 antigen and a smaller protein with an apparent molecular mass of 11 kDa (Stp11). DNA sequence analysis of the 2.9-kilobase HindIII fragment revealed two adjacent open reading frames encoding proteins of 11 (Stp11) and 60 (Sta58) kDa. Comparisons of deduced amino acid sequences disclosed a high degree of homology between the R. tsutsugamushi proteins Stp11 and Sta58 and the E. coli proteins GroES and GroEL, respectively, and the family of primordial heat shock proteins designated Hsp10 and Hsp60. Although the sequence homology between the Sta58 antigen and the Hsp60 protein family is striking, the Sta58 protein appeared to be antigenically distinct among a sample of other bacterial Hsp60 homologs, including the typhus group of rickettsiae. The antigenic uniqueness of the Sta58 antigen indicates that this protein may be a potentially protective antigen and a useful diagnostic reagent for scrub typhus fever. PMID:2108930

  1. Sequence repeats and protein structure

    NASA Astrophysics Data System (ADS)

    Hoang, Trinh X.; Trovato, Antonio; Seno, Flavio; Banavar, Jayanth R.; Maritan, Amos

    2012-11-01

    Repeats are frequently found in known protein sequences. The level of sequence conservation in tandem repeats correlates with their propensities to be intrinsically disordered. We employ a coarse-grained model of a protein with a two-letter amino acid alphabet, hydrophobic (H) and polar (P), to examine the sequence-structure relationship in the realm of repeated sequences. A fraction of repeated sequences comprises a distinct class of bad folders, whose folding temperatures are much lower than those of random sequences. Imperfection in sequence repetition improves the folding properties of the bad folders while deteriorating those of the good folders. Our results may explain why nature has utilized repeated sequences for their versatility and especially to design functional proteins that are intrinsically unstructured at physiological temperatures.

  2. Sequence comparison of JSRV with endogenous proviruses: envelope genotypes and a novel ORF with similarity to a G-protein-coupled receptor.

    PubMed

    Bai, J; Bishop, J V; Carlson, J O; DeMartini, J C

    1999-06-01

    Ovine pulmonary carcinoma, a contagious lung cancer of sheep, is caused by the oncogenic jaagsiekte sheep retrovirus (JSRV) that is closely related to a family of endogenous sheep retroviral sequences (ESRVs). By using exogenous virus-specific U3 oligonucleotide primers, the entire JSRV proviral genome or its 3' part was amplified from tumor DNA. Analysis of these proviral sequences revealed a novel open reading frame (ORF) within the pol coding region, designated ORF X, which was well conserved in ESRV and JSRV sequences. Deduced amino acids of ORF X showed similarity to a portion of the mammalian adenosine receptor subtype 3, a member of the G-protein-coupled receptor family. Comparison of deduced env amino acids of six JSRV strains from three continents identified 15 residues that defined two distinct genotypes of JSRVs. Sequence analysis identified two highly variable regions between JSRV and ESRV in the transmembrane domain of env (TM) and the 3' unique sequence (U3) of the long terminal repeat, from which JSRV-specific DNA probes were derived. By using these DNA probes in Southern hybridization, for the first time we successfully identified JSRV proviral sequences in tumor genomic DNA in the presence of multiple ESRV loci, validating the use of exogenous virus-specific DNA probes in the analysis of oncogenic proviral integration sites and identification of integrated exogenous proviral sequences. PMID:10366570

  3. SEQOPTICS: a protein sequence clustering system

    PubMed Central

    Chen, Yonghui; Reilly, Kevin D; Sprague, Alan P; Guan, Zhijie

    2006-01-01

    Background: Protein sequence clustering has been widely used as a part of the analysis of protein structure and function. In most cases single linkage or graph-based clustering algorithms have been applied. OPTICS (Ordering Points To Identify the Clustering Structure) is an attractive approach due to its emphasis on visualization of results and support for interactive work, e.g., in choosing parameters. However, OPTICS has not been used, as far as we know, for protein sequence clustering. Results: In this paper, a system of clustering proteins, SEQOPTICS (SEQuence clustering with OPTICS) is demonstrated. The system is implemented with Smith-Waterman as protein distance measurement and OPTICS at its core to perform protein sequence clustering. SEQOPTICS is tested with four data sets from different data sources. Visualization of the sequence clustering structure is demonstrated as well. Conclusion: The system was evaluated by comparison with other existing methods. Analysis of the results demonstrates that SEQOPTICS performs better based on some evaluation criteria including Jaccard coefficient, Precision, and Recall. It is a promising protein sequence clustering method with future possible improvement on parallel computing and other protein distance measurements. PMID:17217502
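
    The abstract names the Jaccard coefficient, Precision, and Recall as evaluation criteria but does not define them. One common way to apply them to clusterings, assumed here for illustration, is pair counting: every pair of sequences is checked for whether it shares a cluster in the reference and in the predicted clustering. The function name and the example labels are hypothetical.

```python
from itertools import combinations

def pair_counting_scores(reference, predicted):
    """Compare two clusterings given as per-item cluster labels.
    Counts item pairs co-clustered in both, only in the prediction,
    or only in the reference, then derives the three scores."""
    both = pred_only = ref_only = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_pred = predicted[i] == predicted[j]
        if same_ref and same_pred:
            both += 1
        elif same_pred:
            pred_only += 1
        elif same_ref:
            ref_only += 1
    precision = both / (both + pred_only) if both + pred_only else 0.0
    recall = both / (both + ref_only) if both + ref_only else 0.0
    union = both + pred_only + ref_only
    jaccard = both / union if union else 0.0
    return {"precision": precision, "recall": recall, "jaccard": jaccard}

# Five sequences: reference families versus a hypothetical predicted clustering.
REFERENCE = [0, 0, 0, 1, 1]
PREDICTED = [0, 0, 1, 1, 1]
SCORES = pair_counting_scores(REFERENCE, PREDICTED)
```

    Pair counting avoids having to match cluster labels between the two clusterings, since only co-membership of pairs matters.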

  4. Comparisons of Ribosomal Protein Gene Promoters Indicate Superiority of Heterologous Regulatory Sequences for Expressing Transgenes in Phytophthora infestans

    PubMed Central

    Khachatoorian, Careen; Judelson, Howard S.

    2015-01-01

    Molecular genetics approaches in Phytophthora research can be hampered by the limited number of known constitutive promoters for expressing transgenes and the instability of transgene activity. We have therefore characterized genes encoding the cytoplasmic ribosomal proteins of Phytophthora and studied their suitability for expressing transgenes in P. infestans. Phytophthora spp. encode a standard complement of 79 cytoplasmic ribosomal proteins. Several genes are duplicated, and two appear to be pseudogenes. Half of the genes are expressed at similar levels during all stages of asexual development, and we discovered that the majority share a novel promoter motif named the PhRiboBox. This sequence is enriched in genes associated with transcription, translation, and DNA replication, including tRNA and rRNA biogenesis. Promoters from the three P. infestans genes encoding ribosomal proteins S9, L10, and L23 and their orthologs from P. capsici were tested for their ability to drive transgenes in stable transformants of P. infestans. Five of the six promoters yielded strong expression of a GUS reporter, but the stability of expression was higher using the P. capsici promoters. With the RPS9 and RPL10 promoters of P. infestans, about half of transformants stopped making GUS over two years of culture, while their P. capsici orthologs conferred stable expression. Since cross-talk between native and transgene loci may trigger gene silencing, we encourage the use of heterologous promoters in transformation studies. PMID:26716454

  5. Mining protein sequences for motifs.

    PubMed

    Narasimhan, Giri; Bu, Changsong; Gao, Yuan; Wang, Xuning; Xu, Ning; Mathee, Kalai

    2002-01-01

    We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein sequences. The dictionary is subsequently used to detect motifs in new protein sequences. Statistical significance of the detection results is ensured by statistically determining the various parameters of the algorithm. Based on this approach, we have implemented a program called GYM. The Helix-Turn-Helix motif was used as a model system on which to test our program. The program was also extended to detect Homeodomain motifs. The detection results for the two motifs compare favorably with existing programs. In addition, the GYM program provides a lot of useful information about a given protein sequence. PMID:12487759

  6. Mercury BLASTP: Accelerating Protein Sequence Alignment.

    PubMed

    Jacob, Arpith; Lancaster, Joseph; Buhler, Jeremy; Harris, Brandon; Chamberlain, Roger D

    2008-06-01

    Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more running time or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we describe the architecture of the portions of the application that are accelerated in the FPGA, and we also describe the integration of these FPGA-accelerated portions with the existing BLASTP software. We have implemented Mercury BLASTP on a commodity workstation with two Xilinx Virtex-II 6000 FPGAs. We show that the new design runs 11-15 times faster than software BLASTP on a modern CPU while delivering close to 99% identical results. PMID:19492068

  7. Mercury BLASTP: Accelerating Protein Sequence Alignment

    PubMed Central

    Jacob, Arpith; Lancaster, Joseph; Buhler, Jeremy; Harris, Brandon; Chamberlain, Roger D.

    2008-01-01

    Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more running time or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we describe the architecture of the portions of the application that are accelerated in the FPGA, and we also describe the integration of these FPGA-accelerated portions with the existing BLASTP software. We have implemented Mercury BLASTP on a commodity workstation with two Xilinx Virtex-II 6000 FPGAs. We show that the new design runs 11-15 times faster than software BLASTP on a modern CPU while delivering close to 99% identical results. PMID:19492068

  8. Spectral clustering of protein sequences

    PubMed Central

    Paccanaro, Alberto; Casbon, James A.; Saqi, Mansoor A. S.

    2006-01-01

    An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide theoretical justification of why it will have a remarkable improvement over local methods. We extensively tested our method and compared its performance with other local methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that the number of clusters that we obtain for a given set of proteins is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs. In our experiments, the quality of the clusters as quantified by a measure that combines sensitivity and specificity was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL]. PMID:16547200

  9. Sequence correlations shape protein promiscuity

    NASA Astrophysics Data System (ADS)

    Lukatsky, David B.; Afek, Ariel; Shakhnovich, Eugene I.

    2011-08-01

    We predict analytically that diagonal correlations of amino acid positions within protein sequences statistically enhance protein propensity for nonspecific binding. We use the term "promiscuity" to describe such nonspecific binding. Diagonal correlations represent statistically significant repeats of sequence patterns where amino acids of the same type are clustered together. The predicted effect is qualitatively robust with respect to the form of the microscopic interaction potentials and the average amino acid composition. Our analytical results provide an explanation for the enhanced diagonal correlations observed in hubs of eukaryotic organismal proteomes [J. Mol. Biol. 409, 439 (2011)], 10.1016/j.jmb.2011.03.056. We suggest experiments that will allow direct testing of the predicted effect.

  10. Rosetta stone method for detecting protein function and protein-protein interactions from genome sequences

    DOEpatents

    Eisenberg, David; Marcotte, Edward M.; Pellegrini, Matteo; Thompson, Michael J.; Yeates, Todd O.

    2002-10-15

    A computational method, system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B'. Another method compares the genomic sequence of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method a combination of the above two methods is used to predict functional links.
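
    The phylogenetic-profile idea in the record above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented method: the genome contents are toy data, the function names are hypothetical, and Hamming distance with a fixed cutoff stands in for whatever similarity criterion the patent actually uses. The Rosetta Stone (fusion) method is not covered here.

```python
def phylogenetic_profile(protein, genomes):
    """Presence/absence vector of `protein` across genomes (sorted by name).

    `genomes` maps a genome name to the set of protein family ids it contains.
    """
    return tuple(1 if protein in genomes[g] else 0 for g in sorted(genomes))


def profile_distance(p, q):
    """Hamming distance between two equally long profiles."""
    return sum(a != b for a, b in zip(p, q))


def linked_pairs(proteins, genomes, max_distance=0):
    """Pairs of proteins whose profiles differ in at most `max_distance` genomes."""
    profiles = {p: phylogenetic_profile(p, genomes) for p in proteins}
    names = sorted(proteins)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if profile_distance(profiles[a], profiles[b]) <= max_distance:
                pairs.append((a, b))
    return pairs


# Toy data (assumed for illustration): families A and B always co-occur.
toy_genomes = {
    "g1": {"A", "B", "C"},
    "g2": {"A", "B"},
    "g3": {"C"},
    "g4": {"A", "B", "C"},
}
print(linked_pairs(["A", "B", "C"], toy_genomes))
```

    Proteins with identical or near-identical profiles are candidates for a functional link; a real analysis would also need to discount uninformative profiles (present everywhere or nowhere).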

  11. Distinguishing Proteins From Arbitrary Amino Acid Sequences

    PubMed Central

    Yau, Stephen S.-T.; Mao, Wei-Guang; Benson, Max; He, Rong Lucy

    2015-01-01

    What kinds of amino acid sequences could possibly be protein sequences? From all existing databases that we can find, known proteins are only a small fraction of all possible combinations of amino acids. Beginning with Sanger's first detailed determination of a protein sequence in 1952, previous studies have focused on describing the structure of existing protein sequences in order to construct the protein universe. No one, however, has developed a criterion for determining whether an arbitrary amino acid sequence can be a protein. Here we show that when the collection of arbitrary amino acid sequences is viewed in an appropriate geometric context, the protein sequences cluster together. This leads to a new computational test, described here, that has proved to be remarkably accurate at determining whether an arbitrary amino acid sequence can be a protein. Even more, if the results of this test indicate that the sequence can be a protein, and it is indeed a protein sequence, then its identity as a protein sequence is uniquely defined. We anticipate our computational test will be useful for those who are attempting to complete the job of discovering all proteins, or constructing the protein universe. PMID:25609314

  12. A simple method for global sequence comparison.

    PubMed Central

    Pizzi, E; Attimonelli, M; Liuni, S; Frontali, C; Saccone, C

    1992-01-01

    A simple method of sequence comparison, based on a correlation analysis of oligonucleotide frequency distributions, is here shown to be a reliable test of overall sequence similarity. The method does not involve sequence alignment procedures and permits the rapid screening of large amounts of sequence data. It identifies those sequences which deserve more careful analysis of sequence similarity at the level of resolution of the single nucleotide. It uses observed quantities only and does not involve the adoption of any theoretical model. PMID:1738591
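
    As a rough illustration of alignment-free comparison by oligonucleotide frequencies, the sketch below counts all k-mers of two sequences and correlates the two frequency vectors. The choice of k = 3, the Pearson coefficient, and the function names are assumptions made for this example; the record does not specify the oligonucleotide length or the exact correlation statistic.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Relative frequency of every possible k-mer, in a fixed lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)  # ignores k-mers with non-ACGT symbols
    return [counts[m] / total if total else 0.0 for m in kmers]


def pearson(x, y):
    """Pearson correlation of two equally long vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def kmer_similarity(a, b, k=3):
    """Correlation between the k-mer frequency distributions of two sequences."""
    return pearson(kmer_frequencies(a, k), kmer_frequencies(b, k))


print(kmer_similarity("ACGTACGGTCAAGT", "ACGTACGGTCAAGT"))
print(kmer_similarity("ACGTACGGTCAAGT", "TTTTTTTTGGGGGG"))
```

    Because no alignment is computed, the cost grows linearly with sequence length, which is what makes this kind of screen suitable as a first pass before a nucleotide-level comparison.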

  13. The PIR-International Protein Sequence Database.

    PubMed Central

    George, D G; Barker, W C; Mewes, H W; Pfeiffer, F; Tsugita, A

    1994-01-01

    PIR-International is an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. A major objective of PIR-International is to continue the development of the Protein Sequence Database as an essential public resource for protein sequence information. This paper briefly describes the architecture of the Protein Sequence Database and how it and associated data sets are distributed and can be accessed electronically. PMID:7937060

  14. Method and apparatus for biological sequence comparison

    DOEpatents

    Marr, T.G.; Chang, W.I.

    1997-12-23

    A method and apparatus are disclosed for comparing biological sequences from a known source of sequences, with a subject (query) sequence. The apparatus takes as input a set of target similarity levels (such as evolutionary distances in units of PAM), and finds all fragments of known sequences that are similar to the subject sequence at each target similarity level, and are long enough to be statistically significant. The invention device filters out fragments from the known sequences that are too short, or have a lower average similarity to the subject sequence than is required by each target similarity level. The subject sequence is then compared only to the remaining known sequences to find the best matches. The filtering member divides the subject sequence into overlapping blocks, each block being sufficiently large to contain a minimum-length alignment from a known sequence. For each block, the filter member compares the block with every possible short fragment in the known sequences and determines a best match for each comparison. The determined set of short fragment best matches for the block provides an upper threshold on alignment values. Regions of a certain length from the known sequences that have a mean alignment value upper threshold greater than a target unit score are concatenated to form a union. The current block is compared to the union and provides an indication of best local alignment with the subject sequence. 5 figs.

  15. Method and apparatus for biological sequence comparison

    DOEpatents

    Marr, Thomas G.; Chang, William I-Wei

    1997-01-01

    A method and apparatus are described for comparing biological sequences from a known source of sequences, with a subject (query) sequence. The apparatus takes as input a set of target similarity levels (such as evolutionary distances in units of PAM), and finds all fragments of known sequences that are similar to the subject sequence at each target similarity level, and are long enough to be statistically significant. The invention device filters out fragments from the known sequences that are too short, or have a lower average similarity to the subject sequence than is required by each target similarity level. The subject sequence is then compared only to the remaining known sequences to find the best matches. The filtering member divides the subject sequence into overlapping blocks, each block being sufficiently large to contain a minimum-length alignment from a known sequence. For each block, the filter member compares the block with every possible short fragment in the known sequences and determines a best match for each comparison. The determined set of short fragment best matches for the block provides an upper threshold on alignment values. Regions of a certain length from the known sequences that have a mean alignment value upper threshold greater than a target unit score are concatenated to form a union. The current block is compared to the union and provides an indication of best local alignment with the subject sequence.

  16. Natural protein sequences are more intrinsically disordered than random sequences.

    PubMed

    Yu, Jia-Feng; Cao, Zanxia; Yang, Yuedong; Wang, Chun-Ling; Su, Zhen-Dong; Zhao, Ya-Wei; Wang, Ji-Hua; Zhou, Yaoqi

    2016-08-01

    Most natural protein sequences have resulted from millions or even billions of years of evolution. How they differ from random sequences is not fully understood. Previous computational and experimental studies of random proteins generated from noncoding regions yielded inconclusive results due to species-dependent codon biases and GC contents. Here, we approach this problem by investigating 10,000 sequences randomized at the amino acid level. Using well-established predictors for protein intrinsic disorder, we found that natural sequences have more long disordered regions than random sequences, even when random and natural sequences have the same overall composition of amino acid residues. We also showed that random sequences are as structured as natural sequences according to contents and length distributions of predicted secondary structure, although the structures from random sequences may be in a molten globular-like state, according to molecular dynamics simulations. The bias of natural sequences toward more intrinsic disorder suggests that natural sequences are created and evolved to avoid protein aggregation and increase functional diversity. PMID:26801222

  17. A new graphical representation of protein sequences and its applications

    NASA Astrophysics Data System (ADS)

    Hou, Wenbing; Pan, Qiuhui; He, Mingfeng

    2016-02-01

    Sequence analysis is one of the foundations of bioinformatics because of the abundant information hidden in sequences. It helps scientists study the function of DNA, proteins and cells. In this paper, we outline a novel method for protein sequence similarity analysis based on the physicochemical properties of amino acids. We treat the protein sequence as a rigid body with mass. We then introduce the moment of inertia into the calculation of sequence similarity, and the sequences are transformed into vectors by the moment of inertia tensor. The Euclidean distance is employed as a measure of similarity. Finally, comparison with results from other references shows that our approach is reasonable and effective.

  18. Critical assessment of sequence-based protein-protein interaction prediction methods that do not require homologous protein sequences

    PubMed Central

    2009-01-01

    Background: Protein-protein interactions underlie many important biological processes. Computational prediction methods can nicely complement experimental approaches for identifying protein-protein interactions. Recently, a unique category of sequence-based prediction methods has been put forward - unique in the sense that it does not require homologous protein sequences. This enables it to be universally applicable to all protein sequences unlike many of previous sequence-based prediction methods. If effective as claimed, these new sequence-based, universally applicable prediction methods would have far-reaching utilities in many areas of biology research. Results: Upon close survey, I realized that many of these new methods were ill-tested. In addition, newer methods were often published without performance comparison with previous ones. Thus, it is not clear how good they are and whether there are significant performance differences among them. In this study, I have implemented and thoroughly tested 4 different methods on large-scale, non-redundant data sets. It reveals several important points. First, significant performance differences are noted among different methods. Second, data sets typically used for training prediction methods appear significantly biased, limiting the general applicability of prediction methods trained with them. Third, there is still ample room for further developments. In addition, my analysis illustrates the importance of complementary performance measures coupled with right-sized data sets for meaningful benchmark tests. Conclusions: The current study reveals the potentials and limits of the new category of sequence-based protein-protein interaction prediction methods, which in turn provides a firm ground for future endeavours in this important area of contemporary bioinformatics. PMID:20003442

  19. Detecting frame shifts by amino acid sequence comparison.

    PubMed

    Claverie, J M

    1993-12-20

    Various amino acid substitution scoring matrices are used in conjunction with local alignment programs to detect regions of similarity and infer potential common ancestry between proteins. The usual scoring schemes derive from the implicit hypothesis that related proteins evolve from a common ancestor by the accumulation of point mutations and that amino acids tend to be progressively substituted by others with similar properties. However, other frequent single mutation events, like nucleotide insertion or deletion and gene inversion, change the translation reading frame and cause previously encoded amino acid sequences to become unrecognizable at once. Here, I derive five new types of scoring matrix, each capable of detecting a specific frame shift (deletion, insertion and inversion in 3 frames) and use them with a regular local alignment program to detect amino acid sequences that may have derived from alternative reading frames of the same nucleotide sequence. Frame shifts are inferred from the sole comparison of the protein sequences. The five scoring matrices were used with the BLASTP program to compare all the protein sequences in the Swissprot database. Surprisingly, the searches revealed hundreds of highly significant frame shift matches, of which many are likely to represent sequencing errors. Others provide some evidence that frame shift mutations might be used in protein evolution as a way to create new amino acid sequences from pre-existing coding regions. PMID:7903399

  20. Effects of hepatitis C virus on suppressor of cytokine signaling mRNA levels: comparison between different genotypes and core protein sequence analysis.

    PubMed

    Pascarella, Stéphanie; Clément, Sophie; Guilloux, Kévin; Conzelmann, Stéphanie; Penin, François; Negro, Francesco

    2011-06-01

    Glucose metabolism disturbances, including insulin resistance and type 2 diabetes, are frequent and important cofactors of hepatitis C. Increasing epidemiological and experimental data suggest that all major genotypes of hepatitis C virus (HCV), albeit to a different extent, cause insulin resistance. The HCV core protein has been shown to be sufficient to impair insulin signaling in vitro through several post-receptorial mechanisms, mostly via the activation of suppressor of cytokine signaling (SOCS) family members and the consequent decrease of insulin receptor substrate-1 (IRS-1). The levels of IRS-1 and SOCS were investigated upon expression of the core protein of HCV genotypes 1-4. Furthermore, the core protein sequences were analyzed to identify the amino acid residues responsible for IRS-1 decrease, with particular regard to SOCS mRNA deregulation. The results suggest that the activation of SOCS family members is a general mechanism associated with the common HCV genotypes. A rare genotype 1b variant, however, failed to activate any of the SOCS tested: this allowed a detailed analysis of the distinct amino acid sequences responsible for SOCS deregulation. By combining approaches using intergenotypic chimeras and site-directed mutagenesis, genetic evidence was provided in favor of a role of amino acids 49 and 131 of the HCV core-encoding sequence in mediating SOCS transactivation. PMID:21503913

  1. Turning yeast sequence into protein function

    SciTech Connect

    Heijne, G. von

    1996-04-01

    The complete genome sequencing of the yeast Saccharomyces cerevisiae leads us into a new era of potential use for such database information. Protein engineering studies suggest that genetic selection of overproducing strains may aid the assignment of protein function. Database management and sequencing software have been developed to scan entire genomes.

  2. Recently published protein sequences. I.

    NASA Technical Reports Server (NTRS)

    Jukes, T. H.; Holmquist, R.

    1972-01-01

    Some polypeptide sequences that have been published in the 1972 scientific literature are listed. Only selected sequences are included. The compilation has two objectives. Current information between periods when more comprehensive compilations are published is to be assembled and the use of data that do not include arrangements of unsequenced peptides for 'maximum homology' is to be encouraged.

  3. Fold homology detection using sequence fragment composition profiles of proteins.

    PubMed

    Solis, Armando D; Rackovsky, Shalom R

    2010-10-01

    The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so-called "twilight zone" problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment-free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20-letter amino acid alphabet) into a more tractable number of reduced tetramers (approximately 15-30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database that share low pairwise sequence similarity. Using the receiver-operating characteristic measure, we demonstrate potentially significant improvement in using information-optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the "twilight zone". PMID:20635424
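
    The compression step described above (20^4 = 160,000 raw tetramers reduced to a few dozen) can be illustrated with a deliberately crude reduced alphabet. The two-class hydrophobic/other grouping below is a hypothetical stand-in chosen so that 2^4 = 16 reduced tetramers result; it is not the information-optimized clustering derived in the paper, and the function names are invented for this sketch.

```python
from collections import Counter
from itertools import product

# Hypothetical two-class grouping, assumed for illustration only
# (the paper optimizes its amino acid clusters for mutual information).
HYDROPHOBIC = set("AVLIMFWC")


def reduce_sequence(seq):
    """Map each residue to 'h' (hydrophobic) or 'p' (everything else)."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in seq.upper())


def tetramer_profile(seq):
    """Relative frequency of each of the 16 reduced tetramers in `seq`."""
    reduced = reduce_sequence(seq)
    counts = Counter(reduced[i:i + 4] for i in range(len(reduced) - 3))
    keys = ["".join(t) for t in product("hp", repeat=4)]
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in keys}


print(20 ** 4)                     # number of raw tetramers over 20 letters
print(reduce_sequence("MKVL"))     # reduced form of a short peptide
profile = tetramer_profile("MKVLAAGIVELLKK")
print(len(profile))                # number of reduced tetramers
```

    Two proteins can then be compared by any vector distance between their profiles, without aligning them.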

  4. Protein Sequencing with Tandem Mass Spectrometry

    NASA Astrophysics Data System (ADS)

    Ziady, Assem G.; Kinter, Michael

    The recent introduction of electrospray ionization techniques that are suitable for peptides and whole proteins has allowed for the design of mass spectrometric protocols that provide accurate sequence information for proteins. The advantages gained by these approaches over traditional Edman Degradation sequencing include faster analysis and femtomole, sometimes attomole, sensitivity. The ability to efficiently identify proteins has allowed investigators to conduct studies on their differential expression or modification in response to various treatments or disease states. In this chapter, we discuss the use of electrospray tandem mass spectrometry, a technique whereby protein-derived peptides are subjected to fragmentation in the gas phase, revealing sequence information for the protein. This powerful technique has been instrumental for the study of proteins and markers associated with various disorders, including heart disease, cancer, and cystic fibrosis. We use the study of protein expression in cystic fibrosis as an example.

  5. Adaptive seeds tame genomic sequence comparison.

    PubMed

    Kiełbasa, Szymon M; Wan, Raymond; Sato, Kengo; Horton, Paul; Frith, Martin C

    2011-03-01

    The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billionbase DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition. PMID:21209072
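
    The adaptive-seed idea, lengthening a match until it becomes rare enough in the reference, can be sketched as follows. This is an illustrative toy under stated assumptions, not LAST's implementation: occurrences are counted by naive string search rather than with an index structure, and `max_freq` and the function names are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of `pattern` in `text`."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def adaptive_seed(query, start, reference, max_freq=2):
    """Shortest match beginning at query[start] that is rare in `reference`.

    The match is extended one letter at a time until it occurs at most
    `max_freq` times in the reference. Returns None if the extension stops
    matching before it becomes rare, or if the query runs out.
    """
    for end in range(start + 1, len(query) + 1):
        n = count_occurrences(reference, query[start:end])
        if n == 0:
            return None
        if n <= max_freq:
            return query[start:end]
    return None


reference = "ACGTACGAACGT"
print(adaptive_seed("ACGA", 0, reference, max_freq=2))
print(adaptive_seed("ACGA", 0, reference, max_freq=3))
```

    In a repeat-rich region the seed grows longer before it is accepted, while in a low-copy region a short seed suffices; this caps the number of seed hits per query position regardless of composition.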

  6. Vibrio cholerae O395 tcpA pilin gene sequence and comparison of predicted protein structural features to those of type 4 pilins.

    PubMed Central

    Shaw, C E; Taylor, R K

    1990-01-01

    Vibrio cholerae O1 expresses a pilus that is coordinately regulated with cholera toxin production and hence termed TCP, for toxin-coregulated pilus. Insertion of Tn5 IS50L::phoA (TnphoA) into the major pilin subunit gene, tcpA, has previously been shown to render the strain avirulent as a result of its inability to colonize. One such insertion was isolated and used as a probe to screen for clones containing the intact tcpA gene. The DNA sequence of tcpA was determined by using the intact gene and several tcpA-phoA gene fusions. The deduced protein sequence agreed completely with that previously determined for the TcpA N terminus and with the size of the mature pilin protein. The reported homology with N-methylphenylalanine (type 4) pilins near the N terminus was extended and shown to include components of the atypical leader peptide as well as overall predicted structural similarities in other regions of the pilins. In contrast to the modified N-terminal phenylalanine residue found in all characterized type 4 pilins, the corresponding position in tcpA contains a Met codon, thus implying that the previously uncharacterized amino acid corresponding to the N-terminal position of the mature TcpA pilin is a modified form of methionine. Except for this difference, mature TcpA has the overall predicted structural motifs shared among type 4 pilins. PMID:1974887

  7. Classification and identification of geminiviruses using sequence comparisons.

    PubMed

    Padidam, M; Beachy, R N; Fauquet, C M

    1995-02-01

    The genomes and ORFs of 36 geminiviruses were compared to obtain phylogenetic trees and frequency distributions of all possible pairwise comparisons with an objective to classify geminiviruses. Such comparisons show that geminiviruses form two distinct clusters of leafhopper-transmitted viruses that infect monocots (subgroup I) and whitefly-transmitted viruses that infect dicots (subgroup III), irrespective of the part of the genome considered. Of the two leafhopper-transmitted viruses that infect dicots, tobacco yellow dwarf virus has a sequence most similar to subgroup I viruses, and that of beet curly top virus differed depending upon the ORF considered. The distributions of identities within subgroups are significantly different suggesting that the taxonomic status of a particular isolate within a subgroup can be quantified. All the recognized strains of any one virus have greater than 90% sequence identity. It was observed that the 200 nucleotide intercistronic regions of geminiviruses are more variable than the remainder of the genome. The amino acid sequences of the coat protein (CP) of subgroup III viruses are more conserved than the remainder of the genome. However, a short N-terminal region (60-70 amino acids) of the CP is more variable than the rest of the CP sequence and is a close representation of the genome. PCR primers based on conserved sequences can be used to clone and sequence the N-terminal sequences of the CP of the geminiviruses; this sequence is sufficient to classify a virus isolate. A possible taxonomic structure for geminiviruses is proposed after considering the sequence comparisons and biological properties. PMID:7844548

  8. Sequencing proteins with transverse ionic transport

    NASA Astrophysics Data System (ADS)

    Boynton, Paul; di Ventra, Massimiliano

    2015-03-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms. By obtaining the order of the amino acids that compose a given protein one can determine both its secondary and tertiary structures through protein structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer's Disease. Mass spectrometry is the current technique of choice for de novo sequencing, but because some amino acids have the same mass the sequence cannot be completely determined in many cases. In this paper we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel, similar to that proposed for DNA sequencing. Indeed, we find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique's potential for de novo protein sequencing.

  9. Protein structure prediction from sequence variation

    PubMed Central

    Marks, Debora S; Hopf, Thomas A; Sander, Chris

    2015-01-01

    Genomic sequences contain rich evolutionary information about functional constraints on macromolecules such as proteins. This information can be efficiently mined to detect evolutionary couplings between residues in proteins and address the long-standing challenge to compute protein three-dimensional structures from amino acid sequences. Substantial progress has recently been made on this problem owing to the explosive growth in available sequences and the application of global statistical methods. In addition to three-dimensional structure, the improved understanding of covariation may help identify functional residues involved in ligand binding, protein-complex formation and conformational changes. We expect computation of covariation patterns to complement experimental structural biology in elucidating the full spectrum of protein structures, their functional interactions and evolutionary dynamics. PMID:23138306

  10. PROCAIN: protein profile comparison with assisting information

    PubMed Central

    Wang, Yong; Sadreyev, Ruslan I.; Grishin, Nick V.

    2009-01-01

    Detection of remote sequence homology is essential for the accurate inference of protein structure, function and evolution. The most sensitive detection methods involve the comparison of evolutionary patterns reflected in multiple sequence alignments (MSAs) of protein families. We present PROCAIN, a new method for MSA comparison based on the combination of ‘vertical’ MSA context (substitution constraints at individual sequence positions) and ‘horizontal’ context (patterns of residue content at multiple positions). Based on a simple and tractable profile methodology and primitive measures for the similarity of horizontal MSA patterns, the method achieves the quality of homology detection comparable to a more complex advanced method employing hidden Markov models (HMMs) and secondary structure (SS) prediction. Adding SS information further improves PROCAIN performance beyond the capabilities of current state-of-the-art tools. The potential value of the method for structure/function predictions is illustrated by the detection of subtle homology between evolutionary distant yet structurally similar protein domains. ProCAIn, relevant databases and tools can be downloaded from: http://prodata.swmed.edu/procain/download. The web server can be accessed at http://prodata.swmed.edu/procain/procain.php. PMID:19357092

  11. PROCAIN: protein profile comparison with assisting information.

    PubMed

    Wang, Yong; Sadreyev, Ruslan I; Grishin, Nick V

    2009-06-01

    Detection of remote sequence homology is essential for the accurate inference of protein structure, function and evolution. The most sensitive detection methods involve the comparison of evolutionary patterns reflected in multiple sequence alignments (MSAs) of protein families. We present PROCAIN, a new method for MSA comparison based on the combination of 'vertical' MSA context (substitution constraints at individual sequence positions) and 'horizontal' context (patterns of residue content at multiple positions). Based on a simple and tractable profile methodology and primitive measures for the similarity of horizontal MSA patterns, the method achieves a quality of homology detection comparable to that of a more complex, advanced method employing hidden Markov models (HMMs) and secondary structure (SS) prediction. Adding SS information further improves PROCAIN performance beyond the capabilities of current state-of-the-art tools. The potential value of the method for structure/function predictions is illustrated by the detection of subtle homology between evolutionarily distant yet structurally similar protein domains. PROCAIN, relevant databases and tools can be downloaded from: http://prodata.swmed.edu/procain/download. The web server can be accessed at http://prodata.swmed.edu/procain/procain.php. PMID:19357092

  12. Sequence information signal processor for local and global string comparisons

    DOEpatents

    Peterson, John C.; Chow, Edward T.; Waterman, Michael S.; Hunkapillar, Timothy J.

    1997-01-01

    A sequence information signal processing integrated circuit chip is designed to perform high-speed calculation of a dynamic programming algorithm based upon the algorithm defined by Waterman and Smith. The signal processing chip of the present invention is designed to be a building block of a linear systolic array, the performance of which can be increased by connecting additional sequence information signal processing chips to the array. The chip provides a high-speed, low-cost linear array processor that can locate highly similar global sequences or segments thereof, such as contiguous subsequences from two different DNA or protein sequences. The chip is implemented in a preferred embodiment using CMOS VLSI technology to provide the equivalent of about 400,000 transistors or 100,000 gates. Each chip provides 16 processing elements and is designed to provide 16-bit, two's complement operation for a maximum score precision of between -32,768 and +32,767. It is designed to provide a comparison between sequences as long as 4,194,304 elements without external software, and between sequences of unlimited numbers of elements with the aid of external software. Each sequence can be assigned different deletion and insertion weight functions. Each processor is provided with a similarity measure device which is independently variable. Thus, each processor can contribute to maximum value score calculation using a different similarity measure.
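
    The dynamic programming recurrence that such hardware accelerates can be sketched in software as below. This is a plain Smith-Waterman local alignment score with a linear gap penalty, not the patented circuit. The scoring values (match, mismatch, gap) are illustrative assumptions, and the upper clamp only mimics the signed 16-bit score range quoted in the abstract.

```python
INT16_MAX = 32767  # upper end of the signed 16-bit range quoted in the abstract


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    The scoring parameters are illustrative assumptions, not values
    taken from the patent.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            curr[j] = min(score, INT16_MAX)  # saturate like a 16-bit register
            best = max(best, curr[j])
        prev = curr
    return best


score = smith_waterman_score("XXGATTACAYY", "ZZGATTACAWW")
```

    Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time. That independence is what makes a linear array of processing elements a natural fit for this recurrence.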

  13. The DynaMine webserver: predicting protein dynamics from sequence.

    PubMed

    Cilia, Elisa; Pancsa, Rita; Tompa, Peter; Lenaerts, Tom; Vranken, Wim F

    2014-07-01

    Protein dynamics are important for understanding protein function. Unfortunately, accurate protein dynamics information is difficult to obtain. Here we present the DynaMine webserver, which provides predictions for the fast backbone movements of proteins directly from their amino-acid sequence. DynaMine rapidly produces a profile describing the statistical potential for such movements at residue-level resolution. The predicted values have meaning on an absolute scale and go beyond the traditional binary classification of residues as ordered or disordered, thus allowing for direct dynamics comparisons between protein regions. Through this webserver, we provide molecular biologists with an efficient and easy-to-use tool for predicting the dynamical characteristics of any protein of interest, even in the absence of experimental observations. The prediction results are visualized and can be directly downloaded. The DynaMine webserver, including instructive examples describing the meaning of the profiles, is available at http://dynamine.ibsquare.be. PMID:24728994

  14. Structural alphabets for protein structure classification: a comparison study.

    PubMed

    Le, Quan; Pollastri, Gianluca; Koehl, Patrice

    2009-03-27

    Finding structural similarities between proteins often helps reveal shared functionality, which otherwise might not be detected by native sequence information alone. Such similarity is usually detected and quantified by protein structure alignment. Determining the optimal alignment between two protein structures, however, remains a hard problem. An alternative approach is to approximate each three-dimensional protein structure using a sequence of motifs derived from a structural alphabet. Using this approach, structure comparison is performed by comparing the corresponding motif sequences or structural sequences. In this article, we measure the performance of such alphabets in the context of the protein structure classification problem. We consider both local and global structural sequences. Each letter of a local structural sequence corresponds to the best matching fragment to the corresponding local segment of the protein structure. The global structural sequence is designed to generate the best possible complete chain that matches the full protein structure. We use an alphabet of 20 letters, corresponding to a library of 20 motifs or protein fragments having four residues. We show that the global structural sequences approximate well the native structures of proteins, with an average coordinate root mean square of 0.69 Å over 2225 test proteins. The approximation is best for all alpha-proteins, while relatively poorer for all beta-proteins. We then test the performance of four different sequence representations of proteins (their native sequence, the sequence of their secondary-structure elements, and the local and global structural sequences based on our fragment library) with different classifiers in their ability to classify proteins that belong to five distinct folds of CATH. Unsurprisingly, the primary sequence alone performs poorly as a structure classifier. We show that addition of either secondary-structure information or local information from the

  15. HPMV: human protein mutation viewer - relating sequence mutations to protein sequence architecture and function changes.

    PubMed

    Sherman, Westley Arthur; Kuchibhatla, Durga Bhavani; Limviphuvadh, Vachiranee; Maurer-Stroh, Sebastian; Eisenhaber, Birgit; Eisenhaber, Frank

    2015-10-01

    Next-generation sequencing advances are rapidly expanding the number of human mutations to be analyzed for causative roles in genetic disorders. Our Human Protein Mutation Viewer (HPMV) is intended to explore the biomolecular mechanistic significance of non-synonymous human mutations in protein-coding genomic regions. The tool helps to assess whether protein mutations affect the occurrence of sequence-architectural features (globular domains, targeting signals, post-translational modification sites, etc.). As input, HPMV accepts protein mutations as UniProt accessions with mutations (e.g. HGVS nomenclature), genome coordinates, or FASTA sequences. As output, HPMV provides an interactive cartoon showing the mutations in relation to elements of the sequence architecture. A large variety of protein sequence architectural features were selected for their particular relevance to mutation interpretation. Clicking a sequence feature in the cartoon expands a tree view of additional information including multiple sequence alignments of conserved domains and a simple 3D viewer mapping the mutation to known PDB structures, if available. The cartoon is also correlated with a multiple sequence alignment of similar sequences from other organisms. In cases where a mutation is likely to have a straightforward interpretation (e.g. a point mutation disrupting a well-understood targeting signal), this interpretation is suggested. The interactive cartoon can be downloaded as a standalone viewer in Java jar format to be saved and viewed later with only a standard Java runtime environment. The HPMV website is: http://hpmv.bii.a-star.edu.sg/. PMID:26503432

  16. Sequence Motifs in MADS Transcription Factors Responsible for Specificity and Diversification of Protein-Protein Interaction

    PubMed Central

    van Dijk, Aalt D. J.; Morabito, Giuseppa; Fiers, Martijn; van Ham, Roeland C. H. J.; Angenent, Gerco C.; Immink, Richard G. H.

    2010-01-01

    Protein sequences encompass tertiary structures and contain information about specific molecular interactions, which in turn determine biological functions of proteins. Knowledge about how protein sequences define interaction specificity is largely missing, in particular for paralogous protein families with high sequence similarity, such as the plant MADS domain transcription factor family. In comparison to the situation in mammalian species, this important family of transcription regulators has expanded enormously in plant species and contains over 100 members in the model plant species Arabidopsis thaliana. Here, we provide insight into the mechanisms that determine protein-protein interaction specificity for the Arabidopsis MADS domain transcription factor family, using an integrated computational and experimental approach. Plant MADS proteins have highly similar amino acid sequences, but their dimerization patterns vary substantially. Our computational analysis uncovered small sequence regions that explain observed differences in dimerization patterns with reasonable accuracy. Furthermore, we show the usefulness of the method for prediction of MADS domain transcription factor interaction networks in other plant species. Introduction of mutations in the predicted interaction motifs demonstrated that single amino acid mutations can have a large effect and lead to loss or gain of specific interactions. In addition, various performed bioinformatics analyses shed light on the way evolution has shaped MADS domain transcription factor interaction specificity. Identified protein-protein interaction motifs appeared to be strongly conserved among orthologs, indicating their evolutionary importance. We also provide evidence that mutations in these motifs can be a source for sub- or neo-functionalization. The analyses presented here take us a step forward in understanding protein-protein interactions and the interplay between protein sequences and network evolution. 

  17. The PIR-International Protein Sequence Database.

    PubMed

    George, D G; Barker, W C; Mewes, H W; Pfeiffer, F; Tsugita, A

    1996-01-01

    From its origin the Protein Sequence Database has been designed to support research and has focused on comprehensive coverage, quality control and organization of the data in accordance with biological principles. Since 1988 the database has been maintained collaboratively within the framework of PIR-International, an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. The database is widely distributed and is available on the World Wide Web, via ftp, email server, on CD-ROM and magnetic media. It is widely redistributed and incorporated into many other protein sequence data compilations, including SWISS-PROT and the Entrez system of the NCBI. PMID:8594572

  18. Dimeric 3-phosphoglycerate kinases from hyperthermophilic Archaea. Cloning, sequencing and expression of the 3-phosphoglycerate kinase gene of Pyrococcus woesei in Escherichia coli and characterization of the protein. Structural and functional comparison with the 3-phosphoglycerate kinase of Methanothermus fervidus.

    PubMed

    Hess, D; Krüger, K; Knappik, A; Palm, P; Hensel, R

    1995-10-01

    The gene coding for the 3-phosphoglycerate kinase (EC 2.7.2.3) of Pyrococcus woesei was cloned and sequenced. The gene sequence comprises 1230 bp coding for a polypeptide with the theoretical M(r) of 46,195. The deduced protein sequence exhibits a high similarity (46.1% and 46.6% identity) to the other known archaeal 3-phosphoglycerate kinases of Methanobacterium bryantii and Methanothermus fervidus [Fabry, S., Heppner, P., Dietmaier, W. & Hensel, R. (1990) Gene 91, 19-25]. By comparing the 3-phosphoglycerate kinase sequences of the mesophilic and the two thermophilic Archaea, trends in thermoadaptation were confirmed that could be deduced from comparisons of glyceraldehyde-3-phosphate dehydrogenase sequences from the same organisms [Zwickl, P., Fabry, S., Bogedain, C., Haas, A. & Hensel, R. (1990) J. Bacteriol. 172, 4329-4338]. With increasing temperature the average hydrophobicity and the portion of aromatic residues increases, whereas the chain flexibility as well as the content in chemically labile residues (Asn, Cys) decreases. To study the phenotypic properties of the 3-phosphoglycerate kinases from thermophilic Archaea in more detail, the 3-phosphoglycerate kinase genes from P. woesei and M. fervidus were expressed in Escherichia coli. Comparisons of kinetic and molecular properties of the enzymes from the original organisms and from E. coli indicate that the proteins expressed in the mesophilic host are folded correctly. Besides their higher thermostability according to their origin from hyperthermophilic organisms, both enzymes differ from their bacterial and eucaryotic homologues mainly in two respects. (a) The 3-phosphoglycerate kinases from P. woesei and M. fervidus are homomeric dimers in their native state contrary to all other known 3-phosphoglycerate kinases, which are monomers including the enzyme from the mesophilic Archaeum M. bryantii. 
    (b) Monovalent cations are essential for the activity of both archaeal enzymes with K+ being significantly more

  19. Predicting protein-protein interactions based only on sequences information.

    PubMed

    Shen, Juwen; Zhang, Jian; Luo, Xiaomin; Zhu, Weiliang; Yu, Kunqian; Chen, Kaixian; Li, Yixue; Jiang, Hualiang

    2007-03-13

    Protein-protein interactions (PPIs) are central to most biological processes. Although efforts have been devoted to the development of methodology for predicting PPIs and protein interaction networks, the application of most existing methods is limited because they need information about protein homology or the interaction marks of the protein partners. In the present work, we propose a method for PPI prediction using only the information of protein sequences. The method is based on a learning algorithm, the support vector machine (SVM), combined with a kernel function and a conjoint triad feature for describing amino acids. More than 16,000 diverse PPI pairs were used to construct the universal model. The prediction ability of our approach is better than that of other sequence-based PPI prediction methods because it is able to predict PPI networks. Different types of PPI networks have been effectively mapped with our method, suggesting that, even with only sequence information, this method could be applied to the exploration of networks for any newly discovered protein with unknown biological relativity. In addition, supplementary experimental information can enhance the prediction ability of the method. PMID:17360525
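
    A conjoint triad descriptor can be sketched as follows. The abstract does not spell out the encoding, so the seven-class amino acid grouping, the min-max normalization and the concatenation of the two partner vectors are assumptions based on how the method is commonly described. All function names are hypothetical.

```python
# Assumed seven-class grouping of the 20 amino acids (not given in the abstract).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}


def conjoint_triad(seq):
    """Return a 343-dimensional vector of normalized class-triad frequencies."""
    counts = [0] * (7 ** 3)
    for i in range(len(seq) - 2):
        try:
            a, b, c = (AA_TO_CLASS[x] for x in seq[i:i + 3])
        except KeyError:
            continue  # skip windows containing a non-standard residue
        counts[a * 49 + b * 7 + c] += 1
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return [0.0] * len(counts)
    # Min-max scaling is an assumption; it reduces the effect of sequence length.
    return [(x - lo) / (hi - lo) for x in counts]


def pair_features(seq_a, seq_b):
    """Concatenate both descriptors to represent one candidate protein pair."""
    return conjoint_triad(seq_a) + conjoint_triad(seq_b)


vec = conjoint_triad("AGVAGV")
pair = pair_features("AGVAGV", "RKDEC")
```

    The resulting fixed-length vector for a pair could then be passed to any standard classifier, such as an SVM, regardless of the lengths of the two sequences.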

  20. Integrative visual analysis of protein sequence mutations

    PubMed Central

    2014-01-01

    Background: An important aspect of studying the relationship between protein sequence, structure and function is the molecular characterization of the effect of protein mutations. To understand the functional impact of amino acid changes, the multiple biological properties of protein residues have to be considered together. Results: Here, we present a novel visual approach for analyzing residue mutations. It combines different biological visualizations and integrates them with molecular data derived from external resources. To show various aspects of the biological information on different scales, our approach includes one-dimensional sequence views, three-dimensional protein structure views and two-dimensional views of residue interaction networks as well as aggregated views. The views are linked tightly and synchronized to reduce the cognitive load of the user when switching between them. In particular, the protein mutations are mapped onto the views together with further functional and structural information. We also assess the impact of individual amino acid changes by the detailed analysis and visualization of the involved residue interactions. We demonstrate the effectiveness of our approach and the developed software on the data provided for the BioVis 2013 data contest. Conclusions: Our visual approach and software greatly facilitate the integrative and interactive analysis of protein mutations based on complementary visualizations. The different data views offered to the user are enriched with information about molecular properties of amino acid residues and further biological knowledge. PMID:25237389

  1. A novel method for similarity/dissimilarity analysis of protein sequences

    NASA Astrophysics Data System (ADS)

    Mu, Zengchao; Wu, Jing; Zhang, Yusen

    2013-12-01

    Sequence comparison is one of the major tasks in bioinformatics, which can be used to study structural and functional conservation, as well as evolutionary relations among the sequences. In this paper, we introduce the concept of distance frequency of amino acid pairs and propose a new numerical characterization of protein sequences, which converts any protein sequence into a distance frequency matrix. Using this distance frequency matrix, we can compare the similarity of protein sequences. In order to confirm the validity of our method, we test it with two experiments. The results show that our method is effective.

  2. The nucleotide sequence of the mouse immunoglobulin epsilon gene: comparison with the human epsilon gene sequence.

    PubMed Central

    Ishida, N; Ueda, S; Hayashida, H; Miyata, T; Honjo, T

    1982-01-01

    We have determined the nucleotide sequence of the immunoglobulin epsilon gene cloned from newborn mouse DNA. The epsilon gene sequence allows prediction of the amino acid sequence of the constant region of the epsilon chain and comparison of it with sequences of the human epsilon and other mouse immunoglobulin genes. The epsilon gene was shown to be under the weakest selection pressure at the protein level among the immunoglobulin genes although the divergence at the synonymous position is similar. Our results suggest that the epsilon gene may be dispensable, which is in accord with the fact that IgE has only obscure roles in the immune defense system but has an undesirable role as a mediator of hypersensitivity. The sequence data suggest that the human and murine epsilon genes were derived from different ancestors duplicated a long time ago. The amino acid sequence of the epsilon chain is more homologous to those of the gamma chains than the other mouse heavy chains. Two membrane exons, separated by an 80-base intron, were identified 1.7 kb 3' to the CH4 domain of the epsilon gene and shown to conserve a hydrophobic portion similar to those of other heavy chain genes. RNA blot hybridization showed that the epsilon membrane exons are transcribed into two species of mRNA in an IgE hybridoma. PMID:6329728

  3. Sequence analysis of the AAA protein family.

    PubMed Central

    Beyer, A.

    1997-01-01

    The AAA protein family, a recently recognized group of Walker-type ATPases, has been subjected to an extensive sequence analysis. Multiple sequence alignments revealed the existence of a region of sequence similarity, the so-called AAA cassette. The borders of this cassette were localized and within it, three boxes of a high degree of conservation were identified. Two of these boxes could be assigned to substantial parts of the ATP binding site (namely, to Walker motifs A and B); the third may be a portion of the catalytic center. Phylogenetic trees were calculated to obtain insights into the evolutionary history of the family. Subfamilies with varying degrees of intra-relatedness could be discriminated; these relationships are also supported by analysis of sequences outside the canonical AAA boxes: within the cassette are regions that are strongly conserved within each subfamily, whereas little or even no similarity between different subfamilies can be observed. These regions are well suited to define fingerprints for subfamilies. A secondary structure prediction utilizing all available sequence information was performed and the result was fitted to the general 3D structure of a Walker A/GTPase. The agreement was unexpectedly high and strongly supports the conclusion that the AAA family belongs to the Walker superfamily of A/GTPases. PMID:9336829

  4. Benchmarking NMR experiments: A relational database of protein pulse sequences

    NASA Astrophysics Data System (ADS)

    Senthamarai, Russell R. P.; Kuprov, Ilya; Pervushin, Konstantin

    2010-03-01

    Systematic benchmarking of multi-dimensional protein NMR experiments is a critical prerequisite for optimal allocation of NMR resources for structural analysis of challenging proteins, e.g. large proteins with limited solubility or proteins prone to aggregation. We propose a set of benchmarking parameters for essential protein NMR experiments organized into a lightweight (single XML file) relational database (RDB), which includes all the necessary auxiliaries (waveforms, decoupling sequences, calibration tables, setup algorithms and an RDB management system). The database is interfaced to the Spinach library (http://spindynamics.org), which enables accurate simulation and benchmarking of NMR experiments on large spin systems. A key feature is the ability to use a single user-specified spin system to simulate the majority of deposited solution state NMR experiments, thus providing the (hitherto unavailable) unified framework for pulse sequence evaluation. This development enables predicting relative sensitivity of deposited implementations of NMR experiments, thus providing a basis for comparison, optimization and, eventually, automation of NMR analysis. The benchmarking is demonstrated with two proteins: the 170-amino-acid I domain of αXβ2 integrin and the 440-amino-acid NS3 helicase.

  5. Sequence Analysis of Scaffolding Protein CipC and ORFXp, a New Cohesin-Containing Protein in Clostridium cellulolyticum: Comparison of Various Cohesin Domains and Subcellular Localization of ORFXp

    PubMed Central

    Pagès, Sandrine; Bélaïch, Anne; Fierobe, Henri-Pierre; Tardif, Chantal; Gaudin, Christian; Bélaïch, Jean-Pierre

    1999-01-01

    The gene encoding the scaffolding protein of the cellulosome from Clostridium cellulolyticum, whose partial sequence was published earlier (S. Pagès, A. Bélaïch, C. Tardif, C. Reverbel-Leroy, C. Gaudin, and J.-P. Bélaïch, J. Bacteriol. 178:2279–2286, 1996; C. Reverbel-Leroy, A. Bélaïch, A. Bernadac, C. Gaudin, J. P. Bélaïch, and C. Tardif, Microbiology 142:1013–1023, 1996), was completely sequenced. The corresponding protein, CipC, is composed of a cellulose binding domain at the N terminus followed by one hydrophilic domain (HD1), seven highly homologous cohesin domains (cohesin domains 1 to 7), a second hydrophilic domain, and a final cohesin domain (cohesin domain 8) which is only 57 to 60% identical to the seven other cohesin domains. In addition, a second gene located 8.89 kb downstream of cipC was found to encode a three-domain protein, called ORFXp, which includes a cohesin domain. By using antiserum raised against the latter, it was observed that ORFXp is associated with the membrane of C. cellulolyticum and is not detected in the cellulosome fraction. Western blot and BIAcore experiments indicate that cohesin domains 1 and 8 from CipC recognize the same dockerins and have similar affinity for CelA (Ka = 4.8 × 10⁹ M⁻¹) whereas the cohesin from ORFXp, although it is also able to bind all cellulosome components containing a dockerin, has a 19-fold lower Ka for CelA (2.6 × 10⁸ M⁻¹). Taken together, these data suggest that ORFXp may play a role in cellulosome assembly. PMID:10074072

  6. The Lassa fever virus L gene: nucleotide sequence, comparison, and precipitation of a predicted 250 kDa protein with monospecific antiserum

    PubMed Central

    Lukashevich, Igor S.; Djavani, Mahmoud; Shapiro, Keli; Sanchez, Anthony; Ravkov, Eugene; Nichol, Stuart T.; Salvato, Maria S.

    2008-01-01

    The large (L) RNA segment of Lassa fever virus (LAS) encodes a putative RNA-dependent RNA polymerase (RdRp or L protein). Similar to other arenaviruses, the LAS L protein is encoded on the genome-complementary strand and is predicted to be 2218 amino acids in length (253 kDa). It has an unusually large non-coding region adjacent to its translation start site. The LAS L protein contains six motifs of conserved amino acids that have been found among arenavirus L proteins and core RdRp of other segmented negative-stranded (SNS) viruses (Arena-, Bunya- and Orthomyxoviridae). Phylogenetic analyses of the RdRp of 20 SNS viruses reveal that arenavirus L proteins represent a distinct cluster divided into LAS–lymphocytic choriomeningitis and Tacaribe–Pichinde virus lineages. Monospecific serum against a synthetic peptide corresponding to the most conserved central domain precipitates a 250 kDa product from LAS and lymphocytic choriomeningitis virus-infected cells. PMID:9049403

  7. Algorithm, applications and evaluation for protein comparison by Ramanujan Fourier transform.

    PubMed

    Zhao, Jian; Wang, Jiasong; Hua, Wei; Ouyang, Pingkai

    2015-12-01

    The amino acid sequence of a protein determines its chemical properties, chain conformation and biological functions. Protein sequence comparison is of great importance to identify similarities of protein structures and infer their functions. Many properties of a protein correspond to the low-frequency signals within the sequence. Low-frequency modes in protein sequences are linked to the secondary structures, membrane protein types, and sub-cellular localizations of the proteins. In this paper, we present the Ramanujan Fourier transform (RFT) with a fast algorithm to analyze the low-frequency signals of protein sequences. The RFT method is applied to similarity analysis of protein sequences with the Resonant Recognition Model (RRM). The results show that the proposed fast RFT method for protein comparison is more efficient than the commonly used discrete Fourier transform (DFT). RFT can detect common frequencies as significant features for specific protein families, and the RFT spectrum heat-map of protein sequences demonstrates the information conservation in the sequence comparison. The proposed method offers a new tool for pattern recognition, feature extraction and structural analysis on protein sequences. PMID:26325081
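
    To make the transform concrete, the sketch below evaluates Ramanujan sums and a finite-length RFT coefficient by direct summation. It follows the textbook definition rather than the fast algorithm proposed in the paper. The input is assumed to be a numeric signal already derived from a protein sequence (for example through a per-residue property scale), and the function names are hypothetical.

```python
import math


def ramanujan_sum(q, n):
    """c_q(n): sum of cos(2*pi*k*n/q) over 1 <= k <= q with gcd(k, q) == 1."""
    total = sum(
        math.cos(2 * math.pi * k * n / q)
        for k in range(1, q + 1)
        if math.gcd(k, q) == 1
    )
    return round(total)  # Ramanujan sums are always integers


def totient(q):
    """Euler's totient: how many k in 1..q are coprime to q."""
    return sum(1 for k in range(1, q + 1) if math.gcd(k, q) == 1)


def rft_coefficient(signal, q):
    """Finite-length estimate of the q-th RFT coefficient (direct summation)."""
    acc = sum(x * ramanujan_sum(q, n) for n, x in enumerate(signal, start=1))
    return acc / (totient(q) * len(signal))


# Sanity check on a synthetic signal with a pure period-3 component.
signal = [ramanujan_sum(3, n) for n in range(1, 13)]
x3 = rft_coefficient(signal, 3)
x2 = rft_coefficient(signal, 2)
```

    Because the basis functions take only integer values, a period-q pattern in the signal shows up in the q-th coefficient while unrelated coefficients stay at zero.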

  8. Integrated protein function prediction by mining function associations, sequences, and protein–protein and gene–gene interaction networks

    PubMed Central

    Cao, Renzhi; Cheng, Jianlin

    2016-01-01

    Motivation: Protein function prediction is an important and challenging problem in bioinformatics and computational biology. Functionally relevant biological information such as protein sequences, gene expression, and protein–protein interactions has been used mostly separately for protein function prediction. One of the major challenges is how to effectively integrate multiple sources of both traditional and new information such as spatial gene–gene interaction networks generated from chromosomal conformation data together to improve protein function prediction. Results: In this work, we developed three different probabilistic scores (MIS, SEQ, and NET score) to combine protein sequence, function associations, and protein–protein interaction and spatial gene–gene interaction networks for protein function prediction. The MIS score is mainly generated from homologous proteins found by PSI-BLAST search, and also association rules between Gene Ontology terms, which are learned by mining the Swiss-Prot database. The SEQ score is generated from protein sequences. The NET score is generated from protein–protein interaction and spatial gene–gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. We tested SMISS on the data set of 2011 Critical Assessment of Function Annotation (CAFA). The method performed substantially better than three base-line methods and an advanced method based on protein profile–sequence comparison, profile–profile comparison, and domain co-occurrence networks according to the maximum F-measure. PMID:26370280
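
    The abstract reports results by maximum F-measure without defining it. The sketch below assumes the protein-centric definition commonly used in CAFA-style evaluations: precision is averaged over proteins with at least one prediction at a threshold, recall is averaged over all benchmark proteins, and the best F over all thresholds is kept. The data layout, the function name and the GO identifiers are hypothetical.

```python
def f_max(predictions, truth, steps=100):
    """Maximum F-measure over score thresholds t = 1/steps, ..., 1.

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: non-empty set of true terms}
    """
    best = 0.0
    for k in range(1, steps + 1):
        t = k / steps
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Hypothetical toy benchmark with made-up term identifiers.
truth = {"P1": {"GO:1", "GO:2"}, "P2": {"GO:3"}}
predictions = {"P1": {"GO:1": 0.9, "GO:9": 0.2}, "P2": {"GO:3": 0.8}}
result = f_max(predictions, truth)
```

    Sweeping the threshold makes the measure independent of how each method calibrates its scores, so methods with different score scales can still be compared.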

  9. Diverse nucleotide compositions and sequence fluctuation in Rubisco protein genes

    NASA Astrophysics Data System (ADS)

    Holden, Todd; Dehipawala, S.; Cheung, E.; Bienaime, R.; Ye, J.; Tremberger, G., Jr.; Schneider, P.; Lieberman, D.; Cheung, T.

    2011-10-01

    The Rubisco protein-enzyme is arguably the most abundant protein on Earth. The biology dogma of transcription and translation necessitates the study of the Rubisco genes and Rubisco-like genes in various species. Stronger correlation of fractal dimension of the atomic number fluctuation along a DNA sequence with Shannon entropy has been observed in the studied Rubisco-like gene sequences, suggesting a more diverse evolutionary pressure and constraints in the Rubisco sequences. The strategy of using metal for structural stabilization appears to be an ancient mechanism, with data from the porphobilinogen deaminase gene in Capsaspora owczarzaki and Monosiga brevicollis. Using the chi-square distance probability, our analysis supports the conjecture that the more ancient Rubisco-like sequence in Microcystis aeruginosa would have experienced very different evolutionary pressure and bio-chemical constraint as compared to Bordetella bronchiseptica, the two microbes occupying either end of the correlation graph. Our exploratory study would indicate that high fractal dimension Rubisco sequence would support high carbon dioxide rate via the Michaelis-Menten coefficient; with implication for the control of the whooping cough pathogen Bordetella bronchiseptica, a microbe containing a high fractal dimension Rubisco-like sequence (2.07). Using the internal comparison of chi-square distance probability for 16S rRNA (~ E-22) versus radiation repair Rec-A gene (~ E-05) in high GC content Deinococcus radiodurans, our analysis supports the conjecture that high GC content microbes containing Rubisco-like sequence are likely to include an extra-terrestrial origin, relative to Deinococcus radiodurans. Similar photosynthesis process that could utilize host star radiation would not compete with radiation resistant process from the biology dogma perspective in environments such as Mars and exoplanets.

  10. PROCAIN server for remote protein sequence similarity search

    PubMed Central

    Wang, Yong; Sadreyev, Ruslan I.; Grishin, Nick V.

    2009-01-01

    Sensitive and accurate detection of distant protein homology is essential for the studies of protein structure, function and evolution. We recently developed PROCAIN, a method that is based on sequence profile comparison and involves the analysis of four signals: similarities of residue content at the profile positions combined with three types of assisting information (sequence motifs, residue conservation and predicted secondary structure). Here we present the PROCAIN web server that allows the user to submit a query sequence or multiple sequence alignment and perform the search in a profile database of choice. The output is structured similarly to that of BLAST, with the list of detected homologs sorted by E-value and followed by profile–profile alignments. The front page allows the user to adjust multiple options of input processing and output formatting, as well as search settings, including the relative weights assigned to the three types of assisting information. Availability: http://prodata.swmed.edu/procain/ Contact: grishin@chop.swmed.edu PMID:19497935

  11. Complete VAX/VMS DNA/protein sequence analysis system

    SciTech Connect

    Smith, D.W.

    1987-05-01

    A complete yet flexible system of programs and database libraries for analysis of DNA, RNA and protein sequences is implemented for VAX/VMS computers. Types of analysis include 1) construction and analysis of chimeric sequences (cloning in the VAX), 2) multiple analysis of one or more single sequences, 3) search and comparison studies using sequence libraries, and 4) direct input and analysis of experimental data. Published groups of programs, including the Staden, Los Alamos, Zuker, Pearson, and PHYLIP programs, are used. GenBank and EMBL DNA libraries and PIR and Doolittle NEWAT protein libraries are available, with associated programs. The system is tutorial, with online documentation for relevant VAX software, the programs, and the databases. The complete documentation is flexibly maintained on reserve via computer printout placed in 3-ring binders. Command files are used extensively; porting of the entire system to another VAX/VMS system requires modification of a single command. Users of the system are members of a VAX group, with automatic implementation of the system upon login. The present system occupies about 140,000 blocks, and is easily expanded, or contracted, as desired. The UCSD system is used extensively for both teaching and research purposes. Use of microcomputers emulating Tektronix 4014 graphics terminals permits saving of graphics output to disk for subsequent modification to generate high quality publishable figures.

  12. Comparison of non-sequential sets of protein residues.

    PubMed

    Garma, Leonardo D; Juffer, André H

    2016-04-01

    A methodology for performing sequence-free comparison of functional sites in protein structures is introduced. The method is based on a new notion of similarity among superimposed groups of amino acid residues that evaluates both geometry and physico-chemical properties. The method is specifically designed to handle disconnected and sparsely distributed sets of residues. A genetic algorithm is employed to find the superimposition of protein segments that maximizes their similarity. The method was evaluated by performing an all-to-all comparison on two separate sets of ligand-binding sites, comprising 47 protein-FAD (Flavin-Adenine Dinucleotide) and 64 protein-NAD (Nicotinamide-Adenine Dinucleotide) complexes, and comparing the results with those of an existing sequence-based structural alignment tool (TM-Align). The quality of the two methodologies is judged by the methods' capacity to, among other things, correctly predict the similarities in the protein-ligand contact patterns of each pair of binding sites. The results show that using a sequence-free method significantly improves over the sequence-based one, resulting in 23 significant binding-site homologies being detected by the new method but ignored by the sequence-based one. PMID:26773655

  13. Integrated visual analysis of protein structures, sequences, and feature data

    PubMed Central

    2015-01-01

    Background: To understand the molecular mechanisms that give rise to a protein's function, biologists often need to (i) find and access all related atomic-resolution 3D structures, and (ii) map sequence-based features (e.g., domains, single-nucleotide polymorphisms, post-translational modifications) onto these structures. Results: To streamline these processes we recently developed Aquaria, a resource offering unprecedented access to protein structure information based on an all-against-all comparison of SwissProt and PDB sequences. In this work, we provide a requirements analysis for several frequently occurring tasks in molecular biology and describe how design choices in Aquaria meet these requirements. Finally, we show how the interface can be used to explore features of a protein and gain biologically meaningful insights in two case studies conducted by domain experts. Conclusions: The user interface design of Aquaria enables biologists to gain unprecedented access to molecular structures and simplifies the generation of insight. The tasks involved in mapping sequence features onto structures can be conducted more easily and quickly using Aquaria. PMID:26329268

  14. DNA Sequencing Using an Engineered Protein Nanopore

    NASA Astrophysics Data System (ADS)

    Gundlach, Jens H.

    2010-03-01

    Inexpensive and fast sequencing of DNA is of paramount importance to medicine, the life sciences and to many other applications. Because of the nanometer diameter of DNA a nanometer-scale reader directly interfaced to macroscopic observables seems particularly attractive. We are working on a new single molecule technique based on a biological pore embedded in a lipid bilayer. When a voltage is applied across the bilayer an ion current is measured that flows through the nanometer opening of the pore. Poly-negatively charged single stranded DNA passes through the pore and reduces the ion current with the remaining ion current being indicative of the nucleotide type in the constriction of the pore. The protein pore that we introduced to the field, MspA, has a shape ideally suited to nanopore sequencing, has robustness comparable to solid state devices, is easily reproduced with sub-nanometer level precision and is engineerable using genetic mutations. I will present proof-of-principle data showing that this technique can lead to a direct very inexpensive and fast sequencing technology. The experimental electronic signatures of the DNA translocation process provide an ideal test bed for molecular dynamics simulations, which in turn allows developing intuition and prediction of nanoscale dynamics.

  15. Comparison of Next-Generation Sequencing Systems

    PubMed Central

    Liu, Lin; Li, Yinhu; Li, Siliang; Hu, Ni; He, Yimin; Pong, Ray; Lin, Danni; Lu, Lihua; Law, Maggie

    2012-01-01

    With the fast development and wide application of next-generation sequencing (NGS) technologies, genomic sequence information is within reach to aid the achievement of goals to decode life mysteries, make better crops, detect pathogens, and improve quality of life. NGS systems are typically represented by SOLiD/Ion Torrent PGM from Life Sciences, Genome Analyzer/HiSeq 2000/MiSeq from Illumina, and GS FLX Titanium/GS Junior from Roche. Beijing Genomics Institute (BGI), which possesses the world's biggest sequencing capacity, has multiple NGS systems including 137 HiSeq 2000, 27 SOLiD, one Ion Torrent PGM, one MiSeq, and one 454 sequencer. We have accumulated extensive experience in sample handling, sequencing, and bioinformatics analysis. In this paper, technologies of these systems are reviewed, and first-hand data from extensive experience is summarized and analyzed to discuss the advantages and specifics associated with each sequencing system. Finally, applications of NGS are summarized. PMID:22829749

  16. RCARE: RNA Sequence Comparison and Annotation for RNA Editing

    PubMed Central

    2015-01-01

    The post-transcriptional sequence modification of transcripts through RNA editing is an important mechanism for regulating protein function and is associated with human disease phenotypes. The identification of RNA editing or RNA-DNA difference (RDD) sites is a fundamental step in the study of RNA editing. However, a substantial number of false-positive RDD sites have been identified recently. A major challenge in identifying RDD sites is to distinguish between the true RNA editing sites and the false positives. Furthermore, determining the location of condition-specific RDD sites and elucidating their functional roles will help toward understanding various biological phenomena that are mediated by RNA editing. The present study developed RNA-sequence comparison and annotation for RNA editing (RCARE) for searching, annotating, and visualizing RDD sites using thousands of previously known editing sites, which can be used for comparative analyses between multiple samples. RCARE also provides evidence for improving the reliability of identified RDD sites. RCARE is a web-based comparison, annotation, and visualization tool, which provides rich biological annotations and useful summary plots. The developers of previous tools that identify or annotate RNA-editing sites seldom mention the reliability of their respective tools. In order to address the issue, RCARE utilizes a number of scientific publications and databases to find specific documentations respective to a particular RNA-editing site, which generates evidence levels to convey the reliability of RCARE. Sequence-based alignment files can be converted into VCF files using a Python script and uploaded to the RCARE server for further analysis. RCARE is available for free at http://www.snubi.org/software/rcare/. PMID:26043858

  17. Similarity/Dissimilarity Analysis of Protein Sequences Based on a New Spectrum-Like Graphical Representation

    PubMed Central

    Yao, Yuhua; Yan, Shoujiang; Xu, Huimin; Han, Jianning; Nan, Xuying; He, Ping-an; Dai, Qi

    2014-01-01

    Sequence comparison is one of the foundations of bioinformatics and can be used to study evolutionary relations among sequences. In this study, a 2D spectrum-like graphical representation of protein sequences is presented based on the hydrophobicity scale of amino acids. The frequencies of amplitudes of 4-subsequences are adopted to characterize a spectrum-like graph, and a 17D vector is used as the descriptor of a protein sequence. A χ2 compatibility test is performed. The new similarity analysis approach is illustrated on all protein sequences encoded by the mitochondrial genomes of 20 different species. Finally, comparison with the ClustalW method shows the utility of our method. PMID:25002811

  18. Partner-Aware Prediction of Interacting Residues in Protein-Protein Complexes from Sequence Data

    PubMed Central

    Ahmad, Shandar; Mizuguchi, Kenji

    2011-01-01

    Computational prediction of residues that participate in protein-protein interactions is a difficult task, and state of the art methods have shown only limited success in this arena. One possible problem with these methods is that they try to predict interacting residues without incorporating information about the partner protein, although it is unclear how much partner information could enhance prediction performance. To address this issue, the two following comparisons are of crucial significance: (a) comparison between the predictability of inter-protein residue pairs, i.e., predicting exactly which residue pairs interact with each other given two protein sequences; this can be achieved by either combining conventional single-protein predictions or making predictions using a new model trained directly on the residue pairs, and the performance of these two approaches may be compared: (b) comparison between the predictability of the interacting residues in a single protein (irrespective of the partner residue or protein) from conventional methods and predictions converted from the pair-wise trained model. Using these two streams of training and validation procedures and employing similar two-stage neural networks, we showed that the models trained on pair-wise contacts outperformed the partner-unaware models in predicting both interacting pairs and interacting single-protein residues. Prediction performance decreased with the size of the conformational change upon complex formation; this trend is similar to docking, even though no structural information was used in our prediction. An example application that predicts two partner-specific interfaces of a protein was shown to be effective, highlighting the potential of the proposed approach. Finally, a preliminary attempt was made to score docking decoy poses using prediction of interacting residue pairs; this analysis produced an encouraging result. PMID:22194998

  19. Size dependent complexity of sequences in protein families

    NASA Astrophysics Data System (ADS)

    Li, J.; Wang, J.; Wang, W.

    2005-10-01

    The size dependent complexity of protein sequences in various families in the FSSP database is characterized by sequence entropy, sequence similarity and sequence identity. As the average length Lf of sequences in the family increases, an increasing trend of the sequence entropy and a decreasing trend of the sequence similarity and sequence identity are found. As Lf increases beyond 250, a saturation of the sequence entropy, the sequence similarity and the sequence identity is observed. Such a saturated behavior of complexity is attributed to the saturation of the probability Pg of global (long-range) interactions in protein structures when Lf >250. It is also found that the alphabet size of residue types describing the sequence diversity depends on the value of Lf, and becomes saturated at 12.
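
    As a rough illustration of the sequence entropy measure mentioned above, the following Python sketch computes the Shannon entropy of each column of a multiple sequence alignment and averages it over the alignment. The abstract does not give the exact definition used by the authors, so the per-column formulation, the treatment of gaps, and the function names (column_entropy, mean_sequence_entropy) are assumptions made for illustration only.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residue frequencies in one column.

    Gap characters ('-') are ignored; this is an illustrative assumption,
    not necessarily the convention used in the study.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_sequence_entropy(alignment):
    """Average column entropy over a list of equal-length aligned sequences."""
    if not alignment or not alignment[0]:
        raise ValueError("alignment must contain at least one non-empty sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*alignment) iterates over alignment columns.
    return sum(column_entropy(col) for col in zip(*alignment)) / length


if __name__ == "__main__":
    toy_family = ["ACDE", "ACDF", "AGDE"]  # hypothetical toy alignment
    print(mean_sequence_entropy(toy_family))
```

    A fully conserved column contributes zero entropy, while a column with two equally frequent residues contributes one bit, so the average rises as a family becomes more diverse.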

  20. Rapid automatic detection and alignment of repeats in protein sequences.

    PubMed

    Heger, A; Holm, L

    2000-11-01

    Many large proteins have evolved by internal duplication and many internal sequence repeats correspond to functional and structural units. We have developed an automatic algorithm, RADAR, for segmenting a query sequence into repeats. The segmentation procedure has three steps: (i) repeat length is determined by the spacing between suboptimal self-alignment traces; (ii) repeat borders are optimized to yield a maximal integer number of repeats, and (iii) distant repeats are validated by iterative profile alignment. The method identifies short composition biased as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in the query sequence. No manual intervention and no prior assumptions on the number and length of repeats are required. Comparison to the Pfam-A database indicates good coverage, accurate alignments, and reasonable repeat borders. Screening the Swissprot database revealed 3,000 repeats not annotated in existing domain databases. A number of these repeats had been described in the literature but most were novel. This illustrates how in times when curated databases grapple with ever increasing backlogs, automatic (re)analysis of sequences provides an efficient way to capture this important information. PMID:10966575

  1. Sequence comparison via polar coordinates representation and curve tree.

    PubMed

    Dai, Qi; Guo, Xiaodong; Li, Lihua

    2012-01-01

    Sequence comparison has become one of the essential tools in bioinformatics research and can serve as evidence of structural and functional conservation, as well as of evolutionary relations among sequences. Existing graphical representation methods have achieved promising results in sequence comparison, but there are some design challenges with the graphical representations and feature-based measures. We report here a new method for sequence comparison. It considers the whole distribution of dual bases and employs a polar coordinates method to map a biological sequence into a closed curve. A curve tree is then constructed to numerically characterize the closed curve of a biological sequence, and sequences are compared by evaluating the distance between the curve tree of the query sequence and the corresponding curve tree of the template sequence. The proposed method was tested by phylogenetic analysis, and its performance was further compared with alignment-based methods. The results demonstrate that using polar coordinates representation and curve tree to compare sequences is more efficient. PMID:22001081

  2. Visualizing and Clustering Protein Similarity Networks: Sequences, Structures, and Functions.

    PubMed

    Mai, Te-Lun; Hu, Geng-Ming; Chen, Chi-Ming

    2016-07-01

    Research in the recent decade has demonstrated the usefulness of protein network knowledge in furthering the study of molecular evolution of proteins, understanding the robustness of cells to perturbation, and annotating new protein functions. In this study, we aimed to provide a general clustering approach to visualize the sequence-structure-function relationship of protein networks, and investigate possible causes for inconsistency in the protein classifications based on sequences, structures, and functions. Such visualization of protein networks could facilitate our understanding of the overall relationship among proteins and help researchers comprehend various protein databases. As a demonstration, we clustered 1437 enzymes by their sequences and structures using the minimum span clustering (MSC) method. The general structure of this protein network was delineated at two clustering resolutions, and the second level MSC clustering was found to be highly similar to existing enzyme classifications. The clustering of these enzymes based on sequence, structure, and function information is consistent with each other. For proteases, the Jaccard's similarity coefficient is 0.86 between sequence and function classifications, 0.82 between sequence and structure classifications, and 0.78 between structure and function classifications. From our clustering results, we discussed possible examples of divergent evolution and convergent evolution of enzymes. Our clustering approach provides a panoramic view of the sequence-structure-function network of proteins, helps visualize the relation between related proteins intuitively, and is useful in predicting the structure and function of newly determined protein sequences. PMID:27267620
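
    The Jaccard similarity coefficients quoted above compare two classifications of the same set of proteins. One common way to compute such a coefficient is by pair counting, sketched below in Python. The abstract does not state which variant the authors used, so the pair-counting definition and the function name jaccard_between_classifications are assumptions for illustration.

```python
from itertools import combinations


def jaccard_between_classifications(labels_a, labels_b):
    """Pair-counting Jaccard index between two labelings of the same items.

    A pair of items counts as 'together' in a classification when both items
    carry the same label. The index is
        (pairs together in both) / (pairs together in at least one).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
    denominator = both + only_a + only_b
    # If no pair is grouped together in either labeling, treat them as identical.
    return both / denominator if denominator else 1.0


if __name__ == "__main__":
    by_sequence = ["s1", "s1", "s2", "s2"]  # hypothetical sequence clusters
    by_function = ["f1", "f1", "f2", "f3"]  # hypothetical function classes
    print(jaccard_between_classifications(by_sequence, by_function))
```

    The loop visits every pair of items, so the cost grows quadratically with the number of proteins; for a set of about 1,400 enzymes this is roughly one million pairs, which is still quick to evaluate.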

  3. PDNAsite: Identification of DNA-binding Site from Protein Sequence by Incorporating Spatial and Sequence Context

    PubMed Central

    Zhou, Jiyun; Xu, Ruifeng; He, Yulan; Lu, Qin; Wang, Hongpeng; Kong, Bing

    2016-01-01

    Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most of the existing computational approaches employed only the sequence context of the target residue for its prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove the redundancies in the feature space. Finally, a predictor (PDNAsite) was developed through the integration of the support vector machines (SVM) classifier and ensemble learning. Results on the PDNA-62 and the PDNA-224 datasets demonstrate that features extracted from spatial context provide more information than those from sequence context and the combination of them gives more performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that the interactions between binding sites next to each other are important for protein-DNA recognition and their binding ability. The comparison between our proposed PDNAsite method and the existing methods indicates that PDNAsite outperforms most of the existing methods and is a useful tool for DNA-binding site identification. A web server for our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is freely accessible to the biological research community. PMID:27282833

  4. PDNAsite: Identification of DNA-binding Site from Protein Sequence by Incorporating Spatial and Sequence Context.

    PubMed

    Zhou, Jiyun; Xu, Ruifeng; He, Yulan; Lu, Qin; Wang, Hongpeng; Kong, Bing

    2016-01-01

    Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most of the existing computational approaches employed only the sequence context of the target residue for its prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove the redundancies in the feature space. Finally, a predictor (PDNAsite) was developed through the integration of the support vector machines (SVM) classifier and ensemble learning. Results on the PDNA-62 and the PDNA-224 datasets demonstrate that features extracted from spatial context provide more information than those from sequence context and the combination of them gives more performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that the interactions between binding sites next to each other are important for protein-DNA recognition and their binding ability. The comparison between our proposed PDNAsite method and the existing methods indicates that PDNAsite outperforms most of the existing methods and is a useful tool for DNA-binding site identification. A web server for our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is freely accessible to the biological research community. PMID:27282833

  5. The influence of protein coding sequences on protein folding rates of all-β proteins.

    PubMed

    Li, Rui Fang; Li, Hong

    2011-06-01

    It is currently believed that the protein folding rate is related to the protein structure and its amino acid sequence. However, few studies have examined whether the protein folding rate is influenced by the corresponding mRNA sequence. In this paper, we analyzed the possible relationship between protein folding rates and the corresponding mRNA sequences. The content of guanine and cytosine (GC content) of palindromes in the protein coding sequence was introduced as a new parameter and added to Gromiha's model for predicting protein folding rates to inspect its effect on the protein folding process. The multiple linear regression analysis and jack-knife test show that the new parameter is significant. The linear correlation coefficient between the experimental and the predicted values of the protein folding rates increased significantly from 0.96 to 0.99, and the population variance decreased from 0.50 to 0.24 compared with Gromiha's results. The results show that the GC content of palindromes in the corresponding protein coding sequence really influences the protein folding rate. Further analysis indicates that this kind of effect mostly comes from the synonymous codon usage and from the information of palindrome structure itself, but not from the translation information from codons to amino acids. PMID:21613670
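
    To make the "GC content of palindromes" parameter more concrete, here is a minimal Python sketch. The abstract does not define the palindromes precisely, so this sketch assumes reverse-complement palindromes (a window equal to its own reverse complement) within a hypothetical length range, and reports the fraction of G and C among all positions covered by at least one such window. The function names and the min_len and max_len defaults are illustrative assumptions, not the authors' settings.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID_BASES = set("ACGT")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def palindrome_positions(seq, min_len=4, max_len=12):
    """Indices covered by at least one reverse-complement palindrome.

    Reverse-complement palindromes always have even length, so with an even
    min_len the step of 2 visits exactly the even window sizes.
    """
    covered = set()
    for length in range(min_len, max_len + 1, 2):
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            # Skip windows containing ambiguous or non-DNA characters.
            if set(window) <= VALID_BASES and window == reverse_complement(window):
                covered.update(range(start, start + length))
    return covered


def palindrome_gc_content(seq, min_len=4, max_len=12):
    """Fraction of G/C among positions lying inside palindromic windows."""
    seq = seq.upper()
    covered = palindrome_positions(seq, min_len, max_len)
    if not covered:
        return 0.0
    return sum(1 for i in covered if seq[i] in "GC") / len(covered)


if __name__ == "__main__":
    print(palindrome_gc_content("ATGGAATTCGCGCTAA"))  # hypothetical coding fragment
```

    Counting each covered position once avoids double-weighting bases that lie in overlapping or nested palindromes; a different convention, such as averaging per palindrome, would give different values.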

  6. Proteins: sequence to structure and function--current status.

    PubMed

    Shenoy, Sandhya R; Jayaram, B

    2010-11-01

    In an era that has been dominated by Structural Biology for the last 30-40 years, a dramatic change of focus towards sequence analysis has spurred the advent of the genome projects and the resultant diverging sequence/structure deficit. The central challenge of Computational Structural Biology is therefore to rationalize the mass of sequence information into biochemical and biophysical knowledge and to decipher the structural, functional and evolutionary clues encoded in the language of biological sequences. In investigating the meaning of sequences, two distinct analytical themes have emerged: in the first approach, pattern recognition techniques are used to detect similarity between sequences and hence to infer related structures and functions; in the second ab initio prediction methods are used to deduce 3D structure, and ultimately to infer function, directly from the linear sequence. In this article, we attempt to provide a critical assessment of what one may and may not expect from the biological sequences and to identify major issues yet to be resolved. The presentation is organized under several subtitles like protein sequences, pattern recognition techniques, protein tertiary structure prediction, membrane protein bioinformatics, human proteome, protein-protein interactions, metabolic networks, potential drug targets based on simple sequence properties, disordered proteins, the sequence-structure relationship and chemical logic of protein sequences. PMID:20887265

  7. Recognition of Yeast Species from Gene Sequence Comparisons

    Technology Transfer Automated Retrieval System (TEKTRAN)

    This review discusses recognition of yeast species from gene sequence comparisons, which have been responsible for doubling the number of known yeasts over the past decade. The resolution provided by various single gene sequences is examined for both ascomycetous and basidiomycetous species, and th...

  8. Selection and sequence analysis of a cDNA clone encoding a known chorion protein of the A family.

    PubMed Central

    Tsitilou, S G; Regier, J C; Kafatos, F C

    1980-01-01

    Using as criteria the size, abundance and developmental specificity of hybridizing mRNA sequences, we have selected from our chorion cDNA library a clone corresponding to a specific chorion protein, A4--cl. Comparison between the clone sequence and the largely known sequence of A4--cl validates the use of the cDNA library for sequence analysis of the chorion multigene families. The two major chorion protein families, A and B, share certain structural similarities. Images PMID:7433133

  9. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE PAGESBeta

    Leung, Elo; Huang, Amy; Cadag, Eithon; Montana, Aldrin; Soliman, Jan Lorenz; Zhou, Carol L. Ecale

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.

  10. The bioinformatics of nucleotide sequence coding for proteins requiring metal coenzymes and proteins embedded with metals

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Cheung, E.; Holden, T.; Sullivan, R.; Nguyen, A.; Lieberman, D.; Cheung, T.

    2015-09-01

    All metallo-proteins need post-translation metal incorporation. In fact, the isotope ratio of Fe, Cu, and Zn in physiology and oncology have emerged as an important tool. The nickel containing F430 is the prosthetic group of the enzyme methyl coenzyme M reductase which catalyzes the release of methane in the final step of methano-genesis, a prime energy metabolism candidate for life exploration space mission in the solar system. The 3.5 Gyr early life sulfite reductase as a life switch energy metabolism had Fe-Mo clusters. The nitrogenase for nitrogen fixation 3 billion years ago had Mo. The early life arsenite oxidase needed for anoxygenic photosynthesis energy metabolism 2.8 billion years ago had Mo and Fe. The selection pressure in metal incorporation inside a protein would be quantifiable in terms of the related nucleotide sequence complexity with fractal dimension and entropy values. Simulation model showed that the studied metal-required energy metabolism sequences had at least ten times more selection pressure relatively in comparison to the horizontal transferred sequences in Mealybug, guided by the outcome histogram of the correlation R-sq values. The metal energy metabolism sequence group was compared to the circadian clock KaiC sequence group using magnesium atomic level bond shifting mechanism in the protein, and the simulation model would suggest a much higher selection pressure for the energy life switch sequence group. The possibility of using Kepler 444 as an example of ancient life in Galaxy with the associated exoplanets has been proposed and is further discussed in this report. Examples of arsenic metal bonding shift probed by Synchrotron-based X-ray spectroscopy data and Zn controlled FOXP2 regulated pathways in human and chimp brain studied tissue samples are studied in relationship to the sequence bioinformatics. The analysis results suggest that relatively large metal bonding shift amount is associated with low probability correlation R

  11. Functional proteins from a random-sequence library

    PubMed Central

    Keefe, Anthony D; Szostak, Jack W.

    2015-01-01

    Functional primordial proteins presumably originated from random sequences, but it is not known how frequently functional, or even folded, proteins occur in collections of random sequences. Here we have used in vitro selection of messenger RNA displayed proteins, in which each protein is covalently linked through its carboxy terminus to the 3′ end of its encoding mRNA [1], to sample a large number of distinct random sequences. Starting from a library of 6 × 10^12 proteins each containing 80 contiguous random amino acids, we selected functional proteins by enriching for those that bind to ATP. This selection yielded four new ATP-binding proteins that appear to be unrelated to each other or to anything found in the current databases of biological proteins. The frequency of occurrence of functional proteins in random-sequence libraries appears to be similar to that observed for equivalent RNA libraries [2,3]. PMID:11287961

  12. PSSARD: protein sequence-structure analysis relational database.

    PubMed

    Guruprasad, Kunchur; Srikanth, K; Babu, A V N

    2005-09-15

    We have implemented a relational database comprising a representative dataset of amino acid sequences and their associated secondary structure. The representative amino acid sequences were selected according to the PDB_SELECT program by choosing proteins corresponding to protein crystal structure data deposited in the protein data bank that share less than 25% overall pair-wise sequence identity. The secondary structure was extracted from the protein data bank website. The information content in the database includes the protein description, PDB code, crystal structure resolution, total number of amino acid residues in the protein chain, amino acid sequence, secondary structure conformation and its summary. The database is freely accessible from the website mentioned below and is useful to query on any of the above fields. The database is particularly useful to quickly retrieve amino acid sequences that are compatible to any super-secondary structure conformation from several proteins simultaneously. PMID:16054209

  13. Folding and Stabilization of Native-Sequence-Reversed Proteins.

    PubMed

    Zhang, Yuanzhao; Weber, Jeffrey K; Zhou, Ruhong

    2016-01-01

    Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols. PMID:27113844

  14. Folding and Stabilization of Native-Sequence-Reversed Proteins

    PubMed Central

    Zhang, Yuanzhao; Weber, Jeffrey K; Zhou, Ruhong

    2016-01-01

    Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols. PMID:27113844

  15. Genomic Sequence Comparisons, 1987-2003 Final Report

    SciTech Connect

    George M. Church

    2004-07-29

    This project was to develop new DNA sequencing and RNA and protein quantitation methods and related genome annotation tools. The project began in 1987 with the development of multiplex sequencing (published in Science in 1988), and one of the first automated sequencing methods. This led to the first commercial genome sequence in 1994 and to the establishment of the main commercial participants (GTC then Agencourt) in the public DOE/NIH genome project. In collaboration with GTC we contributed to one of the first complete DOE genome sequences, in 1997, that of Methanobacterium thermoautotrophicum, a species of great relevance to energy-rich gas production.

  16. Orpinomyces cellulase celf protein and coding sequences

    DOEpatents

    Li, Xin-Liang; Chen, Huizhong; Ljungdahl, Lars G.

    2000-09-05

    A cDNA (1,520 bp), designated celF, consisting of an open reading frame (ORF) encoding a polypeptide (CelF) of 432 amino acids was isolated from a cDNA library of the anaerobic rumen fungus Orpinomyces PC-2 constructed in Escherichia coli. Analysis of the deduced amino acid sequence showed that, starting from the N-terminus, CelF consists of a signal peptide and a cellulose binding domain (CBD), followed by an extremely Asn-rich linker region which separates the CBD and the catalytic domain. The latter is located at the C-terminus. The catalytic domain of CelF is highly homologous to CelA and CelC of Orpinomyces PC-2, to CelA of Neocallimastix patriciarum and also to cellobiohydrolase IIs (CBHIIs) from aerobic fungi. However, like CelA of Neocallimastix patriciarum, CelF does not have the noncatalytic repeated peptide domain (NCRPD) found in CelA and CelC from the same organism. The recombinant protein CelF hydrolyzes cellooligosaccharides in the pattern of CBHII, yielding only cellobiose as product with cellotetraose as the substrate. The genomic celF is interrupted by a 111 bp intron, located within the region coding for the CBD. The intron of celF has features in common with genes from aerobic filamentous fungi.

  17. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins

    SciTech Connect

    Shen, Yufeng; Tolic, Nikola; Hixson, Kim K.; Purvine, Samuel O.; Anderson, Gordon A.; Smith, Richard D.

    2008-10-15

    De novo sequencing holds promise for discovering protein post-translational modifications; however, the approach is still in its infancy and is not widely applied in proteomics practice because of its limited reliability. In this work, we describe a de novo sequencing approach for discovery of protein modifications through identification of the UStags (Anal. Chem. 2008, 80, 1871-1882). The de novo information was obtained from Fourier-transform tandem mass spectrometry for peptides and polypeptides in a yeast lysate, and the de novo sequences obtained were filtered to define a more limited set of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags’ prefix and suffix sequences and the UStags themselves) were used to infer the possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances of yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. Random matching of the de novo sequences to the predicted sequences was examined using two random (false) databases, and false discovery rates of ~3% were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity are described. The de novo-UStag approach complements the previously reported UStag method by enabling discovery of new protein modifications.

  18. MESSA: MEta-Server for protein Sequence Analysis

    PubMed Central

    2012-01-01

    Background Computational sequence analysis, that is, prediction of local sequence properties, homologs, spatial structure and function from the sequence of a protein, offers an efficient way to obtain needed information about proteins under study. Since reliable prediction is usually based on the consensus of many computer programs, meta-servers have been developed to fit such needs. Most meta-servers focus on one aspect of sequence analysis, while others incorporate more information, such as PredictProtein for local sequence feature predictions, SMART for domain architecture and sequence motif annotation, and GeneSilico for secondary and spatial structure prediction. However, as predictions of local sequence properties, three-dimensional structure and function are usually intertwined, it is beneficial to address them together. Results We developed a MEta-Server for protein Sequence Analysis (MESSA) to facilitate comprehensive protein sequence analysis and gather structural and functional predictions for a protein of interest. For an input sequence, the server exploits a number of select tools to predict local sequence properties, such as secondary structure, structurally disordered regions, coiled coils, signal peptides and transmembrane helices; detect homologous proteins and assign the query to a protein family; identify three-dimensional structure templates and generate structure models; and provide predictive statements about the protein's function, including functional annotations, Gene Ontology terms, enzyme classification and possible functionally associated proteins. We tested MESSA on the proteome of Candidatus Liberibacter asiaticus. Manual curation shows that three-dimensional structure models generated by MESSA covered around 75% of all the residues in this proteome and the function of 80% of all proteins could be predicted. Availability MESSA is free for non-commercial use at http://prodata.swmed.edu/MESSA/ PMID:23031578

  19. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM

    PubMed Central

    Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

    2015-01-01

    Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is important for prediction of protein structural class and that it mainly uses the protein primary sequence, the predicted secondary structure sequence, and the position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on the 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves favorable and competitive performance. This will offer an important complement to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences. PMID:26788119

  20. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM.

    PubMed

    Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

    2015-01-01

    Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is important for prediction of protein structural class and that it mainly uses the protein primary sequence, the predicted secondary structure sequence, and the position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on the 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves favorable and competitive performance. This will offer an important complement to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences. PMID:26788119

  1. EVEREST: automatic identification and classification of protein domains in all protein sequences

    PubMed Central

    Portugaly, Elon; Harel, Amir; Linial, Nathan; Linial, Michal

    2006-01-01

    Background Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. Results Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. Conclusion The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data.

  2. An algorithm to find all palindromic sequences in proteins.

    PubMed

    Prasanth, N; Vaishnavi, M Kirti; Sekar, K

    2013-03-01

    A palindrome is a set of characters that reads the same forwards and backwards. Since the discovery of palindromic peptide sequences two decades ago, little effort has been made to understand their structural, functional and evolutionary significance. Therefore, an algorithm has been developed to identify all perfect palindromes (excluding the palindromic subset and tandem repeats) in a single protein sequence. The proposed algorithm does not impose any restriction on the number of residues to be given in the input sequence. This avant-garde algorithm will aid in the identification of palindromic peptide sequences of varying lengths in a single protein sequence. PMID:23385825
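
    The abstract does not describe how the algorithm works, so the following is only a minimal sketch of one common way to enumerate perfect palindromes in a sequence: expanding outwards from every possible centre. The function name `find_palindromes` and the `min_length` parameter are assumptions made for illustration, and the sketch does not attempt the filtering of palindromic subsets and tandem repeats mentioned above.

```python
def find_palindromes(sequence, min_length=3):
    """Return (start, end, substring) for the longest palindrome around each centre.

    Illustrative sketch only: this is not the published algorithm, and it does
    not exclude tandem repeats. Intervals are half-open: sequence[start:end].
    """
    seq = sequence.upper()
    n = len(seq)
    hits = []
    # A palindrome is centred either on a residue (odd length)
    # or on the gap between two residues (even length).
    for centre in range(2 * n - 1):
        left = centre // 2
        right = left + centre % 2
        # Expand outwards while the flanking residues match.
        while left >= 0 and right < n and seq[left] == seq[right]:
            left -= 1
            right += 1
        start, end = left + 1, right
        if end - start >= min_length:
            hits.append((start, end, seq[start:end]))
    return sorted(hits)


if __name__ == "__main__":
    for start, end, palindrome in find_palindromes("MKAYAKL"):
        print(start, end, palindrome)
```

    Because only the longest palindrome per centre is reported, shorter palindromes nested around the same centre are skipped; the sketch places no limit on input length, and its running time grows quadratically in the worst case.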

  3. The SWISS-PROT protein sequence data bank: current status.

    PubMed Central

    Bairoch, A; Boeckmann, B

    1994-01-01

    SWISS-PROT is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1988, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library. The SWISS-PROT protein sequence data bank consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. A sample SWISS-PROT entry is shown in Figure 1. PMID:7937062
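
    To illustrate the idea of entries built from typed lines, here is a minimal sketch that groups the lines of one flat-file entry by their two-character line code. It assumes the usual flat-file layout (line code in the first two columns, data from the sixth column, sequence lines starting with blanks, `//` terminating the entry). The entry in `TOY_ENTRY` is invented for illustration and is not a real database record; `group_by_line_code` is a hypothetical helper name.

```python
from collections import defaultdict

# Invented toy entry, for illustration only (not a real SWISS-PROT record).
TOY_ENTRY = "\n".join([
    "ID   TOY_EXAMPLE   STANDARD;   PRT;   12 AA.",
    "AC   X00000;",
    "DE   INVENTED PROTEIN FOR ILLUSTRATION.",
    "SQ   SEQUENCE   12 AA;",
    "     MKTAYIAKQR QI",
    "//",
])


def group_by_line_code(entry_text):
    """Group the lines of a single entry by line code; return (fields, sequence)."""
    fields = defaultdict(list)
    sequence_parts = []
    for line in entry_text.splitlines():
        if line.startswith("//"):          # entry terminator
            break
        code, content = line[:2], line[5:].rstrip()
        if code.strip():
            fields[code].append(content)   # annotated line such as ID, AC, DE, SQ
        else:
            # Sequence data lines begin with blanks; drop the spacing between blocks.
            sequence_parts.append(content.replace(" ", ""))
    return dict(fields), "".join(sequence_parts)


if __name__ == "__main__":
    fields, sequence = group_by_line_code(TOY_ENTRY)
    print(sorted(fields), sequence)
```

    A real parser would also need to handle multi-entry files and the specific grammar of each line type, which this sketch leaves out.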

  4. Fold Recognition Using Sequence Fingerprints of Protein Local Substructures

    SciTech Connect

    Kryshtafovych, A A; Hvidsten, T; Komorowski, J; Fidelis, K

    2003-06-04

    A protein local substructure (descriptor) is a set of several short non-overlapping fragments of the polypeptide chain. Each descriptor describes the local environment of a particular residue and includes only those segments that are located in the proximity of this residue. Similar descriptors from the representative set of proteins were analyzed to reveal links between the substructures and sequences of their segments. Using the detected sequence-based fingerprints, specific geometrical conformations are assigned to new sequences. The ability of the approach to recognize correct SCOP folds was tested on 273 sequences from the 49 most popular folds. Good predictions were obtained in 85% of cases. No performance drop was observed with decreasing sequence similarity between target sequences and sequences from the training set of proteins.

  5. Comparison of protein structures using 3D profile alignment.

    PubMed

    Suyama, M; Matsuo, Y; Nishikawa, K

    1997-01-01

    A novel method for protein structure comparison using 3D profile alignment is presented. The 3D profile is a position-dependent scoring matrix derived from three-dimensional structures and is basically used to estimate sequence-structure compatibility for prediction of protein structure. Our idea is to compare two 3D profiles using a dynamic programming algorithm to obtain optimal alignment and a similarity score between them. When the 3D profile of hemoglobin was compared with each of the profiles in the library, which contained 325 profiles of representative structures, all the profiles of other globins were detected with relatively high scores, and proteins in the same structural class followed the globins. Exhaustive comparison of 3D profiles in the library was also performed to depict protein relatedness in the structure space. Using multidimensional scaling, a planar projection of points in the protein structure space revealed an overall grouping in terms of structural classes, i.e., all-alpha, all-beta, alpha/beta, and alpha+beta. These results differ in implication from those obtained by the conventional structure-structure comparison method. Differences are discussed with respect to the structural divergence of proteins in the course of molecular evolution. PMID:9071025
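
    The comparison of two position-dependent profiles by dynamic programming can be sketched roughly as follows. This is a generic global alignment over profile columns, not the authors' implementation: the column similarity (a dot product), the linear gap penalty and the name `align_profiles` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
def align_profiles(profile_a, profile_b, gap=-1.0):
    """Globally align two profiles (lists of equal-width score vectors).

    Returns (similarity_score, aligned_column_pairs). Column similarity is a
    plain dot product and gaps cost a fixed penalty; both are assumptions.
    """
    def column_score(u, v):
        return sum(x * y for x, y in zip(u, v))

    n, m = len(profile_a), len(profile_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the dynamic programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + column_score(profile_a[i - 1], profile_b[j - 1])
            score[i][j] = max(match, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the corner to recover which columns were paired.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        match = score[i - 1][j - 1] + column_score(profile_a[i - 1], profile_b[j - 1])
        if score[i][j] == match:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


if __name__ == "__main__":
    a = [[1, 0], [0, 1], [1, 0]]
    b = [[1, 0], [1, 0]]
    print(align_profiles(a, b))
```

    Running such an alignment between one query profile and every profile in a library, then ranking by score, would mirror the kind of library scan described above.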

  6. Intra-species sequence comparisons for annotating genomes

    SciTech Connect

    Boffelli, Dario; Weer, Claire V.; Weng, Li; Lewis, Keith D.; Shoukry, Malak I.; Pachter, Lior; Keys, David N.; Rubin, Edward M.

    2004-07-15

    Analysis of sequence variation among members of a single species offers a potential approach to identify functional DNA elements responsible for biological features unique to that species. Due to its high rate of allelic polymorphism and ease of genetic manipulability, we chose the sea squirt, Ciona intestinalis, to explore intra-species sequence comparisons for genome annotation. A large number of C. intestinalis specimens were collected from four continents and a set of genomic intervals amplified, resequenced and analyzed to determine the mutation rates at each nucleotide in the sequence. We found that regions with low mutation rates efficiently demarcated functionally constrained sequences: these include a set of noncoding elements, which we showed in C. intestinalis transgenic assays to act as tissue-specific enhancers, as well as the location of coding sequences. This illustrates that comparisons of multiple members of a species can be used for genome annotation. It suggests a path for the annotation of the sequenced genomes of organisms occupying uncharacterized phylogenetic branches of the animal kingdom, and raises the possibility that the resequencing of a large number of Homo sapiens individuals might be used to annotate the human genome and identify sequences defining traits unique to our species. The sequence data from this study have been submitted to GenBank under accession nos. AY667278-AY667407.
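
    As a rough illustration of scoring variation at each nucleotide across resequenced individuals, the sketch below computes, for every column of a set of pre-aligned sequences, the fraction of individuals that differ from the most common base. This is a crude proxy and not the mutation-rate estimate used in the study; the function names and the `threshold` value are assumptions.

```python
from collections import Counter


def per_site_variation(aligned_sequences):
    """Fraction of sequences at each column that differ from the majority base."""
    if not aligned_sequences:
        return []
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    rates = []
    for column in zip(*aligned_sequences):
        majority_count = Counter(column).most_common(1)[0][1]
        rates.append(1 - majority_count / len(column))
    return rates


def low_variation_sites(rates, threshold=0.1):
    """Indices of columns whose variation is at or below an assumed threshold."""
    return [i for i, rate in enumerate(rates) if rate <= threshold]


if __name__ == "__main__":
    rates = per_site_variation(["ACGT", "ACGA", "ACTT", "ACGT"])
    print(rates, low_variation_sites(rates))
```

    Runs of consecutive low-variation columns would then be the candidate constrained regions in this simplified picture.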

  7. Sequencing proteins with transverse ionic transport in nanochannels.

    PubMed

    Boynton, Paul; Di Ventra, Massimiliano

    2016-01-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer's Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique's potential for de novo protein sequencing. PMID:27140520

  8. Sequencing proteins with transverse ionic transport in nanochannels

    PubMed Central

    Boynton, Paul; Di Ventra, Massimiliano

    2016-01-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer’s Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique’s potential for de novo protein sequencing. PMID:27140520

  9. Sequencing proteins with transverse ionic transport in nanochannels

    NASA Astrophysics Data System (ADS)

    Boynton, Paul; di Ventra, Massimiliano

    2016-05-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer’s Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique’s potential for de novo protein sequencing.

  10. Sequence-based protein stabilization in the absence of glycosylation.

    PubMed

    Tan, Nikki Y; Bailey, Ulla-Maja; Jamaluddin, M Fairuz; Mahmud, S Halimah Binte; Raman, Suresh C; Schulz, Benjamin L

    2014-01-01

    Asparagine-linked N-glycosylation is a common modification of proteins that promotes productive protein folding and increases protein stability. Although N-glycosylation is important for glycoprotein folding, the precise sites of glycosylation are often not conserved between protein homologues. Here we show that, in Saccharomyces cerevisiae, proteins upregulated during sporulation under nutrient deprivation have few N-glycosylation sequons and in their place tend to contain clusters of like-charged amino-acid residues. Incorporation of such sequences complements loss of in vivo protein function in the absence of glycosylation. Targeted point mutation to create such sequence stretches at glycosylation sequons in model glycoproteins increases in vitro protein stability and activity. A dependence on glycosylation for protein stability or activity can therefore be rescued with a small number of local point mutations, providing evolutionary flexibility in the precise location of N-glycans, allowing protein expression under nutrient-limiting conditions, and improving recombinant protein production. PMID:24434425
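
    For readers unfamiliar with the term, an N-glycosylation sequon is conventionally the motif Asn-X-Ser/Thr, where X is any residue except proline. The sketch below locates such motifs in a one-letter protein sequence; the name `find_sequons` is hypothetical, and the example sequence is made up.

```python
import re

# N, then any residue except P, then S or T. The lookahead lets
# overlapping sequons (for example in "NNTT") each be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return (position, tripeptide) for each N-X-S/T motif, with 0-based positions."""
    seq = sequence.upper()
    return [(m.start(), seq[m.start():m.start() + 3]) for m in SEQUON.finditer(seq)]


if __name__ == "__main__":
    print(find_sequons("MNGTANPSNNT"))
```

    Counting sequons per protein in this way is one simple route to comparing sequon frequency between sets of proteins, although the presence of a sequon does not by itself mean the site is glycosylated.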

  11. Protein sequence classification with improved extreme learning machine algorithms.

    PubMed

    Cao, Jiuwen; Xiong, Lianglin

    2014-01-01

    Precisely classifying a protein sequence from a large biological protein sequence database plays an important role in developing competitive pharmacological products. Conventional methods, which compare the unseen sequence with all the identified protein sequences and return the category index of the protein with the highest similarity score, are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single-hidden-layer feedforward networks (SLFNs). The recent efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimal pruned ELM (OP-ELM) is first employed for protein sequence classification in this paper. To further enhance the performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms. PMID:24795876
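
    The majority-voting step that combines the ensemble members can be sketched very simply. The snippet assumes each network has already produced a category index for the query sequence; the function name `majority_vote` is hypothetical, and how ties are broken is not stated in the abstract, so the sketch just returns whichever top-scoring category `Counter` yields.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the category index predicted by the largest number of ensemble members."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    # most_common(1) yields the (category, votes) pair with the highest count.
    category, _votes = Counter(predictions).most_common(1)[0]
    return category


if __name__ == "__main__":
    # Five hypothetical ensemble members voting on one query sequence.
    print(majority_vote([2, 5, 2, 7, 2]))
```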

  12. Improvement of protein structure comparison using a structural alphabet.

    PubMed

    Joseph, Agnel Praveen; Srinivasan, N; de Brevern, Alexandre G

    2011-09-01

    The three-dimensional structure of a protein provides major insights into its function. Protein structure comparison has implications in functional and evolutionary studies. A structural alphabet (SA) is a library of local protein structure prototypes that can abstract every part of protein main chain conformation. Protein Blocks (PBs) is a widely used SA, composed of 16 prototypes, each representing a pentapeptide backbone conformation defined in terms of dihedral angles. Through this description, the 3D structural information can be translated into a 1D sequence of PBs. In a previous study, we used this approach to compare protein structures encoded in terms of PBs. A classical sequence alignment procedure based on dynamic programming was used, with a dedicated PB Substitution Matrix (SM). The PB-based pairwise structural alignment method gave an excellent performance when compared to other established methods for mining. In this study, we have (i) refined the SMs and (ii) improved the Protein Block Alignment methodology (named iPBA). The SM was normalized with regard to sequence and structural similarity. Alignment of protein structures often involves similar structural regions separated by dissimilar stretches. A dynamic programming algorithm that weighs these local similar stretches has been designed. Amino acid substitution scores were also coupled linearly with the PB substitutions. iPBA improves the mining efficiency rate by 6.8%, and more than 82% of the alignments have a better quality. A higher efficiency in aligning multi-domain proteins could also be demonstrated. The quality of alignment is better than DALI and MUSTANG in 81.3% of the cases. Thus our study has resulted in an impressive improvement in the quality of protein structural alignment. PMID:21569819

  13. Describing Sequence-Ensemble Relationships for Intrinsically Disordered Proteins

    PubMed Central

    Mao, Albert H.; Lyle, Nicholas; Pappu, Rohit V.

    2014-01-01

    Synopsis Intrinsically disordered proteins participate in important protein-protein and protein-nucleic acid interactions and control cellular phenotypes through their prominence as dynamic organizers of transcriptional, post-transcriptional, and signaling networks. These proteins challenge the tenets of the structure-function paradigm and their functional mechanisms remain a mystery given that they fail to fold autonomously into specific structures. Solving this mystery requires a first principles understanding of the quantitative relationships between information encoded in the sequences of disordered proteins and the ensemble of conformations they sample. Advances in quantifying sequence-ensemble relationships have been facilitated through a four-way synergy between bioinformatics, biophysical experiments, computer simulations, and polymer physics theories. Here, we review these advances and the resultant insights that allow us to develop a concise quantitative framework for describing sequence-ensemble relationships of intrinsically disordered proteins. PMID:23240611

  14. Protein sequence design and its applications.

    PubMed

    Sandhya, Sankaran; Mudgal, Richa; Kumar, Gayatri; Sowdhamini, Ramanathan; Srinivasan, Narayanaswamy

    2016-04-01

    Design of proteins has far-reaching potentials in diverse areas that span repurposing of the protein scaffold for reactions and substrates that they were not naturally meant for, to catching a glimpse of the ephemeral proteins that nature might have sampled during evolution. These non-natural proteins, either in synthesized or virtual form have opened the scope for the design of entities that not only rival their natural counterparts but also offer a chance to visualize the protein space continuum that might help to relate proteins and understand their associations. Here, we review the recent advances in protein engineering and design, in multiple areas, with a view to drawing attention to their future potential. PMID:26773478

  15. Computationally mapping sequence space to understand evolutionary protein engineering.

    PubMed

    Armstrong, Kathryn A; Tidor, Bruce

    2008-01-01

    Evolutionary protein engineering has been dramatically successful, producing a wide variety of new proteins with altered stability, binding affinity, and enzymatic activity. However, the success of such procedures is often unreliable, and the impact of the choice of protein, engineering goal, and evolutionary procedure is not well understood. We have created a framework for understanding aspects of the protein engineering process by computationally mapping regions of feasible sequence space for three small proteins using structure-based design protocols. We then tested the ability of different evolutionary search strategies to explore these sequence spaces. The results point to a non-intuitive relationship between the error-prone PCR mutation rate and the number of rounds of replication. The evolutionary relationships among feasible sequences reveal hub-like sequences that serve as particularly fruitful starting sequences for evolutionary search. Moreover, genetic recombination procedures were examined, and tradeoffs relating sequence diversity and search efficiency were identified. This framework allows us to consider the impact of protein structure on the allowed sequence space and therefore on the challenges that each protein presents to error-prone PCR and genetic recombination procedures. PMID:18020358

  16. Dissecting the relationship between protein structure and sequence variation

    NASA Astrophysics Data System (ADS)

    Shahmoradi, Amir; Wilke, Claus; Wilke Lab Team

    2015-03-01

    Over the past decade several independent works have shown that some structural properties of proteins are capable of predicting protein evolution. The strength and significance of these structure-sequence relations, however, appear to vary widely among different proteins, with absolute correlation strengths ranging from 0.1 to 0.8. Here we present the results from a comprehensive search for the potential biophysical and structural determinants of protein evolution by studying more than 200 structural and evolutionary properties in a dataset of 209 monomeric enzymes. We discuss the main protein characteristics responsible for the general patterns of protein evolution, and identify sequence divergence as the main determinant of the strengths of virtually all structure-evolution relationships, explaining ~10-30% of observed variation in sequence-structure relations. In addition to sequence divergence, we identify several protein structural properties that are moderately but significantly coupled with the strength of sequence-structure relations. In particular, proteins with more homogeneous backbone hydrogen bond energies, large fractions of helical secondary structures and low fraction of beta sheets tend to have the strongest sequence-structure relation. BEACON-NSF center for the study of evolution in action.

  17. PROMALS web server for accurate multiple protein sequence alignments.

    PubMed

    Pei, Jimin; Kim, Bong-Hyun; Tang, Ming; Grishin, Nick V

    2007-07-01

    Multiple sequence alignments are essential in homology inference, structure modeling, functional prediction and phylogenetic analysis. We developed a web server that constructs multiple protein sequence alignments using PROMALS, a progressive method that improves alignment quality by using additional homologs from PSI-BLAST searches and secondary structure predictions from PSIPRED. PROMALS shows higher alignment accuracy than other advanced methods, such as MUMMALS, ProbCons, MAFFT and SPEM. The PROMALS web server takes FASTA format protein sequences as input. The output includes a colored alignment augmented with information about sequence grouping, predicted secondary structures and positional conservation. The PROMALS web server is available at: http://prodata.swmed.edu/promals/ PMID:17452345

  18. Comparison of mitochondrial genome sequences of pangolins (Mammalia, Pholidota).

    PubMed

    Hassanin, Alexandre; Hugot, Jean-Pierre; van Vuuren, Bettine Jansen

    2015-04-01

    The complete mitochondrial genome was sequenced for three species of pangolins, Manis javanica, Phataginus tricuspis, and Smutsia temminckii, and comparisons were made with two other species, Manis pentadactyla and Phataginus tetradactyla. The genome of Manidae contains the 37 genes found in a typical mammalian genome, and the structure of the control region is highly conserved among species. In Manis, the overall base composition differs from that found in African genera. Phylogenetic analyses support the monophyly of the genera Manis, Phataginus, and Smutsia, as well as the basal division between Maninae and Smutsiinae. Comparisons with GenBank sequences reveal that the reference genomes of M. pentadactyla and P. tetradactyla (accession numbers NC_016008 and NC_004027) were sequenced from misidentified taxa, and that a new species of tree pangolin should be described in Gabon. PMID:25746396

  19. DNA Shape versus Sequence Variations in the Protein Binding Process.

    PubMed

    Chen, Chuanying; Pettitt, B Montgomery

    2016-02-01

    The binding process of a protein with a DNA involves three stages: approach, encounter, and association. It has been known that the complexation of protein and DNA involves mutual conformational changes, especially for a specific sequence association. However, it is still unclear how the conformation and the information in the DNA sequences affect the binding process. To what extent is the DNA structure adopted in the complex induced by protein binding, and to what extent is it intrinsic to the DNA sequence? In this study, we used the multiscale simulation method to explore the binding process of a protein with DNA in terms of DNA sequence, conformation, and interactions. We found that in the approach stage the protein can bind both the major and minor groove of the DNA, but uses different features to locate the binding site. The intrinsic conformational properties of the DNA play a significant role in this binding stage. By comparing the specific DNA with the nonspecific in unbound, intermediate, and associated states, we found that for a specific DNA sequence, ∼40% of the bending in the association forms is intrinsic and that ∼60% is induced by the protein. The protein does not induce appreciable bending of nonspecific DNA. In addition, we proposed that the DNA shape variations induced by protein binding are required in the early stage of the binding process, so that the protein is able to approach, encounter, and form an intermediate at the correct site on DNA. PMID:26840719

  20. Sequence variation in ligand binding sites in proteins

    PubMed Central

    Magliery, Thomas J; Regan, Lynne

    2005-01-01

    Background The recent explosion in the availability of complete genome sequences has led to the cataloging of tens of thousands of new proteins and putative proteins. Many of these proteins can be structurally or functionally categorized from sequence conservation alone. In contrast, little attention has been given to the meaning of poorly-conserved sites in families of proteins, which are typically assumed to be of little structural or functional importance. Results Recently, using statistical free energy analysis of tetratricopeptide repeat (TPR) domains, we observed that positions in contact with peptide ligands are more variable than surface positions in general. Here we show that statistical analysis of TPRs, ankyrin repeats, Cys2His2 zinc fingers and PDZ domains accurately identifies specificity-determining positions by their sequence variation. Sequence variation is measured as deviation from a neutral reference state, and we present probabilistic and information theory formalisms that improve upon recently suggested methods such as statistical free energies and sequence entropies. Conclusion Sequence variation has been used to identify functionally-important residues in four selected protein families. With TPRs and ankyrin repeats, protein families that bind highly diverse ligands, the effect is so pronounced that sequence "hypervariation" alone can be used to predict ligand binding sites. PMID:16194281
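
    The abstract above measures sequence variation as deviation from a neutral reference state but does not give the formalism. As a rough illustration only, the sketch below scores each alignment column by its relative entropy (Kullback-Leibler divergence) from a background distribution. The uniform background, the function names, and the gap handling are assumptions made here and are not the authors' method.

```python
import math
from collections import Counter

# Hypothetical neutral reference: uniform over the 20 standard amino acids.
# A real analysis would more likely use observed background frequencies.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_relative_entropy(column, background=BACKGROUND):
    """Kullback-Leibler divergence (in bits) of one alignment column from the background."""
    residues = [c for c in column if c in background]  # skip gaps and unknown symbols
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return sum(
        (n / total) * math.log2((n / total) / background[aa])
        for aa, n in counts.items()
    )


def alignment_relative_entropies(aligned_seqs):
    """Return one relative-entropy value per column of an equal-length alignment."""
    return [column_relative_entropy(col) for col in zip(*aligned_seqs)]
```

    Under this assumed measure, a fully conserved column scores log2(20) bits, while a column whose composition matches the background scores zero, so low values flag the highly variable positions that the abstract associates with ligand binding in TPRs and ankyrin repeats.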

  1. Using homology relations within a database markedly boosts protein sequence similarity search.

    PubMed

    Tong, Jing; Sadreyev, Ruslan I; Pei, Jimin; Kinch, Lisa N; Grishin, Nick V

    2015-06-01

    Inference of homology from protein sequences provides an essential tool for analyzing protein structure, function, and evolution. Current sequence-based homology search methods are still unable to detect many similarities evident from protein spatial structures. In computer science a search engine can be improved by considering networks of known relationships within the search database. Here, we apply this idea to protein-sequence-based homology search and show that it dramatically enhances the search accuracy. Our new method, COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query and hit's known homologs. This approach increases detection quality, boosting the precision rate from 18% to 83% at half-coverage of all database homologs. The increased precision rate allows detection of a large fraction of protein structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities. The web server is available at prodata.swmed.edu/compadre. PMID:26038555

  2. What Makes a Protein Sequence a Prion?

    PubMed Central

    Sabate, Raimon; Rousseau, Frederic; Schymkowitz, Joost; Ventura, Salvador

    2015-01-01

    Typical amyloid diseases such as Alzheimer's and Parkinson's were thought to exclusively result from de novo aggregation, but recently it was shown that amyloids formed in one cell can cross-seed aggregation in other cells, following a prion-like mechanism. Despite the large experimental effort devoted to understanding the phenomenon of prion transmissibility, it is still poorly understood how this property is encoded in the primary sequence. In many cases, prion structural conversion is driven by the presence of relatively large glutamine/asparagine (Q/N) enriched segments. Several studies suggest that it is the amino acid composition of these regions rather than their specific sequence that accounts for their priogenicity. However, our analysis indicates that it is instead the presence and potency of specific short amyloid-prone sequences that occur within intrinsically disordered Q/N-rich regions that determine their prion behaviour, modulated by the structural and compositional context. This provides a basis for the accurate identification and evaluation of prion candidate sequences in proteomes in the context of a unified framework for amyloid formation and prion propagation. PMID:25569335

  3. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins.

    PubMed

    Shen, Yufeng; Tolić, Nikola; Hixson, Kim K; Purvine, Samuel O; Anderson, Gordon A; Smith, Richard D

    2008-10-15

    De novo sequencing is a spectrum analysis approach for mass spectrometry data to discover post-translational modifications in proteins; however, such an approach is still in its infancy and is still not widely applied to proteomic practices due to its limited reliability. In this work, we describe a de novo sequencing approach for the discovery of protein modifications based on identification of the proteome UStags (Shen, Y.; Tolić, N.; Hixson, K. K.; Purvine, S. O.; Pasa-Tolić, L.; Qian, W. J.; Adkins, J. N.; Moore, R. J.; Smith, R. D. Anal. Chem. 2008, 80, 1871-1882). The de novo information was obtained from Fourier-transform tandem mass spectrometry data for peptides and polypeptides from a yeast lysate, and the de novo sequences obtained were selected based on filter levels designed to provide a limited yet high quality subset of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags' prefix and suffix sequences and the UStags themselves) were used to infer possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances within several yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. To determine false discovery rates, two random (false) databases were independently used for sequence matching, and ~3% false discovery rates were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity of the approach were investigated and described. The combined de novo-UStag approach complements the UStag method previously reported by enabling the discovery of new protein modifications. PMID:18783246

  4. Multifractals, encoded walks and the ergodicity of protein sequences.

    PubMed

    Dewey, T G; Strait, B J

    1996-01-01

    A variety of statistical methods have been developed to explore correlations in protein and nucleic acid sequences. Such correlations have important implications for the evolution and stability of these macromolecules. Recently, a number of fractal analyses of sequence data have been developed. These analyses have considerable appeal as they are extremely sensitive to long range correlations and to hierarchical structures. One such analysis decodes sequence information into a random walk, and the statistics of the resulting random walk are investigated. Anomalous scaling of such walks has been interpreted as indicative of a fractal structure. Alternatively, a generalized box counting analysis of decoded sequences can be used to establish multifractal properties. In this work, the connection between these two seemingly disparate approaches is established. This connection is exploited to investigate correlations in protein sequences. An ensemble consisting of a comprehensive data set of representative protein sequences is analyzed to establish the ergodicity of protein sequences. The implications of this ergodicity for information theoretical approaches to protein structure prediction are explored. PMID:9390234
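
    The walk construction described above can be sketched as follows. The abstract does not say how residues are mapped to steps, so the binary hydrophobic/non-hydrophobic encoding, the residue set, and the function names below are illustrative assumptions rather than the authors' procedure.

```python
from itertools import accumulate

# Assumed binary encoding: the choice of "hydrophobic" residues is made here
# for illustration and is not taken from the abstract.
HYDROPHOBIC = set("AVLIMFWC")


def encoded_walk(sequence):
    """Map a protein sequence to a 1-D walk: +1 per hydrophobic residue, -1 otherwise."""
    steps = [1 if aa in HYDROPHOBIC else -1 for aa in sequence.upper()]
    return list(accumulate(steps))


def mean_squared_displacement(walk, lag):
    """Average squared displacement between walk positions separated by `lag` steps."""
    diffs = [walk[i + lag] - walk[i] for i in range(len(walk) - lag)]
    if not diffs:
        raise ValueError("lag must be smaller than the walk length")
    return sum(d * d for d in diffs) / len(diffs)
```

    For an uncorrelated, unbiased walk the mean squared displacement grows linearly with the lag; a different growth exponent is the kind of anomalous scaling the abstract refers to.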

  5. MIPS: a database for genomes and protein sequences

    PubMed Central

    Mewes, H. W.; Frishman, D.; Gruber, C.; Geier, B.; Haase, D.; Kaps, A.; Lemcke, K.; Mannhaupt, G.; Pfeiffer, F.; Schüller, C.; Stocker, S.; Weil, B.

    2000-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried, near Munich, Germany, continues its longstanding tradition to develop and maintain high quality curated genome databases. In addition, efforts have been intensified to cover the wealth of complete genome sequences in a systematic, comprehensive form. Bioinformatics, supporting national as well as European sequencing and functional analysis projects, has resulted in several up-to-date genome-oriented databases. This report describes growing databases reflecting the progress of sequencing the Arabidopsis thaliana (MATDB) and Neurospora crassa genomes (MNCDB), the yeast genome database (MYGD) extended by functional analysis data, the database of annotated human EST-clusters (HIB) and the database of the complete cDNA sequences from the DHGP (German Human Genome Project). It also contains information on the up-to-date database of complete genomes (PEDANT), the classification of protein sequences (ProtFam) and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database. These databases can be accessed through the MIPS WWW server (http://www.mips.biochem.mpg.de). PMID:10592176

  6. Sequence and comparative genomic analysis of actin-related proteins.

    PubMed

    Muller, Jean; Oma, Yukako; Vallar, Laurent; Friederich, Evelyne; Poch, Olivier; Winsor, Barbara

    2005-12-01

    Actin-related proteins (ARPs) are key players in cytoskeleton activities and nuclear functions. Two complexes, ARP2/3 and ARP1/11, also known as dynactin, are implicated in actin dynamics and in microtubule-based trafficking, respectively. ARP4 to ARP9 are components of many chromatin-modulating complexes. Conventional actins and ARPs codefine a large family of homologous proteins, the actin superfamily, with a tertiary structure known as the actin fold. Because ARPs and actin share high sequence conservation, clear family definition requires distinct features to easily and systematically identify each subfamily. In this study we performed an in depth sequence and comparative genomic analysis of ARP subfamilies. A high-quality multiple alignment of approximately 700 complete protein sequences homologous to actin, including 148 ARP sequences, allowed us to extend the ARP classification to new organisms. Sequence alignments revealed conserved residues, motifs, and inserted sequence signatures to define each ARP subfamily. These discriminative characteristics allowed us to develop ARPAnno (http://bips.u-strasbg.fr/ARPAnno), a new web server dedicated to the annotation of ARP sequences. Analyses of sequence conservation among actins and ARPs highlight part of the actin fold and suggest interactions between ARPs and actin-binding proteins. Finally, analysis of ARP distribution across eukaryotic phyla emphasizes the central importance of nuclear ARPs, particularly the multifunctional ARP4. PMID:16195354

  7. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm

    PubMed Central

    Kumar, Manish

    2015-01-01

    One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic aim of a multiple sequence alignment problem is to determine the most biologically plausible alignments of protein or DNA sequences. In this paper, an alignment method using a genetic algorithm for multiple sequence alignment has been proposed. Two genetic operators, crossover and mutation, were defined and implemented with the proposed method in order to track the population evolution and the quality of the aligned sequences. The proposed method is assessed on a protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8. Experiments on a wide range of data have shown that the proposed algorithm is much better (in terms of score) than previously proposed algorithms in its ability to achieve high alignment quality. PMID:27065770

  8. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm.

    PubMed

    Kumar, Manish

    2015-01-01

    One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic aim of a multiple sequence alignment problem is to determine the most biologically plausible alignments of protein or DNA sequences. In this paper, an alignment method using a genetic algorithm for multiple sequence alignment has been proposed. Two genetic operators, crossover and mutation, were defined and implemented with the proposed method in order to track the population evolution and the quality of the aligned sequences. The proposed method is assessed on a protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8. Experiments on a wide range of data have shown that the proposed algorithm is much better (in terms of score) than previously proposed algorithms in its ability to achieve high alignment quality. PMID:27065770
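
    A genetic algorithm for alignment needs a fitness function to rank candidate alignments. The abstract does not state which score is used; the sketch below assumes the common sum-of-pairs objective, with match, mismatch and gap values chosen arbitrarily for illustration.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length gapped strings.

    The scoring values are hypothetical defaults, not those of the paper.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # gap-gap pairs contribute nothing
            if x == "-" or y == "-":
                score += gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score
```

    In a genetic algorithm of the kind described, a function like this would be evaluated on each individual after crossover and mutation, and higher-scoring alignments would be favored during selection.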

  9. Protein 3D Structure Computed from Evolutionary Sequence Variation

    PubMed Central

    Sheridan, Robert; Hopf, Thomas A.; Pagnani, Andrea; Zecchina, Riccardo; Sander, Chris

    2011-01-01

    The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein

  10. 3D structures of membrane proteins from genomic sequencing

    PubMed Central

    Hopf, Thomas A.; Colwell, Lucy J.; Sheridan, Robert; Rost, Burkhard; Sander, Chris; Marks, Debora S.

    2012-01-01

    Summary We show that amino acid co-variation in proteins, extracted from the evolutionary sequence record, can be used to fold transmembrane proteins. We use this technique to predict previously unknown, 3D structures for 11 transmembrane proteins (with up to 14 helices) from their sequences alone. The prediction method (EVfold_membrane), applies a maximum entropy approach to infer evolutionary co-variation in pairs of sequence positions within a protein family and then generates all-atom models with the derived pairwise distance constraints. We benchmark the approach with blinded, de novo computation of known transmembrane protein structures from 23 families, demonstrating unprecedented accuracy of the method for large transmembrane proteins. We show how the method can predict oligomerization, functional sites, and conformational changes in transmembrane proteins. With the rapid rise in large-scale sequencing, more accurate and more comprehensive information on evolutionary constraints can be decoded from genetic variation, greatly expanding the repertoire of transmembrane proteins amenable to modelling by this method. PMID:22579045
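
    For orientation, a maximum entropy model constrained by single-site and pairwise statistics of an alignment generally takes the form below. The notation is chosen here for illustration and the abstract itself gives no formula: A_i is the residue at alignment position i, L the alignment length, e_ij the pair couplings, h_i the single-site fields, Z the normalization constant, and f_i, f_ij the observed single-site and pair frequencies.

```latex
P(A_1,\dots,A_L) \;=\; \frac{1}{Z}\,
  \exp\!\Big( \sum_{1 \le i < j \le L} e_{ij}(A_i, A_j) \;+\; \sum_{i=1}^{L} h_i(A_i) \Big),
\qquad
\text{subject to } P_i(A) = f_i(A), \quad P_{ij}(A, B) = f_{ij}(A, B).
```

    In this picture, position pairs with strong inferred couplings e_ij are the candidates for the pairwise distance constraints that the abstract says are used to generate all-atom models.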

  11. Molecular sled sequences are common in mammalian proteins

    PubMed Central

    Xiong, Kan; Blainey, Paul C.

    2016-01-01

    Recent work revealed a new class of molecular machines called molecular sleds, which are small basic molecules that bind and slide along DNA with the ability to carry cargo along DNA. Here, we performed biochemical and single-molecule flow stretching assays to investigate the basis of sliding activity in molecular sleds. In particular, we identified the functional core of pVIc, the first molecular sled characterized; peptide functional groups that control sliding activity; and propose a model for the sliding activity of molecular sleds. We also observed widespread DNA binding and sliding activity among basic polypeptide sequences that implicate mammalian nuclear localization sequences and many cell penetrating peptides as molecular sleds. These basic protein motifs exhibit weak but physiologically relevant sequence-nonspecific DNA affinity. Our findings indicate that many mammalian proteins contain molecular sled sequences and suggest the possibility that substantial undiscovered sliding activity exists among nuclear mammalian proteins. PMID:26857546

  12. Molecular sled sequences are common in mammalian proteins.

    PubMed

    Xiong, Kan; Blainey, Paul C

    2016-03-18

    Recent work revealed a new class of molecular machines called molecular sleds, which are small basic molecules that bind and slide along DNA with the ability to carry cargo along DNA. Here, we performed biochemical and single-molecule flow stretching assays to investigate the basis of sliding activity in molecular sleds. In particular, we identified the functional core of pVIc, the first molecular sled characterized; peptide functional groups that control sliding activity; and propose a model for the sliding activity of molecular sleds. We also observed widespread DNA binding and sliding activity among basic polypeptide sequences that implicate mammalian nuclear localization sequences and many cell penetrating peptides as molecular sleds. These basic protein motifs exhibit weak but physiologically relevant sequence-nonspecific DNA affinity. Our findings indicate that many mammalian proteins contain molecular sled sequences and suggest the possibility that substantial undiscovered sliding activity exists among nuclear mammalian proteins. PMID:26857546

  13. Nucleotide sequence of the gene encoding the nitrogenase iron protein of Thiobacillus ferrooxidans

    SciTech Connect

    Pretorius, I.M.; Rawlings, D.E.; O'Neill, E.G.; Jones, W.A.; Kirby, R.; Woods, D.R.

    1987-01-01

    The DNA sequence was determined for the cloned Thiobacillus ferrooxidans nifH and part of the nifD genes. The DNA chains were radiolabeled with (α-³²P)dCTP (3000 Ci/mmol) or (α-³⁵S)dCTP (400 Ci/mmol). A putative T. ferrooxidans nifH promoter was identified whose sequences showed perfect consensus with those of the Klebsiella pneumoniae nif promoter. Two putative consensus upstream activator sequences were also identified. The amino acid sequence was deduced from the DNA sequence. In a comparison of nifH DNA sequences from T. ferrooxidans and eight other nitrogen-fixing microbes, a Rhizobium sp. isolated from Parasponia andersonii showed the greatest homology (74%) and Clostridium pasteurianum (nifH1) showed the least homology (54%). In the comparison of the amino acid sequences of the Fe proteins, the Rhizobium sp. and Rhizobium japonicum showed the greatest homology (both 86%) and C. pasteurianum (nifH1 gene product) demonstrated the least homology (56%) to the T. ferrooxidans Fe protein.
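
    Percentages like those quoted above are typically obtained by counting identical positions in a pairwise alignment. The abstract does not say how its figures were computed, so the sketch below is only one plausible convention: it assumes the percentages are percent identity and that columns containing a gap are excluded from the denominator.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) column pairs with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap (an assumed convention).
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)
```

    Other denominators, such as the full alignment length or the length of the shorter sequence, give different values for the same alignment, which is worth keeping in mind when comparing figures across studies.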

  14. In silico comparative analysis of DNA and amino acid sequences for prion protein gene.

    PubMed

    Kim, Y; Lee, J; Lee, C

    2008-01-01

    Genetic variability might contribute to species specificity of prion diseases in various organisms. In this study, structures of the prion protein gene (PRNP) and its amino acids were compared among species for which sequence data were available. Comparisons of PRNP DNA sequences among 12 species including human, chimpanzee, monkey, bovine, ovine, dog, mouse, rat, wallaby, opossum, chicken and zebrafish allowed us to identify candidate regulatory regions in intron 1 and the 3'-untranslated region (UTR) in addition to the coding region. Highly conserved putative binding sites for transcription factors, such as heat shock factor 2 (HSF2) and myocyte enhancer factor 2 (MEF2), were discovered in intron 1. In the 3'-UTR, the functional sequence (ATTAAA) for nucleus-specific polyadenylation was found in all the analysed species. The functional sequence (TTTTTAT) for maturation-specific polyadenylation was observed in identical form only in ovine, with one or two nucleotide mismatches in the other species. A comparison of the amino acid sequences in 53 species revealed a large sequence identity. In particular, the octapeptide repeat region was observed in all species except frog and zebrafish. Functional changes and susceptibility to prion diseases with various isoforms of prion protein could be caused by numeric variability and conformational changes discovered in the repeat sequences. PMID:18397498

  15. Nucleotide sequence of Bacillus phage Nf terminal protein gene.

    PubMed Central

    Leavitt, M C; Ito, J

    1987-01-01

    The nucleotide sequence of Bacillus phage Nf gene E has been determined. Gene E codes for phage terminal protein which is the primer necessary for the initiation of DNA replication. The deduced amino acid sequence of Nf terminal protein is approximately 66% homologous with the terminal proteins of Bacillus phages PZA and φ29, and shows similar hydropathy and secondary structure predictions. A serine which has been identified as the residue which covalently links the protein to the 5' end of the genome in φ29, is conserved in all three phages. The hydropathic and secondary structural environment of this serine is similar in these phage terminal proteins and also similar to the linking serine of adenovirus terminal protein. PMID:3601672

  16. Two modes of protein sequence evolution and their compositional dependencies

    NASA Astrophysics Data System (ADS)

    Mannige, Ranjan V.

    2013-06-01

    Protein sequence evolution has resulted in a vast repertoire of molecular functionality crucial to life. Despite the central importance of sequence evolution to biology, our fundamental understanding of how sequence composition affects evolution is incomplete. This report describes the utilization of lattice model simulations of directed evolution, which indicate that, on average, peptide and protein evolvability is strongly dependent on initial sequence composition. The report also discusses two distinct regimes of sequence evolution by point mutation: (a) the “classical” mode where sequences “crawl” over free energy barriers towards acquiring a target fold, and (b) the “quantum” mode where sequences appear to “tunnel” through large energy barriers generally insurmountable by means of a crawl. Finally, the simulations indicate that oily and charged peptides are the most efficient substrates for evolution at the “classical” and “quantum” regimes, respectively, and that their respective response to temperature is commensurate with analogies made to barrier crossing in classical and quantum systems. On the whole, these results show that sequence composition can tune both the evolvability and the optimal mode of evolution of peptides and proteins.

  17. Increasing Sequence Diversity with Flexible Backbone Protein Design: The Complete Redesign of a Protein Hydrophobic Core

    SciTech Connect

    Murphy, Grant S.; Mills, Jeffrey L.; Miley, Michael J.; Machius, Mischa; Szyperski, Thomas; Kuhlman, Brian

    2015-10-15

    Protein design tests our understanding of protein stability and structure. Successful design methods should allow the exploration of sequence space not found in nature. However, when redesigning naturally occurring protein structures, most fixed backbone design algorithms return amino acid sequences that share strong sequence identity with wild-type sequences, especially in the protein core. This behavior places a restriction on functional space that can be explored and is not consistent with observations from nature, where sequences of low identity have similar structures. Here, we allow backbone flexibility during design to mutate every position in the core (38 residues) of a four-helix bundle protein. Only small perturbations to the backbone, 1–2 Å, were needed to entirely mutate the core. The redesigned protein, DRNN, is exceptionally stable (melting point >140 °C). An NMR and X-ray crystal structure show that the side chains and backbone were accurately modeled (all-atom RMSD = 1.3 Å).

  18. Nucleotide sequence of a cloned woodchuck hepatitis virus genome: comparison with the hepatitis B virus sequence.

    PubMed Central

    Galibert, F; Chen, T N; Mandart, E

    1982-01-01

    The complete nucleotide sequence of a woodchuck hepatitis virus genome cloned in Escherichia coli was determined by the method of Maxam and Gilbert. This sequence was found to be 3,308 nucleotides long. Potential ATG initiator triplets and nonsense codons were identified and used to locate regions with a substantial coding capacity. A striking similarity was observed between the organization of human hepatitis B virus and woodchuck hepatitis virus. Nucleotide sequences of these open regions in the woodchuck virus were compared with corresponding regions present in hepatitis B virus. This allowed the location of four viral genes on the L strand and indicated the absence of protein coded by the S strand. Evolution rates of the various parts of the genome as well as of the four different proteins coded by hepatitis B virus and woodchuck hepatitis virus were compared. These results indicated that: (i) the core protein has evolved slightly less rapidly than the other proteins; and (ii) when a region of DNA codes for two different proteins, there is less freedom for the DNA to evolve and, moreover, one of the proteins can evolve more rapidly than the other. A hairpin structure, very well conserved in the two genomes, was located in the only region devoid of coding function, suggesting the location of the origin of replication of the viral DNA. PMID:7086958

  19. A Fractal Dimension and Wavelet Transform Based Method for Protein Sequence Similarity Analysis.

    PubMed

    Yang, Lina; Tang, Yuan Yan; Lu, Yang; Luo, Huiwu

    2015-01-01

    One of the key tasks in bioinformatics and molecular biology is the similarity comparison of protein sequences, which helps the prediction and classification of protein structure and function. Efficiently finding similar proteins in a large-scale protein database remains a significant open issue. This paper presents a new distance-based protein similarity analysis using a new encoding method for protein sequences that is based on fractal dimension. The protein sequences are first represented as 1-dimensional feature vectors by their biochemical quantities. A hybrid method involving the discrete wavelet transform and fractal dimension calculation (HWF) with a sliding window is then applied to form the feature vector. Finally, the similarity calculation yields the distance matrix, from which the phylogenetic tree can be constructed. We apply this approach by analyzing the ND5 (NADH dehydrogenase subunit 5) protein cluster data set. The experimental results show that the proposed model is more accurate than existing ones such as Su's model, Zhang's model, Yao's model and MEGA software, and it is consistent with some known biological facts. PMID:26357222

  20. Using homology relations within a database markedly boosts protein sequence similarity search

    PubMed Central

    Tong, Jing; Sadreyev, Ruslan I.; Pei, Jimin; Kinch, Lisa N.; Grishin, Nick V.

    2015-01-01

    Inference of homology from protein sequences provides an essential tool for analyzing protein structure, function, and evolution. Current sequence-based homology search methods are still unable to detect many similarities evident from protein spatial structures. In computer science a search engine can be improved by considering networks of known relationships within the search database. Here, we apply this idea to protein-sequence–based homology search and show that it dramatically enhances the search accuracy. Our new method, COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query and hit’s known homologs. This approach increases detection quality, boosting the precision rate from 18% to 83% at half-coverage of all database homologs. The increased precision rate allows detection of a large fraction of protein structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities. The web server is available at prodata.swmed.edu/compadre. PMID:26038555

  1. A logical sequence search for S100B target proteins.

    PubMed Central

    McClintock, K. A.; Shaw, G. S.

    2000-01-01

    The EF-hand calcium-binding protein S100B has been shown to interact in vitro in a calcium-sensitive manner with many substrates. These potential S100B target proteins have been screened for the preservation of a previously identified consensus sequence across species. The results were compared to known structural and in vitro properties of the proteins to rationalize choices for potential binding partners. Our approach uncovered four oligomeric proteins tubulin (alpha and beta), glial fibrillary acidic protein (GFAP), desmin, and vimentin that have conserved regions matching the consensus sequence. In the type III intermediate filament proteins (GFAP, vimentin, and desmin), this region corresponds to a portion of a coiled-coil (helix 2A), the structural element responsible for their assembly. In tubulin, the sequence matches correspond to regions of alpha and beta tubulin found at the alpha beta tubulin interface. In both cases, these consensus sequence matches provide a logical explanation for in vitro observations that S100B is able to inhibit oligomerization of these proteins. PMID:11106180

  2. Comparison of solution-based exome capture methods for next generation sequencing

    PubMed Central

    2011-01-01

    Background Techniques enabling targeted re-sequencing of the protein coding sequences of the human genome on next generation sequencing instruments are of great interest. We conducted a systematic comparison of the solution-based exome capture kits provided by Agilent and Roche NimbleGen. A control DNA sample was captured with all four capture methods and prepared for Illumina GAII sequencing. Sequence data from additional samples prepared with the same protocols were also used in the comparison. Results We developed a bioinformatics pipeline for quality control, short read alignment, variant identification and annotation of the sequence data. In our analysis, a larger percentage of the high quality reads from the NimbleGen captures than from the Agilent captures aligned to the capture target regions. High GC content of the target sequence was associated with poor capture success in all exome enrichment methods. Comparison of mean allele balances for heterozygous variants indicated a tendency to have more reference bases than variant bases in the heterozygous variant positions within the target regions in all methods. There was virtually no difference in the genotype concordance compared to genotypes derived from SNP arrays. A minimum of 11× coverage was required to make a heterozygote genotype call with 99% accuracy when compared to common SNPs on genome-wide association arrays. Conclusions Libraries captured with NimbleGen kits aligned more accurately to the target regions. The updated NimbleGen kit most efficiently covered the exome with a minimum coverage of 20×, yet none of the kits captured all the Consensus Coding Sequence annotated exons. PMID:21955854

  3. Predicting protein disorder by analyzing amino acid sequence

    PubMed Central

    Yang, Jack Y; Yang, Mary Qu

    2008-01-01

    Background Many protein regions and some entire proteins have no definite tertiary structure, presenting instead as dynamic, disordered ensembles under different physicochemical circumstances. These proteins and regions are known as Intrinsically Unstructured Proteins (IUP). IUP have been associated with a wide range of protein functions, along with roles in diseases characterized by protein misfolding and aggregation. Results Identifying IUP is an important task in structural and functional genomics. We extract useful features from sequences and develop machine learning algorithms for this task. We compare our IUP predictor with PONDRs (mainly neural-network-based predictors), disEMBL (also based on neural networks) and Globplot (based on disorder propensity). Conclusion We find that augmenting features derived from physicochemical properties of amino acids (such as hydrophobicity, complexity etc.) and using an ensemble method proved beneficial. The IUP predictor is a viable alternative software tool for identifying IUP protein regions and proteins. PMID:18831799

  4. Correlated mutations in protein sequences: Phylogenetic and structural effects

    SciTech Connect

    Lapedes, A.S.; Giraud, B.G.; Stormo, G.D.

    1998-12-01

    Covariation analysis of sets of aligned sequences for RNA molecules is relatively successful in elucidating RNA secondary structure, as well as some aspects of tertiary structure. Covariation analysis of sets of aligned sequences for protein molecules is successful in certain instances in elucidating certain structural and functional links, but in general, pairs of sites displaying highly covarying mutations in protein sequences do not necessarily correspond to sites that are spatially close in the protein structure. In this paper the authors identify two reasons why naive use of covariation analysis for protein sequences fails to reliably indicate sequence positions that are spatially proximate. The first reason involves the bias introduced in calculation of covariation measures due to the fact that biological sequences are generally related by a non-trivial phylogenetic tree. The authors present a null-model approach to solve this problem. The second reason involves linked chains of covariation which can result in pairs of sites displaying significant covariation even though they are not spatially proximate. They present a maximum entropy solution to this classic problem of causation versus correlation. The methodologies are validated in simulation.
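
    As a rough illustration of the naive covariation measure the abstract warns about, the sketch below computes the mutual information between two columns of a small alignment. It is not the authors' null-model or maximum entropy method; the function name and the toy alignment are hypothetical, and no correction for phylogenetic relatedness is attempted.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j.

    `alignment` is a list of equal-length aligned sequences. This is the
    naive estimate: every sequence is treated as an independent sample,
    which is exactly the assumption that a shared phylogeny violates.
    """
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(n_ab * n / (count_i[a] * count_j[b]))
    return mi

# Hypothetical toy alignment: columns 0 and 1 covary perfectly,
# columns 0 and 2 vary independently.
toy_alignment = ["AKLE", "AKIE", "GRLD", "GRID"]
```

    In this toy case the score cannot tell whether columns 0 and 1 covary because the sites interact or because the first two and last two sequences are simply close relatives, which is the first source of bias the authors describe.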

  5. Sequence and structural analysis of BTB domain proteins

    PubMed Central

    Stogios, Peter J; Downs, Gregory S; Jauhal, Jimmy JS; Nandra, Sukhjeen K; Privé, Gilbert G

    2005-01-01

    Background The BTB domain (also known as the POZ domain) is a versatile protein-protein interaction motif that participates in a wide range of cellular functions, including transcriptional regulation, cytoskeleton dynamics, ion channel assembly and gating, and targeting proteins for ubiquitination. Several BTB domain structures have been experimentally determined, revealing a highly conserved core structure. Results We surveyed the protein architecture, genomic distribution and sequence conservation of BTB domain proteins in 17 fully sequenced eukaryotes. The BTB domain is typically found as a single copy in proteins that contain only one or two other types of domain, and this defines the BTB-zinc finger (BTB-ZF), BTB-BACK-kelch (BBK), voltage-gated potassium channel T1 (T1-Kv), MATH-BTB, BTB-NPH3 and BTB-BACK-PHR (BBP) families of proteins, among others. In contrast, the Skp1 and ElonginC proteins consist almost exclusively of the core BTB fold. There are numerous lineage-specific expansions of BTB proteins, as seen by the relatively large number of BTB-ZF and BBK proteins in vertebrates, MATH-BTB proteins in Caenorhabditis elegans, and BTB-NPH3 proteins in Arabidopsis thaliana. Using the structural homology between Skp1 and the PLZF BTB homodimer, we present a model of a BTB-Cul3 SCF-like E3 ubiquitin ligase complex that shows that the BTB dimer or the T1 tetramer is compatible in this complex. Conclusion Despite widely divergent sequences, the BTB fold is structurally well conserved. The fold has adapted to several different modes of self-association and interactions with non-BTB proteins. PMID:16207353

  6. Single-molecule protein sequencing through fingerprinting: computational assessment

    NASA Astrophysics Data System (ADS)

    Yao, Yao; Docter, Margreet; van Ginkel, Jetty; de Ridder, Dick; Joo, Chirlmin

    2015-10-01

    Proteins are vital in all biological systems as they constitute the main structural and functional components of cells. Recent advances in mass spectrometry have brought the promise of complete proteomics by helping draft the human proteome. Yet, this commonly used protein sequencing technique has fundamental limitations in sensitivity. Here we propose a method for single-molecule (SM) protein sequencing. A major challenge lies in the fact that proteins are composed of 20 different amino acids, which demands 20 molecular reporters. We computationally demonstrate that it suffices to measure only two types of amino acids to identify proteins and suggest an experimental scheme using SM fluorescence. When achieved, this highly sensitive approach will result in a paradigm shift in proteomics, with major impact in the biological and medical sciences.
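
    The idea of identifying a protein from only two kinds of residue can be sketched as follows. The abstract does not say which two amino acids are measured, so cysteine and lysine are an assumption here, as are the function names and the toy database; the sketch also ignores measurement error.

```python
from collections import defaultdict

def fingerprint(sequence, labeled=("C", "K")):
    """Keep only the labeled residues, in order of appearance."""
    return "".join(res for res in sequence if res in labeled)

def group_by_fingerprint(proteins, labeled=("C", "K")):
    """Map each fingerprint to the names of the proteins that produce it."""
    groups = defaultdict(list)
    for name, seq in proteins.items():
        groups[fingerprint(seq, labeled)].append(name)
    return dict(groups)

# Hypothetical toy database (not real proteins).
toy_db = {
    "protA": "MKTCAYKC",
    "protB": "MCKKTAYC",
    "protC": "MAKTCGYKCL",
}
groups = group_by_fingerprint(toy_db)
```

    A protein is identifiable in this scheme when its fingerprint maps to a single name. Here protA and protC share a fingerprint, so they could not be told apart from residue order alone.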

  7. A novel predictor for protein structural class based on integrated information of the secondary structure sequence.

    PubMed

    Zhang, Lichao; Zhao, Xiqiang; Kong, Liang; Liu, Shuxia

    2014-08-01

    The structural class has become one of the most important features for characterizing the overall folding type of a protein and played important roles in many aspects of protein research. At present, it is still a challenging problem to accurately predict protein structural class for low-similarity sequences. In this study, an 18-dimensional integrated feature vector is proposed by fusing the information about content and position of the predicted secondary structure elements. The consistently high accuracies of jackknife and 10-fold cross-validation tests on different low-similarity benchmark datasets show that the proposed method is reliable and stable. Comparison of our results with other methods demonstrates that our method is an effective computational tool for protein structural class prediction, especially for low-similarity sequences. PMID:24859536

  8. Comprehensive analysis of sequences of a protein switch.

    PubMed

    Chen, Szu-Hua; Meller, Jaroslaw; Elber, Ron

    2016-01-01

    Switches form a special class of proteins that dramatically change their three-dimensional structures upon a small perturbation. One possible perturbation that we explore is that of a single point mutation. Building on the pioneering experimental work of Alexander et al. (Alexander et al. PNAS, 2007; 104,11963-11968) that determines switch sequences between α and α+β folds we conduct a comprehensive sequence sampling by a Markov Chain with multiple fitness criteria to identify new switches given the experimental folds. We screen for switch sequences using a combination of contact potential, secondary structure prediction, and finally molecular dynamics simulations. Statistical properties of switch sequences are discussed and illustrated to be most sensitive to mutation at the N- and C- termini of the switch protein. Based on this analysis, a particularly stable putative switch pair is identified and proposed for further experimental analysis. PMID:26073558

  9. Structure and Sequence Search on Aptamer-Protein Docking

    NASA Astrophysics Data System (ADS)

    Xiao, Jiajie; Bonin, Keith; Guthold, Martin; Salsbury, Freddie

    2015-03-01

    Interactions between proteins and deoxyribonucleic acid (DNA) play a significant role in living systems, especially through gene regulation. However, short nucleic acid sequences (aptamers) with specific binding affinity to specific proteins exhibit clinical potential as therapeutics. Our capillary and gel electrophoresis selection experiments show that specific sequences of aptamers can be selected that bind specific proteins. Computationally, given the experimentally-determined structure and sequence of a thrombin-binding aptamer, we can successfully dock the aptamer onto thrombin in agreement with experimental structures of the complex. In order to further study the conformational flexibility of this thrombin-binding aptamer and to potentially develop a predictive computational model of aptamer-binding, we use GPU-enabled molecular dynamics simulations to both examine the conformational flexibility of the aptamer in the absence of binding to thrombin, and to determine our ability to fold an aptamer. This study should help further de-novo predictions of aptamer sequences by enabling the study of structural and sequence-dependent effects on aptamer-protein docking specificity.

  10. Nucleotide sequence of the L1 ribosomal protein gene of Xenopus laevis: remarkable sequence homology among introns.

    PubMed Central

    Loreni, F; Ruberti, I; Bozzoni, I; Pierandrei-Amaldi, P; Amaldi, F

    1985-01-01

    Ribosomal protein L1 is encoded by two genes in Xenopus laevis. The comparison of two cDNA sequences shows that the two L1 gene copies (L1a and L1b) have diverged in many silent sites and very few substitution sites; moreover a small duplication occurred at the very end of the coding region of the L1b gene which thus codes for a product five amino acids longer than that coded by L1a. Quantitatively the divergence between the two L1 genes confirms that a whole genome duplication took place in Xenopus laevis approximately 30 million years ago. A genomic fragment containing one of the two L1 gene copies (L1a), with its nine introns and flanking regions, has been completely sequenced. The 5' end of this gene has been mapped within a 20-pyrimidine stretch as already found for other vertebrate ribosomal protein genes. Four of the nine introns have a 60-nucleotide sequence with 80% homology; within this region some boxes, one of which is 16 nucleotides long, are 100% homologous among the four introns. This feature of L1a gene introns is interesting since we have previously shown that the activity of this gene is regulated at a post-transcriptional level and it involves the block of the normal splicing of some intron sequences. PMID:3841512

  11. A Protein Deep Sequencing Evaluation of Metastatic Melanoma Tissues

    PubMed Central

    Welinder, Charlotte; Pawłowski, Krzysztof; Sugihara, Yutaka; Yakovleva, Maria; Jönsson, Göran; Ingvar, Christian; Lundgren, Lotta; Baldetorp, Bo; Olsson, Håkan; Rezeli, Melinda; Jansson, Bo; Laurell, Thomas; Fehniger, Thomas; Döme, Balazs; Malm, Johan; Wieslander, Elisabet; Nishimura, Toshihide; Marko-Varga, György

    2015-01-01

    Malignant melanoma has the highest increase of incidence of malignancies in the western world. In early stages, front line therapy is surgical excision of the primary tumor. Metastatic disease has very limited possibilities for cure. Recently, several protein kinase inhibitors and immune modifiers have shown promising clinical results but drug resistance in metastasized melanoma remains a major problem. The need for routine clinical biomarkers to follow disease progression and treatment efficacy is high. The aim of the present study was to build a protein sequence database in metastatic melanoma, searching for novel, relevant biomarkers. Ten lymph node metastases (South-Swedish Malignant Melanoma Biobank) were subjected to global protein expression analysis using two proteomics approaches (with/without orthogonal fractionation). Fractionation produced higher numbers of protein identifications (4284). Combining both methods, 5326 unique proteins were identified (2641 proteins overlapping). Deep mining proteomics may contribute to the discovery of novel biomarkers for metastatic melanoma, for example dividing the samples into two metastatic melanoma “genomic subtypes”, (“pigmentation” and “high immune”) revealed several proteins showing differential levels of expression. In conclusion, the present study provides an initial version of a metastatic melanoma protein sequence database producing a total of more than 5000 unique protein identifications. The raw data have been deposited to the ProteomeXchange with identifiers PXD001724 and PXD001725. PMID:25874936

  12. MannDB – A microbial database of automated protein sequence analyses and evidence integration for protein characterization

    PubMed Central

    Zhou, Carol L Ecale; Lam, Marisa W; Smith, Jason R; Zemla, Adam T; Dyer, Matthew D; Kuczmarski, Thomas A; Vitalis, Elizabeth A; Slezak, Thomas R

    2006-01-01

    representing organisms listed as high-priority agents on the websites of several governmental organizations concerned with bio-terrorism. MannDB provides the user with a BLAST interface for comparison of native and non-native sequences and a query tool for conveniently selecting proteins of interest. In addition, the user has access to a web-based browser that compiles comprehensive and extensive reports. Access to MannDB is freely available at . PMID:17044936

  13. How does a simplified-sequence protein fold?

    PubMed

    Guarnera, Enrico; Pellarin, Riccardo; Caflisch, Amedeo

    2009-09-16

    To investigate a putatively primordial protein we have simplified the sequence of a 56-residue alpha/beta fold (the immunoglobulin-binding domain of protein G) by replacing it with polyalanine, polythreonine, and diglycine segments at regions of the sequence that in the folded structure are alpha-helical, beta-strand, and turns, respectively. Remarkably, multiple folding and unfolding events are observed in a 15-μs molecular dynamics simulation at 330 K. The most stable state (populated at approximately 20%) of the simplified-sequence variant of protein G has the same alpha/beta topology as the wild-type but shows the characteristics of a molten globule, i.e., loose contacts among side chains and lack of a specific hydrophobic core. The unfolded state is heterogeneous and includes a variety of alpha/beta topologies but also fully alpha-helical and fully beta-sheet structures. Transitions within the denatured state are very fast, and the molten-globule state is reached in <1 μs by a framework mechanism of folding with multiple pathways. The native structure of the wild-type is more rigid than the molten-globule conformation of the simplified-sequence variant. The difference in structural stability and the very fast folding of the simplified protein suggest that evolution has enriched the primordial alphabet of amino acids mainly to optimize protein function by stabilization of a unique structure with specific tertiary interactions. PMID:19751679

  14. How Does a Simplified-Sequence Protein Fold?

    PubMed Central

    Guarnera, Enrico; Pellarin, Riccardo; Caflisch, Amedeo

    2009-01-01

    To investigate a putatively primordial protein we have simplified the sequence of a 56-residue α/β fold (the immunoglobulin-binding domain of protein G) by replacing it with polyalanine, polythreonine, and diglycine segments at regions of the sequence that in the folded structure are α-helical, β-strand, and turns, respectively. Remarkably, multiple folding and unfolding events are observed in a 15-μs molecular dynamics simulation at 330 K. The most stable state (populated at ∼20%) of the simplified-sequence variant of protein G has the same α/β topology as the wild-type but shows the characteristics of a molten globule, i.e., loose contacts among side chains and lack of a specific hydrophobic core. The unfolded state is heterogeneous and includes a variety of α/β topologies but also fully α-helical and fully β-sheet structures. Transitions within the denatured state are very fast, and the molten-globule state is reached in <1 μs by a framework mechanism of folding with multiple pathways. The native structure of the wild-type is more rigid than the molten-globule conformation of the simplified-sequence variant. The difference in structural stability and the very fast folding of the simplified protein suggest that evolution has enriched the primordial alphabet of amino acids mainly to optimize protein function by stabilization of a unique structure with specific tertiary interactions. PMID:19751679

  15. Extracting protein alignment models from the sequence database.

    PubMed Central

    Neuwald, A F; Liu, J S; Lipman, D J; Lawrence, C E

    1997-01-01

    Biologists often gain structural and functional insights into a protein sequence by constructing a multiple alignment model of the family. Here a program called Probe fully automates this process of model construction starting from a single sequence. Central to this program is a powerful new method to locate and align only those, often subtly, conserved patterns essential to the family as a whole. When applied to randomly chosen proteins, Probe found on average about four times as many relationships as a pairwise search and yielded many new discoveries. These include: an obscure subfamily of globins in the roundworm Caenorhabditis elegans ; two new superfamilies of metallohydrolases; a lipoyl/biotin swinging arm domain in bacterial membrane fusion proteins; and a DH domain in the yeast Bud3 and Fus2 proteins. By identifying distant relationships and merging families into superfamilies in this way, this analysis further confirms the notion that proteins evolved from relatively few ancient sequences. Moreover, this method automatically generates models of these ancient conserved regions for rapid and sensitive screening of sequences. PMID:9108146

  16. Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks

    PubMed Central

    Ma, Qicheng; Chirn, Gung-Wei; Cai, Richard; Szustakowski, Joseph D; Nirmala, NR

    2005-01-01

    Background The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes. Results Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same Interpro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric. Conclusion Benchmarking studies of our algorithm versus those reported in the literature show that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozens of prokaryotic genomes. PMID:16202129

  17. Homolog detection using global sequence properties suggests an alternate view of structural encoding in protein sequences

    PubMed Central

    Scheraga, Harold A.; Rackovsky, S.

    2014-01-01

    We show that a Fourier-based sequence distance function is able to identify structural homologs of target sequences with high accuracy. It is shown that Fourier distances correlate very strongly with independently determined structural distances between molecules, a property of the method that is not attainable using conventional representations. It is further shown that the ability of the Fourier approach to identify protein folds is statistically far in excess of random expectation. It is then shown that, in actual searches for structural homologs of selected target sequences, the Fourier approach gives excellent results. On the basis of these results, we suggest that the global information detected by the Fourier representation is an essential feature of structure encoding in protein sequences and a key to structural homology detection. PMID:24706836
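
    To make the notion of a Fourier-based sequence distance concrete, here is a minimal sketch under stated assumptions. It is not the authors' distance function: it encodes each residue with a single property (the Kyte-Doolittle hydropathy scale, chosen here only for illustration), keeps the magnitudes of the first few discrete Fourier coefficients, and compares two sequences by Euclidean distance. All names and the `n_coeffs` parameter are hypothetical.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values, used here as an illustrative encoding.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def fourier_magnitudes(sequence, n_coeffs=4):
    """Length-normalized magnitudes of the first n_coeffs DFT coefficients."""
    values = [HYDROPATHY[res] for res in sequence]
    n = len(values)
    magnitudes = []
    for k in range(n_coeffs):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        magnitudes.append(abs(coeff) / n)
    return magnitudes

def fourier_distance(seq_a, seq_b, n_coeffs=4):
    """Euclidean distance between the two magnitude vectors."""
    fa = fourier_magnitudes(seq_a, n_coeffs)
    fb = fourier_magnitudes(seq_b, n_coeffs)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(fa, fb)))
```

    Because only magnitudes are kept, the distance depends on global periodic properties of the whole sequence rather than on a residue-by-residue alignment, and it is unchanged by a circular shift of either sequence. That is a simplification of this sketch, not a claim about the published method.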

  18. Purification and sequencing of the active site tryptic peptide from penicillin-binding protein 1b of Escherichia coli

    SciTech Connect

    Nicholas, R.A.; Suzuki, H.; Hirota, Y.; Strominger, J.L.

    1985-07-02

    This paper reports the sequence of the active site peptide of penicillin-binding protein 1b from Escherichia coli. Purified penicillin-binding protein 1b was labeled with [¹⁴C]penicillin G, digested with trypsin, and partially purified by gel filtration. Upon further purification by high-pressure liquid chromatography, two radioactive peaks were observed, and the major peak, representing over 75% of the applied radioactivity, was submitted to amino acid analysis and sequencing. The sequence Ser-Ile-Gly-Ser-Leu-Ala-Lys was obtained. The active site nucleophile was identified by digesting the purified peptide with aminopeptidase M and separating the radioactive products on high-pressure liquid chromatography. Amino acid analysis confirmed that the serine residue in the middle of the sequence was covalently bonded to the [¹⁴C]penicilloyl moiety. A comparison of this sequence to active site sequences of other penicillin-binding proteins and beta-lactamases is presented.

  19. Determinants of the rate of protein sequence evolution

    PubMed Central

    Zhang, Jianzhi; Yang, Jian-Rong

    2015-01-01

    The rate and mechanism of protein sequence evolution have been central questions in evolutionary biology since the 1960s. Although the rate of protein sequence evolution depends primarily on the level of functional constraint, exactly what constitutes functional constraint has remained unclear. The increasing availability of genomic data has allowed for much needed empirical examinations on the nature of functional constraint. These studies found that the evolutionary rate of a protein is predominantly influenced by its expression level rather than functional importance. A combination of theoretical and empirical analyses has identified multiple mechanisms behind these observations and demonstrated a prominent role that selection against errors in molecular and cellular processes plays in protein evolution. PMID:26055156

  20. Co-evolution of metabolism and protein sequences.

    PubMed

    Schütte, Moritz; Klitgord, Niels; Segrè, Daniel; Ebenhöh, Oliver

    2010-01-01

    The set of chemicals producible and usable by metabolic pathways must have evolved in parallel with the enzymes that catalyze them. One implication of this common historical path should be a correspondence between the innovation steps that gradually added new metabolic reactions to the biosphere-level biochemical toolkit, and the gradual sequence changes that must have slowly shaped the corresponding enzyme structures. However, global signatures of a long-term co-evolution have not been identified. Here we search for such signatures by computing correlations between inter-reaction distances on a metabolic network, and sequence distances of the corresponding enzyme proteins. We perform our calculations using the set of all known metabolic reactions, available from the KEGG database. Reaction-reaction distance on the metabolic network is computed as the length of the shortest path on a projection of the metabolic network, in which nodes are reactions and edges indicate whether two reactions share a common metabolite, after removal of cofactors. Estimating the distance between enzyme sequences in a meaningful way requires some special care: for each enzyme commission (EC) number, we select from KEGG a consensus set of protein sequences using the cluster of orthologous groups of proteins (COG) database. We define the evolutionary distance between protein sequences as an asymmetric transition probability between two enzymes, derived from the corresponding pair-wise BLAST scores. By comparing the distances between sequences to the minimal distances on the metabolic reaction graph, we find a small but statistically significant correlation between the two measures. This suggests that the evolutionary walk in enzyme sequence space has locally mirrored, to some extent, the gradual expansion of metabolism. PMID:20238426
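
    The reaction-reaction distance described above (shortest path on a graph whose nodes are reactions and whose edges join reactions sharing a metabolite, after removing cofactors) can be sketched with a breadth-first search. The reaction names, metabolites and cofactor set below are hypothetical toy data, not taken from KEGG.

```python
from collections import deque

def reaction_graph(reactions, cofactors=frozenset()):
    """Connect two reactions when they share a non-cofactor metabolite."""
    cleaned = {r: set(mets) - set(cofactors) for r, mets in reactions.items()}
    graph = {r: set() for r in cleaned}
    names = list(cleaned)
    for idx, r1 in enumerate(names):
        for r2 in names[idx + 1:]:
            if cleaned[r1] & cleaned[r2]:
                graph[r1].add(r2)
                graph[r2].add(r1)
    return graph

def reaction_distance(graph, source, target):
    """Length of the shortest path between two reactions, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # the two reactions are not connected

# Hypothetical toy network.
toy_reactions = {
    "R1": {"A", "B", "ATP"},
    "R2": {"B", "C"},
    "R3": {"C", "D", "ATP"},
    "R4": {"E", "F"},
}
with_cofactors = reaction_graph(toy_reactions)
without_cofactors = reaction_graph(toy_reactions, cofactors={"ATP"})
```

    In the toy network R1 and R3 are direct neighbors only through ATP. Once ATP is treated as a cofactor and removed, their distance grows from 1 to 2, which illustrates why cofactors are removed before distances are measured.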

  1. Bioinformatics comparison of sulfate-reducing metabolism nucleotide sequences

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Nguyen, A.; Cheung, E.; Sullivan, R.; Holden, T.; Lieberman, D.; Cheung, T.

    2015-09-01

    The sulfate-reducing bacteria can be traced back to 3.5 billion years ago. The thermodynamics details of the sulfur cycle have been well documented. A recent sulfate-reducing bacteria report (Robator, Jungbluth, et al., 2015 Jan, Front. Microbiol.) with Genbank nucleotide data has been analyzed in terms of the sulfite reductase (dsrAB) via fractal dimension and entropy values. Comparison to oil field sulfate-reducing sequences was included. The AUCG translational mass fractal dimension versus ATCG transcriptional mass fractal dimension for the low temperature dsrB and dsrA sequences reported in Reference Thirteen shows correlation R-sq ~ 0.79, with a probability of about 3% in simulation. A recent report of using Cystathionine gamma-lyase sequence to produce CdS quantum dot in a biological method, where the sulfur is reduced just like in the H2S production process, was included for comparison. The AUCG mass fractal dimension versus ATCG mass fractal dimension for the Cystathionine gamma-lyase sequences was found to have R-sq of 0.72, similar to the low temperature dissimilatory sulfite reductase dsr group with 3% probability, in contrast to the oil field group having R-sq ~ 0.94, a highly probable outcome in the simulation. The other two simulation histograms, namely, fractal dimension versus entropy R-sq outcome values, and di-nucleotide entropy versus mono-nucleotide entropy R-sq outcome values are also discussed in the data analysis focusing on low probability outcomes.
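
    The mono-nucleotide and di-nucleotide entropies mentioned above can be illustrated with a Shannon entropy over k-mer frequencies. The abstract does not state how the windows are taken, so the overlapping windows used here are an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter

def kmer_entropy(sequence, k=1):
    """Shannon entropy (in bits) of overlapping k-mer frequencies.

    k=1 gives the mono-nucleotide entropy and k=2 the di-nucleotide
    entropy. Overlapping windows are an assumption of this sketch.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

mono = kmer_entropy("ACGTACGT", k=1)  # four equally frequent bases
di = kmer_entropy("ACGTACGT", k=2)    # entropy of the seven overlapping pairs
```

    Computing both values for each sequence in a set gives the pairs whose correlation (R-sq) the abstract refers to.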

  2. Intermediate divergence levels maximize the strength of structure-sequence correlations in enzymes and viral proteins.

    PubMed

    Jackson, Eleisha L; Shahmoradi, Amir; Spielman, Stephanie J; Jack, Benjamin R; Wilke, Claus O

    2016-07-01

    Structural properties such as solvent accessibility and contact number predict site-specific sequence variability in many proteins. However, the strength and significance of these structure-sequence relationships vary widely among different proteins, with absolute correlation strengths ranging from 0 to 0.8. In particular, two recent works have made contradictory observations. Yeh et al. (Mol. Biol. Evol. 31:135-139, 2014) found that both relative solvent accessibility (RSA) and weighted contact number (WCN) are good predictors of sitewise evolutionary rate in enzymes, with WCN clearly out-performing RSA. Shahmoradi et al. (J. Mol. Evol. 79:130-142, 2014) considered these same predictors (as well as others) in viral proteins and found much weaker correlations and no clear advantage of WCN over RSA. Because these two studies had substantial methodological differences, however, a direct comparison of their results is not possible. Here, we reanalyze the datasets of the two studies with one uniform analysis pipeline, and we find that many apparent discrepancies between the two analyses can be attributed to the extent of sequence divergence in individual alignments. Specifically, the alignments of the enzyme dataset are much more diverged than those of the virus dataset, and proteins with higher divergence exhibit, on average, stronger structure-sequence correlations. However, the highest structure-sequence correlations are observed at intermediate divergence levels, where both highly conserved and highly variable sites are present in the same alignment. PMID:26971720
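
    The abstract does not define weighted contact number. A commonly used definition, assumed here, sums the inverse squared distances from one residue to every other residue in the structure. The sketch below applies that definition to hypothetical coordinates (one point per residue, for example the C-alpha atom).

```python
def weighted_contact_number(coords):
    """WCN of each residue: sum over all other residues of 1 / r_ij**2.

    `coords` is a list of (x, y, z) tuples, one per residue. The inverse
    squared distance weighting is an assumed, commonly used definition.
    """
    wcn = []
    for i, (xi, yi, zi) in enumerate(coords):
        total = 0.0
        for j, (xj, yj, zj) in enumerate(coords):
            if i != j:
                dist_sq = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
                total += 1.0 / dist_sq
        wcn.append(total)
    return wcn

# Hypothetical three-residue chain on a line, one unit apart.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_wcn = weighted_contact_number(toy_coords)
```

    The middle residue receives the highest value because it is closest to the others, which matches the intuition that densely packed, buried sites have high WCN and tend to be more conserved.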

  3. Automatic generation of primary sequence patterns from sets of related protein sequences.

    PubMed

    Smith, R F; Smith, T F

    1990-01-01

    We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common "root" pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a "pay once" gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence data base. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern. PMID:2296575
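    Step (ii) above, choosing the minimal inclusive amino acid class for each aligned pair, can be sketched as below. This is only an illustration: the nested class hierarchy, the lowercase class symbols, and the "." gap symbol are hypothetical stand-ins, not the hierarchy or notation used by the authors, and the input is assumed to be an already aligned pair of equal length.

```python
# Hypothetical nested hierarchy of amino acid classes (NOT the authors' own).
# Every class is either disjoint from or fully contained in a larger class.
CLASSES = [
    ("l", set("ILV")),                   # aliphatic
    ("a", set("FWY")),                   # aromatic
    ("h", set("ILVFWYMC")),              # hydrophobic (contains l and a)
    ("b", set("KRH")),                   # basic
    ("d", set("DE")),                    # acidic
    ("c", set("KRHDE")),                 # charged (contains b and d)
    ("p", set("KRHDESTNQ")),             # polar (contains c)
    ("x", set("ACDEFGHIKLMNPQRSTVWY")),  # any amino acid
]
GAP = "-"


def cover(x, y):
    """Return the smallest class symbol covering both aligned elements."""
    if GAP in (x, y):
        return "."  # hypothetical symbol for a gapped position
    if x == y:
        return x    # identical residues are kept as-is
    for symbol, members in sorted(CLASSES, key=lambda c: len(c[1])):
        if x in members and y in members:
            return symbol
    raise ValueError(f"no covering class for {x!r} and {y!r}")


def covering_pattern(seq_a, seq_b):
    """Build one common pattern from two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(cover(x, y) for x, y in zip(seq_a, seq_b))


print(covering_pattern("KLVDF", "RIVEY"))  # prints "blVda"
```

    Repeating this reduction up the dendrogram, pairing each new pattern with the next nearest sequence or pattern, would yield the single root pattern the abstract describes; handling pattern symbols as inputs is omitted here for brevity.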

  4. Rapid Evolution of Virus Sequences in Intrinsically Disordered Protein Regions

    PubMed Central

    Gitlin, Leonid; Hagai, Tzachi; LaBarbera, Anthony; Solovey, Mark; Andino, Raul

    2014-01-01

    Nodamura Virus (NoV) is a nodavirus originally isolated from insects that can replicate in a wide variety of hosts, including mammals. Because of their simplicity and ability to replicate in many diverse hosts, NoV, and the Nodaviridae in general, provide a unique window into the evolution of viruses and host-virus interactions. Here we show that the C-terminus of the viral polymerase exhibits extreme structural and evolutionary flexibility. Indeed, fewer than 10 positively charged residues from the 110 amino acid-long C-terminal region of protein A are required to support RNA1 replication. Strikingly, this region can be replaced by completely unrelated protein sequences, yet still produce a functional replicase. Structure predictions, as well as evolutionary and mutational analyses, indicate that the C-terminal region is structurally disordered and evolves faster than the rest of the viral proteome. Thus, the function of an intrinsically unstructured protein region can be independent of most of its primary sequence, conferring both functional robustness and sequence plasticity on the protein. Our results provide an experimental explanation for rapid evolution of unstructured regions, which enables an effective exploration of the sequence space, and likely function space, available to the virus. PMID:25502394

  5. nWayComp: a genome-wide sequence comparison tool for multiple strains/species of phylogenetically related microorganisms.

    PubMed

    Yao, Jiqiang; Lin, Hong; Doddapaneni, Harshavardhan; Civerolo, Edwin L

    2007-01-01

    The increasing number of whole genomic sequences of microorganisms has increased the complexity of genome-wide annotation and gene sequence comparison among multiple microorganisms. To address this problem, we have developed nWayComp software that compares DNA and protein sequences of phylogenetically-related microorganisms. This package integrates a series of bioinformatics tools such as BLAST, ClustalW, ALIGN, PHYLIP and PRIMER3 for sequence comparison. It searches for homologous sequences among multiple organisms and identifies genes that are unique to a particular organism. The homologous gene sets are then ranked in descending order of sequence similarity. For each set of homologous sequences, a table of sequence identity among homologous genes along with sequence variations such as SNPs and INDELS is developed, and a phylogenetic tree is constructed. In addition, a common set of primers that can amplify all the homologous sequences is generated. The nWayComp package provides users with a quick and convenient tool to compare genomic sequences among multiple organisms at the whole-genome level. PMID:17688445

  6. Sequence heterogeneity accelerates protein search for targets on DNA

    NASA Astrophysics Data System (ADS)

    Shvets, Alexey A.; Kolomeisky, Anatoly B.

    2015-12-01

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome.

  7. Sequence heterogeneity accelerates protein search for targets on DNA

    SciTech Connect

    Shvets, Alexey A.; Kolomeisky, Anatoly B.

    2015-12-28

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome.

  8. Comparison of DNA Quantification Methods for Next Generation Sequencing

    PubMed Central

    Robin, Jérôme D.; Ludlow, Andrew T.; LaRanger, Ryan; Wright, Woodring E.; Shay, Jerry W.

    2016-01-01

    Next Generation Sequencing (NGS) is a powerful tool that depends on loading a precise amount of DNA onto a flowcell. NGS strategies have expanded our ability to investigate genomic phenomena by referencing mutations in cancer and diseases through large-scale genotyping, developing methods to map rare chromatin interactions (4C; 5C and Hi-C) and identifying chromatin features associated with regulatory elements (ChIP-seq, Bis-Seq, ChiA-PET). While many methods are available for DNA library quantification, there is no unambiguous gold standard. Most techniques use PCR to amplify DNA libraries to obtain sufficient quantities for optical density measurement. However, increased PCR cycles can distort the library’s heterogeneity and prevent the detection of rare variants. In this analysis, we compared new digital PCR technologies (droplet digital PCR; ddPCR, ddPCR-Tail) with standard methods for the titration of NGS libraries. DdPCR-Tail is comparable to qPCR and fluorometry (QuBit) and allows sensitive quantification by analysis of barcode repartition after sequencing of multiplexed samples. This study provides a direct comparison between quantification methods throughout a complete sequencing experiment and provides the impetus to use ddPCR-based quantification for improvement of NGS quality. PMID:27048884

  9. Comparison of DNA Quantification Methods for Next Generation Sequencing.

    PubMed

    Robin, Jérôme D; Ludlow, Andrew T; LaRanger, Ryan; Wright, Woodring E; Shay, Jerry W

    2016-01-01

    Next Generation Sequencing (NGS) is a powerful tool that depends on loading a precise amount of DNA onto a flowcell. NGS strategies have expanded our ability to investigate genomic phenomena by referencing mutations in cancer and diseases through large-scale genotyping, developing methods to map rare chromatin interactions (4C; 5C and Hi-C) and identifying chromatin features associated with regulatory elements (ChIP-seq, Bis-Seq, ChiA-PET). While many methods are available for DNA library quantification, there is no unambiguous gold standard. Most techniques use PCR to amplify DNA libraries to obtain sufficient quantities for optical density measurement. However, increased PCR cycles can distort the library's heterogeneity and prevent the detection of rare variants. In this analysis, we compared new digital PCR technologies (droplet digital PCR; ddPCR, ddPCR-Tail) with standard methods for the titration of NGS libraries. DdPCR-Tail is comparable to qPCR and fluorometry (QuBit) and allows sensitive quantification by analysis of barcode repartition after sequencing of multiplexed samples. This study provides a direct comparison between quantification methods throughout a complete sequencing experiment and provides the impetus to use ddPCR-based quantification for improvement of NGS quality. PMID:27048884

  10. Protein landscape at Drosophila melanogaster telomere-associated sequence repeats.

    PubMed

    Antão, José M; Mason, James M; Déjardin, Jérôme; Kingston, Robert E

    2012-06-01

    The specific set of proteins bound at each genomic locus contributes decisively to regulatory processes and to the identity of a cell. Understanding of the function of a particular locus requires the knowledge of what factors interact with that locus and how the protein composition changes in different cell types or during the response to internal and external signals. Proteomic analysis of isolated chromatin segments (PICh) was developed as a tool to target, purify, and identify proteins associated with a defined locus and was shown to allow the purification of human telomeric chromatin. Here we have developed this method to identify proteins that interact with the Drosophila telomere-associated sequence (TAS) repeats. Several of the purified factors were validated as novel TAS-bound proteins by chromatin immunoprecipitation, and the Brahma complex was confirmed as a dominant modifier of telomeric position effect through the use of a genetic test. These results offer information on the efficacy of applying the PICh protocol to loci with sequence more complex than that found at human telomeres and identify proteins that bind to the TAS repeats, which might contribute to TAS biology and chromatin silencing. PMID:22493064

  11. Will my protein crystallize? A sequence-based predictor.

    PubMed

    Smialowski, Pawel; Schmidt, Thorsten; Cox, Jürgen; Kirschner, Andreas; Frishman, Dmitrij

    2006-02-01

    We propose a machine-learning approach to sequence-based prediction of protein crystallizability in which we exploit subtle differences between proteins whose structures were solved by X-ray analysis [or by both X-ray and nuclear magnetic resonance (NMR) spectroscopy] and those proteins whose structures were solved by NMR spectroscopy alone. Because the NMR technique is usually applied on relatively small proteins, sequence length distributions of the X-ray and NMR datasets were adjusted to avoid predictions biased by protein size. As feature space for classification, we used frequencies of mono-, di-, and tripeptides represented by the original 20-letter amino acid alphabet as well as by several reduced alphabets in which amino acids were grouped by their physicochemical and structural properties. The classification algorithm was constructed as a two-layered structure in which the output of primary support vector machine classifiers operating on peptide frequencies was combined by a second-level Naive Bayes classifier. Due to the application of metamethods for cost sensitivity, our method is able to handle real datasets with unbalanced class representation. An overall prediction accuracy of 67% [65% on the positive (crystallizable) and 69% on the negative (noncrystallizable) class] was achieved in a 10-fold cross-validation experiment, indicating that the proposed algorithm may be a valuable tool for more efficient target selection in structural genomics. A Web server for protein crystallizability prediction called SECRET is available at http://webclu.bio.wzw.tum.de:8080/secret. PMID:16315316
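    The feature space described above (frequencies of mono-, di-, and tripeptides over the full and reduced alphabets) can be sketched as follows. The three-group reduced alphabet shown here is a hypothetical example chosen for illustration; the abstract does not list the groupings actually used, and the classifier stages are not reproduced.

```python
from collections import Counter

# Hypothetical reduced alphabet: hydrophobic (h), polar (p), charged (c).
GROUPS = {"h": "AVLIMFWC", "p": "GSTYNQP", "c": "DEKRH"}
REDUCED = {aa: symbol for symbol, residues in GROUPS.items() for aa in residues}


def kmer_frequencies(seq, k, mapping=None):
    """Return relative frequencies of all overlapping k-peptides in seq.

    If mapping is given, each residue is first translated into the
    reduced alphabet before k-peptides are counted.
    """
    if mapping is not None:
        seq = "".join(mapping[aa] for aa in seq)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


seq = "MKVLAAGIE"
print(kmer_frequencies(seq, 1))           # monopeptides, 20-letter alphabet
print(kmer_frequencies(seq, 2, REDUCED))  # dipeptides, reduced alphabet
```

    A reduced alphabet keeps the tripeptide feature space small (27 features for three groups rather than 8,000 for 20 letters), which is one plausible reason to combine both representations.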

  12. FAB overlapping: a strategy for sequencing homologous proteins

    NASA Astrophysics Data System (ADS)

    Ferranti, P.; Malorni, A.; Marino, G.; Pucci, P.; di Luccia, A.; Ferrara, L.

    1991-12-01

    Extensive similarity has been shown to exist between the primary structures of closely related proteins from different species, the only differences being restricted to a few amino acid variations. A new mass spectrometric procedure, which has been called FAB-overlapping, has been developed for sequencing highly homologous proteins based on the detection of these small differences as compared with a known protein used as a reference. Several complementary peptide maps are constructed using fast atom bombardment mass spectrometry (FAB-MS) analysis of different proteolytic digests of the unknown protein and the mass values are related to those expected on the basis of the sequence of the reference protein. The mass signals exhibiting unusual mass values identify those regions where variations have taken place; fine location of the mutations can be obtained by coupling simple protein chemistry methodologies with FAB-MS. Using the FAB-overlapping procedure, it was possible to determine the sequence of α1, α3 and β globins from water buffalo (Bubalus bubalis) hemoglobins (phenotype AA). Two amino acid substitutions were detected in the buffalo β chain (Lys16 → His and Asn118 → His) whereas the α1 and α3 chains were found to contain four amino acid replacements, three of which were identical (Glu23 → Asp, Glu71 → Gly, Phe117 → Cys), and the insertion of an alanine residue in position 124. The only differences between α1 and α3 globins were identified in the C-terminal region; α1 contains a Phe residue at position 130 whereas α3 shows serine at position 132.

  13. The Protein Information Resource (PIR) and the PIR-International Protein Sequence Database.

    PubMed Central

    George, D G; Dodson, R J; Garavelli, J S; Haft, D H; Hunt, L T; Marzec, C R; Orcutt, B C; Sidman, K E; Srinivasarao, G Y; Yeh, L S; Arminski, L M; Ledley, R S; Tsugita, A; Barker, W C

    1997-01-01

    From its origin, the PIR has aspired to support research in computational biology and genomics through the compilation of a comprehensive, quality controlled and well-organized protein sequence information resource. The resource originated with the pioneering work of the late Margaret O. Dayhoff in the early 1960s. Since 1988, the Protein Sequence Database has been maintained collaboratively by PIR-International, an association of macromolecular sequence data collection centers dedicated to fostering international cooperation as an essential element in the development of scientific databases. The work of the resource is widely distributed and is available on the World Wide Web, via FTP, E-mail server, CD-ROM and magnetic media. It is widely redistributed and incorporated into many other protein sequence data compilations including SWISS-PROT and the Entrez system of the NCBI. PMID:9016497

  14. Identification of short peptide sequences in complex milk protein hydrolysates.

    PubMed

    O'Keeffe, Martina B; FitzGerald, Richard J

    2015-10-01

    Numerous low molecular mass bioactive peptides (BAPs) can be generated during the hydrolysis of bovine milk proteins. Low molecular mass BAP sequences are less likely to be broken down by digestive enzymes and are thus more likely to be active in vivo. However, the identification of short peptides remains a challenge during mass spectrometry (MS) analysis due to issues with the transfer and over-fragmentation of low molecular mass ions. A method is described herein using time-of-flight ESI-MS/MS to effectively fragment and identify short peptides. This includes (a) short synthetic peptides, (b) short peptides within a defined hydrolysate sample, i.e. a prolyl endoproteinase hydrolysate of β-casein and (c) short peptides within a complex hydrolysate, i.e. a Corolase PP digest of sodium caseinate. The methodology may find widespread utilisation in the efficient identification of low molecular mass peptide sequences in food protein hydrolysates. PMID:25872436

  15. Nucleotide sequence of the phosphoglycerate kinase gene from the extreme thermophile Thermus thermophilus. Comparison of the deduced amino acid sequence with that of the mesophilic yeast phosphoglycerate kinase.

    PubMed Central

    Bowen, D; Littlechild, J A; Fothergill, J E; Watson, H C; Hall, L

    1988-01-01

    Using oligonucleotide probes derived from amino acid sequencing information, the structural gene for phosphoglycerate kinase from the extreme thermophile, Thermus thermophilus, was cloned in Escherichia coli and its complete nucleotide sequence determined. The gene consists of an open reading frame corresponding to a protein of 390 amino acid residues (calculated Mr 41,791) with an extreme bias for G or C (93.1%) in the codon third base position. Comparison of the deduced amino acid sequence with that of the corresponding mesophilic yeast enzyme indicated a number of significant differences. These are discussed in terms of the unusual codon bias and their possible role in enhanced protein thermal stability. PMID:3052437
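    The third-position G or C bias quoted above (often called GC3) is a simple statistic of an open reading frame. A minimal sketch, assuming an in-frame DNA string whose length is a multiple of three, is shown below; the function name and the toy sequence are illustrative and are not the Thermus thermophilus gene.

```python
def gc3_fraction(orf):
    """Return the fraction of codons with G or C at the third position.

    orf: in-frame DNA sequence (A/C/G/T); its length must be a
    non-zero multiple of 3.
    """
    orf = orf.upper()
    if len(orf) == 0 or len(orf) % 3 != 0:
        raise ValueError("sequence length must be a non-zero multiple of 3")
    third_bases = orf[2::3]  # every third base, starting at codon position 3
    gc_count = sum(base in "GC" for base in third_bases)
    return gc_count / len(third_bases)


# Toy ORF with codons ATG, GCC, TAA: two of three end in G or C.
print(gc3_fraction("ATGGCCTAA"))
```

    Whether the stop codon is counted is a convention that varies between analyses; this sketch counts every codon it is given.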

  16. Successful Recovery of Nuclear Protein-Coding Genes from Small Insects in Museums Using Illumina Sequencing.

    PubMed

    Kanda, Kojun; Pflug, James M; Sproul, John S; Dasenko, Mark A; Maddison, David R

    2015-01-01

    In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles

  17. Successful Recovery of Nuclear Protein-Coding Genes from Small Insects in Museums Using Illumina Sequencing

    PubMed Central

    Dasenko, Mark A.

    2015-01-01

    In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles

  18. Computational identification of MoRFs in protein sequences

    PubMed Central

    Malhis, Nawar; Gsponer, Jörg

    2015-01-01

    Motivation: Intrinsically disordered regions of proteins play an essential role in the regulation of various biological processes. Key to their regulatory function is the binding of molecular recognition features (MoRFs) to globular protein domains in a process known as a disorder-to-order transition. Predicting the location of MoRFs in protein sequences with high accuracy remains an important computational challenge. Method: In this study, we introduce MoRFCHiBi, a new computational approach for fast and accurate prediction of MoRFs in protein sequences. MoRFCHiBi combines the outcomes of two support vector machine (SVM) models that take advantage of two different kernels with high noise tolerance. The first, SVMS, is designed to extract maximal information from the general contrast in amino acid compositions between MoRFs, their surrounding regions (Flanks), and the remainders of the sequences. The second, SVMT, is used to identify similarities between regions in a query sequence and MoRFs of the training set. Results: We evaluated the performance of our predictor by comparing its results with those of two currently available MoRF predictors, MoRFpred and ANCHOR. Using three test sets that have previously been collected and used to evaluate MoRFpred and ANCHOR, we demonstrate that MoRFCHiBi outperforms the other predictors with respect to different evaluation metrics. In addition, MoRFCHiBi is downloadable and fast, which makes it useful as a component in other computational prediction tools. Availability and implementation: http://www.chibi.ubc.ca/morf/. Contact: gsponer@chibi.ubc.ca. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25637562

  19. Functional analysis of bipartite begomovirus coat protein promoter sequences

    SciTech Connect

    Lacatus, Gabriela; Sunter, Garry

    2008-06-20

    We demonstrate that the AL2 gene of Cabbage leaf curl virus (CaLCuV) activates the CP promoter in mesophyll and acts to derepress the promoter in vascular tissue, similar to that observed for Tomato golden mosaic virus (TGMV). Binding studies indicate that sequences mediating repression and activation of the TGMV and CaLCuV CP promoter specifically bind different nuclear factors common to Nicotiana benthamiana, spinach and tomato. However, chromatin immunoprecipitation demonstrates that TGMV AL2 can interact with both sequences independently. Binding of nuclear protein(s) from different crop species to viral sequences conserved in both bipartite and monopartite begomoviruses, including TGMV, CaLCuV, Pepper golden mosaic virus and Tomato yellow leaf curl virus suggests that bipartite begomoviruses bind common host factors to regulate the CP promoter. This is consistent with a model in which AL2 interacts with different components of the cellular transcription machinery that bind viral sequences important for repression and activation of begomovirus CP promoters.

  20. Quantifying sequence and structural features of protein-RNA interactions.

    PubMed

    Li, Songling; Yamashita, Kazuo; Amada, Karlou Mar; Standley, Daron M

    2014-09-01

    Increasing awareness of the importance of protein-RNA interactions has motivated many approaches to predict residue-level RNA binding sites in proteins based on sequence or structural characteristics. Sequence-based predictors are usually high in sensitivity but low in specificity; conversely structure-based predictors tend to have high specificity, but lower sensitivity. Here we quantified the contribution of both sequence- and structure-based features as indicators of RNA-binding propensity using a machine-learning approach. In order to capture structural information for proteins without a known structure, we used homology modeling to extract the relevant structural features. Several novel and modified features enhanced the accuracy of residue-level RNA-binding propensity beyond what has been reported previously, including by meta-prediction servers. These features include: hidden Markov model-based evolutionary conservation, surface deformations based on the Laplacian norm formalism, and relative solvent accessibility partitioned into backbone and side chain contributions. We constructed a web server called aaRNA that implements the proposed method and demonstrate its use in identifying putative RNA binding sites. PMID:25063293

  1. GlobPlot: exploring protein sequences for globularity and disorder

    PubMed Central

    Linding, Rune; Russell, Robert B.; Neduva, Victor; Gibson, Toby J.

    2003-01-01

    A major challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Non-globular sequence segments often contain short linear peptide motifs (e.g. SH3-binding sites) which are important for protein function. We present here a new tool for discovery of such unstructured, or disordered regions within proteins. GlobPlot (http://globplot.embl.de) is a web service that allows the user to plot the tendency within the query protein for order/globularity and disorder. We show examples with known proteins where it successfully identifies inter-domain segments containing linear motifs, and also apparently ordered regions that do not contain any recognised domain. GlobPlot may be useful in domain hunting efforts. The plots indicate that instances of known domains may often contain additional N- or C-terminal segments that appear ordered. Thus GlobPlot may be of use in the design of constructs corresponding to globular proteins, as needed for many biochemical studies, particularly structural biology. GlobPlot has a pipeline interface—GlobPipe—for the advanced user to do whole proteome analysis. GlobPlot can also be used as a generic infrastructure package for graphical displaying of any possible propensity. PMID:12824398

  2. Binding of CCAAT displacement protein CDP to adenovirus packaging sequences.

    PubMed

    Erturk, Ece; Ostapchuk, Philomena; Wells, Susanne I; Yang, Jihong; Gregg, Keqin; Nepveu, Alain; Dudley, Jaquelin P; Hearing, Patrick

    2003-06-01

    Adenovirus (Ad) type 5 DNA packaging is initiated in a polar fashion from the left end of the genome. The packaging process is dependent upon the cis-acting packaging domain located between nucleotides 194 and 380. Seven A/T-rich repeats have been identified within this domain that direct packaging. A1, A2, A5, and A6 are the most important repeats functionally and share a bipartite sequence motif. Several lines of evidence suggest that there is a limiting trans-acting factor(s) that plays a role in packaging. Two cellular activities that bind to minimal packaging domains in vitro have been previously identified. These binding activities are P complex, an uncharacterized protein(s), and chicken ovalbumin upstream promoter transcription factor (COUP-TF). In this work, we report that a third cellular protein, octamer-1 protein (Oct-1), binds to minimal packaging domains. In vitro binding analyses and in vivo packaging assays were used to examine the relevance of these DNA binding activities to Ad DNA packaging. The results of these experiments reveal that COUP-TF and Oct-1 binding does not play a functional role in Ad packaging, whereas P-complex binding directly correlates with packaging function. We demonstrate that P complex contains the cellular protein CCAAT displacement protein (CDP) and that full-length CDP is found in purified virus particles. In addition to cellular factors, previous evidence indicates that viral factors play a role in the initiation of viral DNA packaging. We propose that CDP, in conjunction with one or more viral proteins, binds to the packaging sequences of Ad to initiate the encapsidation process. PMID:12743282

  3. Methods for optimizing the structure alphabet sequences of proteins.

    PubMed

    Dong, Qi-wen; Wang, Xiao-long; Lin, Lei

    2007-11-01

    Protein structure prediction based on fragment assembly has made great progress in recent years. Local protein structure prediction is receiving increased attention. One essential step of local protein structure prediction methods is that the three-dimensional conformations must be compressed into a one-dimensional series of letters of a structural alphabet. The traditional method assigns each structure fragment the structure alphabet letter that has the best local structure similarity. However, such a locally optimal structure alphabet sequence is not guaranteed to produce the globally optimal structure. This study presents two efficient methods that try to find the optimal structure alphabet sequence, which can model the native structures as accurately as possible. First, a 28-letter structure alphabet is derived by clustering fragments in Cartesian space with a fragment length of seven residues. The average quantization error of the 28 letters is 0.82 Å in terms of root mean square deviation. Then, two efficient methods are presented to encode the protein structures into series of structure alphabet letters, that is, the greedy and dynamic programming algorithms. They are tested on the PDB database using the structure alphabet developed in Cartesian coordinate space (our structure alphabet) and in torsion angle space (the PB structure alphabet), respectively. The experimental results show that these two methods can find approximately optimal structure alphabet sequences by searching a small fraction of the modeling space. The traditional local-optimization method achieves 26.27 Å root mean square deviation between the reconstructed structures and the native ones, while the modeling accuracy is improved to 3.28 Å by the greedy algorithm. The results are helpful for local protein structure prediction. PMID:17493604

  4. Isolation and characterization of adrenoleukodystrophy protein (ALDP) related sequences in the human genome

    SciTech Connect

    Geraghty, M.T.; Stetten, G.; Kearns, W.

    1994-09-01

    X-linked adrenoleukodystrophy (ALD) is a disorder of peroxisomal β-oxidation of very long chain fatty acids. It presents either as progressive dementia in childhood or as progressive paraparesis in later years. Adrenal insufficiency occurs in both phenotypes. The gene of the ALD protein has been mapped to Xq28 and has recently been cloned and characterized. The ALD protein has significant homology to the peroxisomal membrane protein, PMP70 and belongs to the ATP binding cassette superfamily of transporters. We screened a human genomic library with an ALDP cDNA and isolated 5 different but highly similar clones containing sequences corresponding to the 3′ end of the ALDP gene. Comparison of the sequences over the region corresponding to exon 9 through the 3′ end of the ALDP gene reveals approximately 96% nucleotide identity in both exonic and intronic regions. Splice sites and open reading frames are maintained. Using both FISH and human-rodent DNA mapping panels, we positively assign these ALDP-related sequences to chromosomes 2, 16 and 22, and provisionally to 1 and 20. Southern blot of primate DNA probed with a partial ALDP cDNA (exon 2-10) shows that expansion of ALDP-related sequences occurred in higher primates (chimp, gorilla and human). Although Northern blots show multiple ALDP-hybridizing transcripts in certain tissues, we have no evidence to date for expression of these ALDP-related sequences. In conclusion, our data show there has been an unusual and recent dispersal to multiple chromosomes of structural gene sequences related to the ALDP gene. The functional significance of these sequences remains to be determined but their existence complicates PCR and mutation analysis of the ALDP gene.

  5. A New Hidden Markov Model for Protein Quality Assessment Using Compatibility Between Protein Sequence and Structure

    PubMed Central

    He, Zhiquan; Ma, Wenji; Zhang, Jingfen; Xu, Dong

    2015-01-01

    Protein structure Quality Assessment (QA) is an essential component in protein structure prediction and analysis. The relationship between protein sequence and structure often serves as a basis for protein structure QA. In this work, we developed a new Hidden Markov Model (HMM) to assess the compatibility of protein sequence and structure for capturing their complex relationship. More specifically, the emission of the HMM consists of protein local structures in angular space, secondary structures, and sequence profiles. This model has two capabilities: (1) encoding local structure of each position by jointly considering sequence and structure information, and (2) assigning a global score to estimate the overall quality of a predicted structure, as well as local scores to assess the quality of specific regions of a structure, which provides useful guidance for targeted structure refinement. We compared the HMM to the state-of-the-art single-structure quality assessment methods OPUSCA, DFIRE, GOAP, and RW in protein structure selection. Computational results showed our new score HMM.Z can achieve better overall selection performance on the benchmark datasets. PMID:26221066

  6. Properties of Sequence Conservation in Upstream Regulatory and Protein Coding Sequences among Paralogs in Arabidopsis thaliana

    NASA Astrophysics Data System (ADS)

    Richardson, Dale N.; Wiehe, Thomas

    Whole genome duplication (WGD) has catalyzed the formation of new species, genes with novel functions, altered expression patterns, complexified signaling pathways and has provided organisms a level of genetic robustness. We studied the long-term evolution and interrelationships of 5’ upstream regulatory sequences (URSs), protein coding sequences (CDSs) and expression correlations (EC) of duplicated gene pairs in Arabidopsis. Three distinct methods revealed significant evolutionary conservation between paralogous URSs and were highly correlated with microarray-based expression correlation of the respective gene pairs. Positional information on exact matches between sequences unveiled the contribution of micro-chromosomal rearrangements on expression divergence. A three-way rank analysis of URS similarity, CDS divergence and EC uncovered specific gene functional biases. Transcription factor activity was associated with gene pairs exhibiting conserved URSs and divergent CDSs, whereas a broad array of metabolic enzymes was found to be associated with gene pairs showing diverged URSs but conserved CDSs.

  7. Substrate-Driven Mapping of the Degradome by Comparison of Sequence Logos

    PubMed Central

    Fuchs, Julian E.; von Grafenstein, Susanne; Huber, Roland G.; Kramer, Christian; Liedl, Klaus R.

    2013-01-01

    Sequence logos are frequently used to illustrate substrate preferences and specificity of proteases. Here, we employed the compiled substrates of the MEROPS database to introduce a novel metric for comparison of protease substrate preferences. The constructed similarity matrix of 62 proteases can be used to intuitively visualize similarities in protease substrate readout via principal component analysis and construction of protease specificity trees. Since our new metric is solely based on substrate data, we can engraft the protease tree including proteolytic enzymes of different evolutionary origin. Thereby, our analyses confirm pronounced overlaps in substrate recognition not only between proteases closely related on a sequence basis but also between proteolytic enzymes of different evolutionary origin and catalytic type. To illustrate the applicability of our approach we analyze the distribution of targets of small molecules from the ChEMBL database in our substrate-based protease specificity trees. We observe a striking clustering of annotated targets in tree branches even though these grouped targets do not necessarily share similarity at the protein sequence level. This highlights the value and applicability of knowledge acquired from peptide substrates in drug design of small molecules, e.g., for the prediction of off-target effects or drug repurposing. Consequently, our similarity metric allows mapping of the degradome and its associated drug target network via comparison of known substrate peptides. The substrate-driven view of protein-protein interfaces is not limited to the field of proteases but can be applied to any target class where a sufficient amount of known substrate data is available. PMID:24244149

  8. Identification of Sequences Encoding Symbiodinium minutum Mitochondrial Proteins.

    PubMed

    Butterfield, Erin R; Howe, Christopher J; Nisbet, R Ellen R

    2016-02-01

    The dinoflagellates are an extremely diverse group of algae closely related to the Apicomplexa and the ciliates. Much work has previously been undertaken to determine the presence of various biochemical pathways within dinoflagellate mitochondria. However, these studies were unable to identify several key transcripts including those encoding proteins involved in the pyruvate dehydrogenase complex, iron-sulfur cluster biosynthesis, and protein import. Here, we analyze the draft nuclear genome of the dinoflagellate Symbiodinium minutum, as well as RNAseq data to identify nuclear genes encoding mitochondrial proteins. The results confirm the presence of a complete tricarboxylic acid cycle in the dinoflagellates. Results also demonstrate the difficulties in using the genome sequence for the identification of genes due to the large number of introns, but show that it is highly useful for the determination of gene duplication events. PMID:26798115

  9. Identification of Sequences Encoding Symbiodinium minutum Mitochondrial Proteins

    PubMed Central

    Butterfield, Erin R.; Howe, Christopher J.; Nisbet, R. Ellen R.

    2016-01-01

    The dinoflagellates are an extremely diverse group of algae closely related to the Apicomplexa and the ciliates. Much work has previously been undertaken to determine the presence of various biochemical pathways within dinoflagellate mitochondria. However, these studies were unable to identify several key transcripts including those encoding proteins involved in the pyruvate dehydrogenase complex, iron–sulfur cluster biosynthesis, and protein import. Here, we analyze the draft nuclear genome of the dinoflagellate Symbiodinium minutum, as well as RNAseq data to identify nuclear genes encoding mitochondrial proteins. The results confirm the presence of a complete tricarboxylic acid cycle in the dinoflagellates. Results also demonstrate the difficulties in using the genome sequence for the identification of genes due to the large number of introns, but show that it is highly useful for the determination of gene duplication events. PMID:26798115

  10. Whole Chloroplast Genome Sequencing in Fragaria Using Deep Sequencing: A Comparison of Three Methods

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Chloroplast sequences previously investigated in Fragaria revealed low amounts of variation. Deep sequencing technologies enable economical sequencing of complete chloroplast genomes. These sequences can potentially provide robust phylogenetic resolution, even at low taxonomic levels within plant gr...

  11. Evaluation of intra- and interspecific divergence of satellite DNA sequences by nucleotide frequency calculation and pairwise sequence comparison

    PubMed Central

    2003-01-01

    Satellite DNA sequences are known to be highly variable and to have been subjected to concerted evolution that homogenizes member sequences within species. We have analyzed the mode of evolution of satellite DNA sequences in four fishes from the genus Diplodus by calculating the nucleotide frequency of the sequence array and the phylogenetic distances between member sequences. Calculation of nucleotide frequency and pairwise sequence comparison enabled us to characterize the divergence among member sequences in this satellite DNA family. The results suggest that the evolutionary rate of satellite DNA in D. bellottii is about two-fold greater than the average of the other three fishes, and that the sequence homogenization event occurred in D. puntazzo more recently than in the others. The procedures described here are effective to characterize mode of evolution of satellite DNA. PMID:12734555

  12. Evaluation of intra- and interspecific divergence of satellite DNA sequences by nucleotide frequency calculation and pairwise sequence comparison.

    PubMed

    Kato, Mikio

    2003-01-01

    Satellite DNA sequences are known to be highly variable and to have been subjected to concerted evolution that homogenizes member sequences within species. We have analyzed the mode of evolution of satellite DNA sequences in four fishes from the genus Diplodus by calculating the nucleotide frequency of the sequence array and the phylogenetic distances between member sequences. Calculation of nucleotide frequency and pairwise sequence comparison enabled us to characterize the divergence among member sequences in this satellite DNA family. The results suggest that the evolutionary rate of satellite DNA in D. bellottii is about two-fold greater than the average of the other three fishes, and that the sequence homogenization event occurred in D. puntazzo more recently than in the others. The procedures described here are effective to characterize mode of evolution of satellite DNA. PMID:12734555
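
    The two calculations named in this abstract, per-position nucleotide frequency and pairwise sequence comparison, can be sketched as follows. This is a minimal illustration that assumes pre-aligned, equal-length repeat units and uses the uncorrected proportion of differing sites (p-distance) as the pairwise measure; the abstract does not state which distance the author used, and the example sequences are invented.

```python
from collections import Counter

def nucleotide_frequencies(sequences):
    """Return per-column base frequencies for aligned, equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    table = []
    for column in zip(*sequences):
        counts = Counter(column)
        table.append({base: counts[base] / len(sequences) for base in "ACGT"})
    return table

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical aligned repeat units from one satellite DNA array.
repeats = ["ACGTAC", "ACGTTC", "ACCTAC"]

frequencies = nucleotide_frequencies(repeats)
distances = {
    (i, j): p_distance(repeats[i], repeats[j])
    for i in range(len(repeats))
    for j in range(i + 1, len(repeats))
}
print(frequencies[2])   # base frequencies at the third column
print(distances)
```

    Columns with a single dominant base indicate homogenized positions, while the spread of pairwise distances gives a rough view of divergence among member sequences.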

  13. DNA topology confers sequence specificity to nonspecific architectural proteins.

    PubMed

    Wei, Juan; Czapla, Luke; Grosner, Michael A; Swigon, David; Olson, Wilma K

    2014-11-25

    Topological constraints placed on short fragments of DNA change the disorder found in chain molecules randomly decorated by nonspecific, architectural proteins into tightly organized 3D structures. The bacterial heat-unstable (HU) protein builds up, counter to expectations, in greater quantities and at particular sites along simulated DNA minicircles and loops. Moreover, the placement of HU along loops with the "wild-type" spacing found in the Escherichia coli lactose (lac) and galactose (gal) operons precludes access to key recognition elements on DNA. The HU protein introduces a unique spatial pathway in the DNA upon closure. The many ways in which the protein induces nearly the same closed circular configuration point to the statistical advantage of its nonspecificity. The rotational settings imposed on DNA by the repressor proteins, by contrast, introduce sequential specificity in HU placement, with the nonspecific protein accumulating at particular loci on the constrained duplex. Thus, an architectural protein with no discernible DNA sequence-recognizing features becomes site-specific and potentially assumes a functional role upon loop formation. The locations of HU on the closed DNA reflect long-range mechanical correlations. The protein responds to DNA shape and deformability—the stiff, naturally straight double-helical structure—rather than to the unique features of the constituent base pairs. The structures of the simulated loops suggest that HU architecture, like nucleosomal architecture, which modulates the ability of regulatory proteins to recognize their binding sites in the context of chromatin, may influence repressor-operator interactions in the context of the bacterial nucleoid. PMID:25385626

  14. Comparison of the Folding Mechanism of Highly Homologous Proteins in the Lipid-binding Protein Family

    EPA Science Inventory

    The folding mechanisms of two closely related proteins in the intracellular lipid binding protein family, human bile acid binding protein (hBABP) and rat bile acid binding protein (rBABP), were examined. These proteins are 77% identical (93% similar) in sequence. Both of these singl...

  15. Phosphatidylinositol transfer proteins: sequence motifs in structural and evolutionary analyses

    PubMed Central

    Wyckoff, Gerald J.; Solidar, Ada; Yoden, Marilyn D.

    2016-01-01

    Phosphatidylinositol transfer proteins (PITP) are a family of monomeric proteins that bind and transfer phosphatidylinositol and phosphatidylcholine between membrane compartments. They are required for production of inositol and diacylglycerol second messengers, and are found in most metazoan organisms. While PITPs are known to carry out crucial cell-signaling roles in many organisms, the structure, function and evolution of the majority of family members remain unexplored, primarily because the ubiquity and diversity of the family thwarts traditional methods of global alignment. To surmount this obstacle, we instead took a novel approach, using MEME and a parsimony-based analysis to create a cladogram of conserved sequence motifs in 56 PITP family proteins from 26 species. In keeping with previous functional annotations, three clades were supported within our evolutionary analysis: two classes of soluble proteins and a class of membrane-associated proteins. By focusing on conserved regions, the analysis allowed for in-depth queries regarding possible functional roles of PITP proteins in both intra- and extracellular signaling.

  16. Probing protein sequences as sources for encrypted antimicrobial peptides.

    PubMed

    Brand, Guilherme D; Magalhães, Mariana T Q; Tinoco, Maria L P; Aragão, Francisco J L; Nicoli, Jacques; Kelly, Sharon M; Cooper, Alan; Bloch, Carlos

    2012-01-01

    Starting from the premise that a wealth of potentially biologically active peptides may lurk within proteins, we describe here a methodology to identify putative antimicrobial peptides encrypted in protein sequences. Candidate peptides were identified using a new screening procedure based on physicochemical criteria to reveal matching peptides within protein databases. Fifteen such peptides, along with a range of natural antimicrobial peptides, were examined using DSC and CD to characterize their interaction with phospholipid membranes. Principal component analysis of DSC data shows that the investigated peptides group according to their effects on the main phase transition of phospholipid vesicles, and that these effects correlate both to antimicrobial activity and to the changes in peptide secondary structure. Consequently, we have been able to identify novel antimicrobial peptides from larger proteins not hitherto associated with such activity, mimicking endogenous and/or exogenous microorganism enzymatic processing of parent proteins to smaller bioactive molecules. A biotechnological application for this methodology is explored. Soybean (Glycine max) plants, transformed to include a putative antimicrobial protein fragment encoded in its own genome were tested for tolerance against Phakopsora pachyrhizi, the causative agent of the Asian soybean rust. This procedure may represent an inventive alternative to the transgenic technology, since the genetic material to be used belongs to the host organism and not to exogenous sources. PMID:23029273

  17. Detecting pore-lining regions in transmembrane protein sequences

    PubMed Central

    2012-01-01

    Background: Alpha-helical transmembrane channel and transporter proteins play vital roles in a diverse range of essential biological processes and are crucial in facilitating the passage of ions and molecules across the lipid bilayer. However, the experimental difficulties associated with obtaining high quality crystals have led to their significant under-representation in structural databases. Computational methods that can identify structural features from sequence alone are therefore of high importance. Results: We present a method capable of automatically identifying pore-lining regions in transmembrane proteins from sequence information alone, which can then be used to determine the pore stoichiometry. By labelling pore-lining residues in crystal structures using geometric criteria, we have trained a support vector machine classifier to predict the likelihood of a transmembrane helix being involved in pore formation. Results from testing this approach under stringent cross-validation indicate that prediction accuracy of 72% is possible, while a support vector regression model is able to predict the number of subunits participating in the pore with 62% accuracy. Conclusion: To our knowledge, this is the first tool capable of identifying pore-lining regions in proteins and we present the results of applying it to a data set of sequences with available crystal structures. Our method provides a way to characterise pores in transmembrane proteins and may even provide a starting point for discovering novel routes of therapeutic intervention in a number of important diseases. This software is freely available as source code from: http://bioinf.cs.ucl.ac.uk/downloads/memsat-svm/. PMID:22805427

  18. Sequence comparisons in the aminoacyl-tRNA synthetases with emphasis on regions of likely homology with sequences in the Rossmann fold in the methionyl and tyrosyl enzymes.

    PubMed

    Walker, E J; Jeffrey, P D

    1988-02-01

    Amino acid sequences of aminoacyl-tRNA synthetases specific for 12 different amino acids have now been published. Differences in origin at the species and organelle level result in 20 distinct sequences being available for comparison. Some of these were compared in small groups as they were determined and, although some homologies were detected, it was generally concluded that there was surprisingly little sequence homology in this functionally related group of enzymes. We have made comparisons of all of the available sequences by using a combination of computer and manual alignment methods and knowledge of the sequences in the Rossmann fold region of methionyl-tRNA synthetase from E. coli and tyrosyl-tRNA synthetase from B. stearothermophilus, enzymes whose three-dimensional structures have been described. It emerges that all of the aminoacyl-tRNA synthetase sequences thus examined show considerable homology with each other over at least parts of this region, some over virtually all of it. We conclude that a great deal more similarity than had previously been suspected exists in these proteins. In particular, the alignments we have made strongly imply the existence of a mononucleotide binding site of the Rossmann fold configuration in all of the synthetases compared. PMID:3283733

  19. Alignment-free comparison of genome sequences by a new numerical characterization.

    PubMed

    Huang, Guohua; Zhou, Houqing; Li, Yongfan; Xu, Lixin

    2011-07-21

    In order to compare different genome sequences, an alignment-free method is proposed. First, we present a new graphical representation of DNA sequences without degeneracy, which is conducive to intuitive comparison of sequences. Then, a new numerical characterization based on the representation is introduced to quantitatively depict the intrinsic nature of genome sequences, and is treated as a 10-dimensional vector in the mathematical space. Alignment-free comparison of sequences is performed by computing the distances between the vectors of the corresponding numerical characterizations, which define the evolutionary relationship. Two data sets of DNA sequences were constructed to assess the performance on sequence comparison. The results illustrate the validity of the method. The new numerical characterization provides a powerful tool for genome comparison. PMID:21536050
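
    The general idea of alignment-free comparison, mapping each sequence to a fixed-length numerical vector and comparing vectors by distance, can be sketched as below. The abstract does not define its 10-dimensional characterization, so this sketch swaps in a plain 16-dimensional dinucleotide frequency vector as a stand-in; the function names and example sequences are assumptions for illustration.

```python
import math
from itertools import product

# All 16 dinucleotides in a fixed order, used as vector coordinates.
KMERS = ["".join(pair) for pair in product("ACGT", repeat=2)]

def kmer_vector(sequence):
    """Map a DNA sequence to its normalized dinucleotide frequency vector."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(sequence) - 1):
        kmer = sequence[i:i + 2]
        if kmer in counts:          # skip windows containing ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in KMERS]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical sequences of different lengths; no alignment is needed.
genome_a = "ACGTACGTACGTTTGA"
genome_b = "ACGTACGAACGT"
print(euclidean(kmer_vector(genome_a), kmer_vector(genome_b)))
```

    Because each sequence is reduced to a vector of fixed dimension, sequences of unequal length can be compared directly, which is the practical appeal of this family of methods.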

  20. Prediction of neddylation sites from protein sequences and sequence-derived properties

    PubMed Central

    2015-01-01

    Background: Neddylation is a reversible post-translational modification that plays a vital role in maintaining cellular machinery. It has been shown to affect the localization, binding partners and structure of target proteins. Disruption of protein neddylation has been observed in various diseases such as Alzheimer's and cancer. Therefore, understanding the neddylation mechanism and determining neddylation targets is likely to be of great importance in further understanding cellular processes. This study is the first attempt to predict neddylated sites from protein sequences by using several sequence and sequence-based structural features. Results: We have developed a neddylation site prediction method using a support vector machine based on various sequence properties, position-specific scoring matrices, and disorder. Using 21 amino acid long lysine-centred windows, our model was able to predict neddylation sites successfully, with an average 5-fold stratified cross validation performance of 0.91, 0.91, 0.75, 0.44, 0.95 for accuracy, specificity, sensitivity, Matthews correlation coefficient and area under curve, respectively. Independent test set results validated the robustness of the reported method. Additionally, we observed that neddylation sites are commonly flexible and that there is a significant presence of positively charged amino acids in neddylation sites. Conclusions: In this study, a neddylation site prediction method was developed for the first time in the literature. Common characteristics of neddylation sites and their discriminative properties were explored for further in silico studies on neddylation. Lastly, an up-to-date neddylation dataset is provided for researchers working on post-translational modifications in the accompanying supplementary material of this article. PMID:26679222
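
    Two ingredients mentioned in this abstract lend themselves to a short sketch: extracting 21-residue windows centred on each lysine, and computing the Matthews correlation coefficient from a confusion matrix. The padding character used near the sequence termini, the function names, and the example sequence are assumptions; the abstract does not say how the authors handled windows that run past the ends of a protein.

```python
import math

def lysine_windows(sequence, flank=10, pad="X"):
    """Return (1-based position, window) for every lysine in the sequence.

    Each window holds `flank` residues on either side of the lysine, so the
    default gives 21-residue windows. Termini are padded with `pad`, which
    is an assumed convention rather than one taken from the abstract.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for index, residue in enumerate(sequence):
        if residue == "K":
            windows.append((index + 1, padded[index:index + 2 * flank + 1]))
    return windows

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical protein fragment with two lysines.
protein = "MSTAKLVEDGRKPLA"
for position, window in lysine_windows(protein):
    print(position, window)
```

    Each window would then be converted to features (for example, the sequence properties and position-specific scores the abstract lists) before being passed to a classifier.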

  1. Sequence Heterogeneity Accelerates Protein Search for Targets on DNA

    NASA Astrophysics Data System (ADS)

    Shvets, Alexey; Kolomeisky, Anatoly

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry and heterogeneity of a genome. The work was supported by the Welch Foundation (Grant C-1559), by the NSF (Grant CHE-1360979), and by the Center for Theoretical Biological Physics sponsored by the NSF (Grant PHY-1427654).

  2. A minimal sequence code for switching protein structure and function.

    PubMed

    Alexander, Patrick A; He, Yanan; Chen, Yihong; Orban, John; Bryan, Philip N

    2009-12-15

    We present here a structural and mechanistic description of how a protein changes its fold and function, mutation by mutation. Our approach was to create 2 proteins that (i) are stably folded into 2 different folds, (ii) have 2 different functions, and (iii) are very similar in sequence. In this simplified sequence space we explore the mutational path from one fold to another. We show that an IgG-binding, 4β+α fold can be transformed into an albumin-binding, 3-α fold via a mutational pathway in which neither function nor native structure is completely lost. The stabilities of all mutants along the pathway are evaluated, key high-resolution structures are determined by NMR, and an explanation of the switching mechanism is provided. We show that the conformational switch from 4β+α to 3-α structure can occur via a single amino acid substitution. On one side of the switch point, the 4β+α fold is >90% populated (pH 7.2, 20 °C). A single mutation switches the conformation to the 3-α fold, which is >90% populated (pH 7.2, 20 °C). We further show that a bifunctional protein exists at the switch point with affinity for both IgG and albumin. PMID:19923431

  3. MutaBind estimates and interprets the effects of sequence variants on protein-protein interactions.

    PubMed

    Li, Minghui; Simonetti, Franco L; Goncearenco, Alexander; Panchenko, Anna R

    2016-07-01

    Proteins engage in highly selective interactions with their macromolecular partners. Sequence variants that alter protein binding affinity may cause significant perturbations or complete abolishment of function, potentially leading to diseases. There exists a persistent need to develop a mechanistic understanding of impacts of variants on proteins. To address this need we introduce a new computational method MutaBind to evaluate the effects of sequence variants and disease mutations on protein interactions and calculate the quantitative changes in binding affinity. The MutaBind method uses molecular mechanics force fields, statistical potentials and fast side-chain optimization algorithms. The MutaBind server maps mutations on a structural protein complex, calculates the associated changes in binding affinity, determines the deleterious effect of a mutation, estimates the confidence of this prediction and produces a mutant structural model for download. MutaBind can be applied to a large number of problems, including determination of potential driver mutations in cancer and other diseases, elucidation of the effects of sequence variants on protein fitness in evolution and protein design. MutaBind is available at http://www.ncbi.nlm.nih.gov/projects/mutabind/. PMID:27150810

  4. Phenotypic comparisons of consensus variants versus laboratory resurrections of Precambrian proteins.

    PubMed

    Risso, Valeria A; Gavira, Jose A; Gaucher, Eric A; Sanchez-Ruiz, Jose M

    2014-06-01

    Consensus-sequence engineering has generated protein variants with enhanced stability, and sometimes, with modulated biological function. Consensus mutations are often interpreted as the introduction of ancestral amino acid residues. However, the precise relationship between consensus engineering and ancestral protein resurrection is not fully understood. Here, we report the properties of proteins encoded by consensus sequences derived from a multiple sequence alignment of extant, class A β-lactamases, as compared with the properties of ancient Precambrian β-lactamases resurrected in the laboratory. These comparisons considered primary sequence, secondary, and tertiary structure, as well as stability and catalysis against different antibiotics. Out of the three consensus variants generated, one could not be expressed and purified (likely due to misfolding and/or low stability) and only one displayed substantial stability having substrate promiscuity, although to a lower extent than ancient β-lactamases. These results: (i) highlight the phenotypic differences between consensus variants and laboratory resurrections of ancestral proteins; (ii) question interpretations of consensus proteins as phenotypic proxies of ancestral proteins; and (iii) support the notion that ancient proteins provide a robust approach toward the preparation of protein variants having large numbers of mutational changes while possessing unique biomolecular properties. PMID:24710963
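
    Deriving a consensus sequence from a multiple sequence alignment, the starting point of the consensus engineering described in this abstract, can be sketched as a column-wise majority vote. This is a minimal illustration only: the alignment is invented, gaps are simply ignored when counting, and ties are broken by first occurrence, none of which is taken from the paper.

```python
from collections import Counter

def consensus(alignment, gap="-"):
    """Return the most frequent non-gap residue at each alignment column.

    Columns that contain only gaps yield the gap character. Ties are broken
    by whichever residue appears first in the column.
    """
    result = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != gap)
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)

# Hypothetical aligned fragments of related proteins.
alignment = [
    "MKTAYI-K",
    "MRTAYIAK",
    "MKSAYLAK",
]
print(consensus(alignment))
```

    An ancestral reconstruction, by contrast, would infer each residue on a phylogenetic tree rather than by simple frequency, which is one reason the two approaches can yield different proteins.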

  5. Comparison of Complete Genome Sequences of Usutu Virus Strains Detected in Spain, Central Europe, and Africa

    PubMed Central

    Busquets, Núria; Nowotny, Norbert

    2014-01-01

    The complete genomic sequence of Usutu virus (USUV, genus Flavivirus, family Flaviviridae) strain MB119/06, detected in a pool of Culex pipiens mosquitoes in northeastern Spain (Viladecans, Catalonia) in 2006, was determined and analyzed. The phylogenetic relationship with all other available complete USUV genome sequences was established. The Spanish sequence investigated showed the closest relationship to the USUV prototype strain SA AR 1776 isolated in South Africa in 1959 (96.9% nucleotide and 98.8% amino acid identities). Conserved structural elements and enzyme motifs of the putative polyprotein precursor were identified. Unique amino acid substitutions were recognized; however, their potential roles as virulence markers could not be verified. Comparisons of the polyprotein precursor sequences of USUV strains detected in mosquitoes, birds, and humans could not confirm the predicted role of unique amino acid substitutions in relation to virulence in humans. Phylogenetic analysis of a partial coding section of the NS5 protein gene region indicated that USUV strains circulating in Europe form three different genetic clusters. Broad and targeted surveys for USUV in mosquitoes could reveal further details of the geographic distribution and genetic diversity of the virus in Europe and in Africa. PMID:24746182

  6. Amino acid sequence of the Amur tiger prion protein.

    PubMed

    Wu, Changde; Pang, Wanyong; Zhao, Deming

    2006-10-01

    Prion diseases are fatal neurodegenerative disorders in humans and animals associated with conformational conversion of a cellular prion protein (PrP(C)) into the pathologic isoform (PrP(Sc)). Various data indicate that polymorphisms within the open reading frame (ORF) of PrP are associated with susceptibility and control the species barrier in prion diseases. In the present study, partial Prnp from 25 Amur tigers (tPrnp) were cloned and screened for polymorphisms. Four single nucleotide polymorphisms (T423C, A501G, C511A, A610G) were found; the C511A and A610G nucleotide substitutions resulted in the amino acid changes Lysine171Glutamine and Alanine204Threonine, respectively. The tPrnp amino acid sequence is similar to those of the house cat (Felis catus) and sheep, but differs significantly from two other cat Prnp sequences that were previously deposited in GenBank. PMID:16780982
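
    The relationship between the nucleotide positions and the residue numbers quoted in this abstract can be checked with a small piece of arithmetic. The sketch assumes that nucleotide numbering starts at the first base of the open reading frame, which the abstract does not state explicitly; the function name is hypothetical.

```python
def codon_position(nucleotide):
    """Map a 1-based ORF nucleotide position to (codon number, position in codon)."""
    codon = (nucleotide - 1) // 3 + 1
    position_in_codon = (nucleotide - 1) % 3 + 1
    return codon, position_in_codon

# The four polymorphic sites reported in the abstract.
for site in (423, 501, 511, 610):
    codon, position = codon_position(site)
    print(f"nucleotide {site}: codon {codon}, codon position {position}")
```

    Under that numbering assumption, sites 511 and 610 fall at the first position of codons 171 and 204, matching the residue numbers given for the two amino acid changes, while sites 423 and 501 fall at third codon positions.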

  7. Sequence-Specific Solvent Accessibilities of Protein Residues in Unfolded Protein Ensembles

    PubMed Central

    Bernadó, Pau; Blackledge, Martin; Sancho, Javier

    2006-01-01

    Protein stability cannot be understood without the correct description of the unfolded state. We present here an efficient method for accurate calculation of atomic solvent exposures for denatured protein ensembles. The method used to generate the ensembles has been shown to reproduce diverse biophysical experimental data corresponding to natively and chemically unfolded proteins. Using a data set of 19 nonhomologous proteins containing from 98 to 579 residues, we report average accessibilities for all residue types. These averaged accessibilities are considerably lower than those previously reported for tripeptides and close to the lower limit reported by Creamer and co-workers. Of importance, we observe remarkable sequence dependence for the exposure to solvent of all residue types, which indicates that average residue solvent exposures can be inappropriate to interpret mutational studies. In addition, we observe smaller influences of both protein size and protein amino acid composition in the averaged residue solvent exposures for individual proteins. Calculating residue-specific solvent accessibilities within the context of real sequences is thus necessary and feasible. The approach presented here may allow a more precise parameterization of protein energetics as a function of polar- and apolar-area burial and opens new ways to investigate the energetics of the unfolded state of proteins. PMID:17012314

  8. Complete genome sequence of the hyperthermophilic archaeon Thermococcus kodakaraensis KOD1 and comparison with Pyrococcus genomes

    PubMed Central

    Fukui, Toshiaki; Atomi, Haruyuki; Kanai, Tamotsu; Matsumi, Rie; Fujiwara, Shinsuke; Imanaka, Tadayuki

    2005-01-01

    The genus Thermococcus, comprised of sulfur-reducing hyperthermophilic archaea, belongs to the order Thermococcales in Euryarchaeota along with the closely related genus Pyrococcus. The members of Thermococcus are ubiquitously present in natural high-temperature environments, and are therefore considered to play a major role in the ecology and metabolic activity of microbial consortia within hot-water ecosystems. To obtain insight into this important genus, we have determined and annotated the complete 2,088,737-base genome of Thermococcus kodakaraensis strain KOD1, followed by a comparison with the three complete genomes of Pyrococcus spp. A total of 2306 coding DNA sequences (CDSs) have been identified, among which half (1165 CDSs) are annotatable, whereas the functions of 41% (936 CDSs) cannot be predicted from the primary structures. The genome contains seven genes for probable transposases and four virus-related regions. Several proteins within these genetic elements show high similarities to those in Pyrococcus spp., implying the natural occurrence of horizontal gene transfer of such mobile elements among the order Thermococcales. Comparative genomics clarified that 1204 proteins, including those for information processing and basic metabolisms, are shared among T. kodakaraensis and the three Pyrococcus spp. On the other hand, among the set of 689 proteins unique to T. kodakaraensis, there are several intriguing proteins that might be responsible for the specific trait of the genus Thermococcus, such as proteins involved in additional pyruvate oxidation, nucleotide metabolisms, unique or additional metal ion transporters, improved stress response system, and a distinct restriction system. PMID:15710748

  9. Identification of Sequence Specificity of 5-Methylcytosine Oxidation by Tet1 Protein with High-Throughput Sequencing.

    PubMed

    Kizaki, Seiichiro; Chandran, Anandhakumar; Sugiyama, Hiroshi

    2016-03-01

    Tet (ten-eleven translocation) family proteins have the ability to oxidize 5-methylcytosine (mC) to 5-hydroxymethylcytosine (hmC), 5-formylcytosine (fC), and 5-carboxycytosine (caC). However, the oxidation reaction of Tet is not understood completely. Evaluation of genomic-level epigenetic changes by Tet protein requires unbiased identification of the highly selective oxidation sites. In this study, we used high-throughput sequencing to investigate the sequence specificity of mC oxidation by Tet1. A 6.6×10^4-member mC-containing random DNA-sequence library was constructed. The library was subjected to Tet-reactive pulldown followed by high-throughput sequencing. Analysis of the obtained sequence data identified the Tet1-reactive sequences. We identified mCpG as a highly reactive sequence of Tet1 protein. PMID:26715454

  10. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

    PubMed

    Pruitt, Kim D; Tatusova, Tatiana; Maglott, Donna R

    2005-01-01

    The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) provides a non-redundant collection of sequences representing genomic data, transcripts and proteins. Although the goal is to provide a comprehensive dataset representing the complete sequence information for any given species, the database pragmatically includes sequence data that are currently publicly available in the archival databases. The database incorporates data from over 2400 organisms and includes over one million proteins representing significant taxonomic diversity spanning prokaryotes, eukaryotes and viruses. Nucleotide and protein sequences are explicitly linked, and the sequences are linked to other resources including the NCBI Map Viewer and Gene. Sequences are annotated to include coding regions, conserved domains, variation, references, names, database cross-references, and other features using a combined approach of collaboration and other input from the scientific community, automated annotation, propagation from GenBank and curation by NCBI staff. PMID:15608248

  11. Fibronectin-binding protein of Streptococcus pyogenes: sequence of the binding domain involved in adherence of streptococci to epithelial cells.

    PubMed Central

    Talay, S R; Valentin-Weigand, P; Jerlström, P G; Timmis, K N; Chhatwal, G S

    1992-01-01

    The sequence of the fibronectin-binding domain of the fibronectin-binding protein of Streptococcus pyogenes (Sfb protein) was determined, and its role in streptococcal adherence was investigated by use of an Sfb fusion protein in adherence studies. A 1-kb DNA fragment coding for the binding domain of Sfb protein was cloned into the expression vector pEX31 to produce an Sfb fusion protein consisting of the N-terminal part of MS2 polymerase and a C-terminal fragment of the streptococcal protein. Induction of the vector promoter resulted in hyperexpression of fibronectin-binding fusion protein in the cytoplasm of the recombinant Escherichia coli cells. Sequence determination of the cloned 1-kb fragment revealed an in-frame reading frame for a 268-amino-acid peptide composed of a 37-amino-acid sequence which is completely repeated three times and incompletely repeated a fourth time. Cloning of one repeat into pEX31 resulted in expression of small fusion peptides that show fibronectin-binding activity, indicating that one repeat contains at least one binding domain. Each repeat exhibits two charged domains and shows high homology with the 38-amino-acid D3 repeat of the fibronectin-binding protein of Staphylococcus aureus. Sequence comparison with other streptococcal ligand-binding surface proteins, including M protein, failed to reveal significant homology, which suggests that Sfb protein represents a novel type of functional protein in S. pyogenes. The Sfb fusion protein isolated from the cytoplasm of recombinant cells was purified by fast protein liquid chromatography. It showed a strong competitive inhibition of fibronectin binding to S. pyogenes and of the adherence of bacteria to cultured epithelial cells. In contrast, purified streptococcal lipoteichoic acid showed only a weak inhibition of fibronectin binding and streptococcal adherence. These results demonstrate that Sfb protein is directly involved in the fibronectin-mediated adherence of S. pyogenes to

  12. Evolution of Protein-binding DNA Sequences through Competitive Binding

    NASA Astrophysics Data System (ADS)

    Peng, Weiqun; Gerland, Ulrich; Hwa, Terence; Levine, Herbert

    2002-03-01

    The dynamics of in vitro DNA evolution controlled via competitive binding of DNA sequences to proteins has been explored in a recent serial transfer experiment (B. Dubertret, S. Liu, Q. Ouyang, A. Libchaber, Phys. Rev. Lett. 86, 6022 (2001)). Motivated by the experiment, we investigate a continuum model for this evolution process in various parameter regimes. We establish a self-consistent mean-field evolution equation, determine its dynamical properties and finite population size corrections. In addition, we discuss the experimental implications of our results.

  13. Sequence comparison on a cluster of workstations using the PVM system

    SciTech Connect

    Guan, X.; Mural, R.J.; Uberbacher, E.C.

    1995-02-01

    We have implemented a distributed sequence comparison algorithm on a cluster of workstations using the PVM paradigm. This implementation has achieved performance similar to that of the Intel iPSC/860 Hypercube, a massively parallel computer. The distributed sequence comparison algorithm serves as a search tool for two Internet servers, GRAIL and GENQUEST. This paper describes the implementation and the performance of the algorithm.

  14. Genome sequence comparison of two United States live attenuated vaccines of infectious laryngotracheitis virus (ILTV).

    PubMed

    Chandra, Yohanna Gita; Lee, Jeongyoon; Kong, Byung-Whi

    2012-06-01

    This study was conducted to identify unique nucleotide differences in two U.S. chicken embryo origin (CEO) vaccines [LT Blen (GenBank accession: JQ083493) designated as vaccine 1; Laryngo-Vac® (GenBank accession: JQ083494) designated as vaccine 2] of infectious laryngotracheitis virus (ILTV) genomes compared to an Australian Serva vaccine reference ILTV genome sequence [Gallid herpesvirus 1 (GaHV-1); GenBank accession number: HQ630064]. Genomes of the two vaccine ILTV strains were sequenced on an Illumina Genome Analyzer 2X with 36-cycle single-end reads. Few nucleotide differences (23 in vaccine 1; 31 in vaccine 2) were found, indicating that the US CEO strains are practically identical to the Australian Serva CEO strain, which is a European-origin vaccine. The sequence differences demonstrated the spectrum of variability among vaccine strains. Only eight amino acid differences were found in ILTV proteins including UL54, UL27, UL28, UL20, UL1, ICP4, and US8 in vaccine 1. Similarly, in vaccine 2, eight amino acid differences were found in UL54, UL27, UL28, UL36, UL1, ICP4, US10, and US8. Further comparison of US CEO vaccines to several ILTV genome sequences revealed that US CEO vaccines are genetically close to both the Serva vaccine and 63140/C/08/BR (GenBank accession: HM188407) and are distinct from the two Australian-origin CEO vaccines, SA2 (GenBank accession: JN596962) and A20 (GenBank accession: JN596963), which showed close similarity to each other. These data demonstrate the potential of high-throughput sequencing technology to yield insight into the sequence variation of different ILTV strains. This information can be used to discriminate between vaccine ILTV strains and, further, to identify newly emerging mutant strains of field isolates. PMID:22382591

  15. Sequence-dependent Prion Protein Misfolding and Neurotoxicity

    PubMed Central

    Fernandez-Funez, Pedro; Zhang, Yan; Casas-Tinto, Sergio; Xiao, Xiangzhu; Zou, Wen-Quan; Rincon-Limas, Diego E.

    2010-01-01

    Prion diseases are neurodegenerative disorders caused by misfolding of the normal prion protein (PrP) into a pathogenic “scrapie” conformation. To better understand the cellular and molecular mechanisms that govern the conformational changes (conversion) of PrP, we compared the dynamics of PrP from mammals susceptible (hamster and mouse) and resistant (rabbit) to prion diseases in transgenic flies. We recently showed that hamster PrP induces spongiform degeneration and accumulates into highly aggregated, scrapie-like conformers in transgenic flies. We show now that rabbit PrP does not induce spongiform degeneration and does not convert into scrapie-like conformers. Surprisingly, mouse PrP induces weak neurodegeneration and accumulates small amounts of scrapie-like conformers. Thus, the expression of three highly conserved mammalian prion proteins in transgenic flies uncovered prominent differences in their conformational dynamics. How these properties are encoded in the amino acid sequence remains to be elucidated. PMID:20817727

  16. No Genome-Wide Protein Sequence Convergence for Echolocation

    PubMed Central

    Zou, Zhengting; Zhang, Jianzhi

    2015-01-01

    Toothed whales and two groups of bats independently acquired echolocation, the ability to locate and identify objects by reflected sound. Echolocation requires physiologically complex and coordinated vocal, auditory, and neural functions, but the molecular basis of the capacity for echolocation is not well understood. A recent study suggested that convergent amino acid substitutions widespread in the proteins of echolocators underlay the convergent origins of mammalian echolocation. Here, we show that genomic signatures of molecular convergence between echolocating lineages are generally no stronger than those between echolocating and comparable nonecholocating lineages. The same is true for the group of 29 hearing-related proteins claimed to be enriched with molecular convergence. Reexamining the previous selection test reveals several flaws and invalidates the asserted evidence for adaptive convergence. Together, these findings indicate that the reported genomic signatures of convergence largely reflect the background level of sequence convergence unrelated to the origins of echolocation. PMID:25631925

  17. No genome-wide protein sequence convergence for echolocation.

    PubMed

    Zou, Zhengting; Zhang, Jianzhi

    2015-05-01

    Toothed whales and two groups of bats independently acquired echolocation, the ability to locate and identify objects by reflected sound. Echolocation requires physiologically complex and coordinated vocal, auditory, and neural functions, but the molecular basis of the capacity for echolocation is not well understood. A recent study suggested that convergent amino acid substitutions widespread in the proteins of echolocators underlay the convergent origins of mammalian echolocation. Here, we show that genomic signatures of molecular convergence between echolocating lineages are generally no stronger than those between echolocating and comparable nonecholocating lineages. The same is true for the group of 29 hearing-related proteins claimed to be enriched with molecular convergence. Reexamining the previous selection test reveals several flaws and invalidates the asserted evidence for adaptive convergence. Together, these findings indicate that the reported genomic signatures of convergence largely reflect the background level of sequence convergence unrelated to the origins of echolocation. PMID:25631925

  18. Alignment-Free Sequence Comparison Based on Next-Generation Sequencing Reads

    PubMed Central

    Song, Kai; Ren, Jie; Zhai, Zhiyuan; Liu, Xuemei

    2013-01-01

    Abstract Next-generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, $D_2$, $D_2^{*}$, and $D_2^{S}$, both theoretically and by simulations. Theoretical formulas for the power of detecting the relationship between two sequences related through a common motif model are derived. It is shown that both $D_2^{*}$ and
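
    As a rough illustration of the simplest of the three statistics named above, the plain D2 statistic is commonly defined as the sum, over all k-mers w, of the product of the number of times w occurs in each sequence. The sketch below assumes two whole sequences rather than shotgun reads; the function names `kmer_counts` and `d2` are hypothetical, and the centered and normalized variants are not implemented here.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x: str, y: str, k: int) -> int:
    """Plain D2: sum over k-mers of (count in x) * (count in y).

    Illustrative only: it treats x and y as complete sequences,
    not as collections of sequencing reads.
    """
    counts_x = kmer_counts(x, k)
    counts_y = kmer_counts(y, k)
    # A Counter returns 0 for k-mers it has not seen, so k-mers
    # present in only one sequence contribute nothing.
    return sum(n * counts_y[word] for word, n in counts_x.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGTTTGT", k=3))
```

    A larger value means the two sequences share more k-mers; the choice of k is a free parameter in this sketch.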

  19. Transitive Homology-Guided Structural Studies Lead to Discovery of Cro Proteins With 40% Sequence Identity But Different Folds

    SciTech Connect

    Roessler, C.G.; Hall, B.M.; Anderson, W.J.; Ingram, W.M.; Roberts, S.A.; Montfort, W.R.; Cordes, M.H.J.

    2009-05-27

    Proteins that share common ancestry may differ in structure and function because of divergent evolution of their amino acid sequences. For a typical diverse protein superfamily, the properties of a few scattered members are known from experiment. A satisfying picture of functional and structural evolution in relation to sequence changes, however, may require characterization of a larger, well chosen subset. Here, we employ a 'stepping-stone' method, based on transitive homology, to target sequences intermediate between two related proteins with known divergent properties. We apply the approach to the question of how new protein folds can evolve from preexisting folds and, in particular, to an evolutionary change in secondary structure and oligomeric state in the Cro family of bacteriophage transcription factors, initially identified by sequence-structure comparison of distant homologs from phages P22 and λ. We report crystal structures of two Cro proteins, Xfaso 1 and Pfl 6, with sequences intermediate between those of P22 and λ. The domains show 40% sequence identity but differ by switching of α-helix to β-sheet in a C-terminal region spanning ~25 residues. Sedimentation analysis also suggests a correlation between helix-to-sheet conversion and strengthened dimerization.

  20. Gleditsia sinensis: Transcriptome Sequencing, Construction, and Application of Its Protein-Protein Interaction Network

    PubMed Central

    Zhu, Liucun; Zhang, Ying; Guo, Wenna; Wang, Qiang

    2014-01-01

    Gleditsia sinensis is a species of deciduous tree in the subfamily Caesalpinioideae, native to China, and is of great economic importance. However, despite its economic value, gene sequence information is strongly lacking. In the present study, transcriptome sequencing of G. sinensis was performed, resulting in approximately 75.5 million clean reads assembled into 142155 unique transcripts generating 58583 unigenes. The average length of the unigenes was 900 bp, with an N50 of 549 bp. The obtained unigene sequences were then compared to four protein databases: NCBI nonredundant protein (NRDB), Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Cluster of Orthologous Groups (COG). Using BLAST, 31385 unigenes (53.6%) were assigned functional annotations. Additionally, sequence homologies between identified unigenes and genes of known species in a protein-protein interaction (PPI) network facilitated G. sinensis PPI network construction. Based on this network construction, new stress resistance genes (including cold, drought, and high salinity) were predicted. The present study is the first investigation of genome-wide gene expression in G. sinensis, with the results providing a basis for future functional genomic studies relating to this species. PMID:24982878

  1. A Parallel Non-Alignment Based Approach to Efficient Sequence Comparison using Longest Common Subsequences

    NASA Astrophysics Data System (ADS)

    Bhowmick, S.; Shafiullah, M.; Rai, H.; Bastola, D.

    2010-11-01

    Biological sequence comparison programs have revolutionized the practice of biochemistry, and molecular and evolutionary biology. Pairwise comparison of genomic sequences is a popular method of choice for analyzing genetic sequence data. However, the quality of results from most sequence comparison methods is significantly affected by small perturbations in the data and, furthermore, there is a dearth of computational tools to compare sequences beyond a certain length. In this paper, we describe a parallel algorithm for comparing genetic sequences using an alignment-free method based on computing the Longest Common Subsequence (LCS) between genetic sequences. We validate the quality of our results by comparing the phylogenetic trees obtained from ClustalW and LCS. We also show through complexity analysis of the isoefficiency and by empirical measurement of the running time that our algorithm is very scalable.
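
    The LCS computation at the core of this record can be sketched with the textbook dynamic program. This is a serial version for illustration only, not the parallel algorithm the abstract describes, and the normalisation in `lcs_distance` is a hypothetical choice, since the abstract does not say how LCS lengths are turned into distances for tree building.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic O(len(a) * len(b)) dynamic program that keeps only
    one previous row, so memory is O(min(len(a), len(b))).
    """
    if len(b) > len(a):
        a, b = b, a  # keep the shorter sequence as the DP row
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a: str, b: str) -> float:
    """Hypothetical normalisation: 1 - LCS / length of the longer input."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else 1.0 - lcs_length(a, b) / longest


if __name__ == "__main__":
    print(lcs_length("AGGTAB", "GXTXAYB"), lcs_distance("ACGT", "ACGA"))
```

    A matrix of such pairwise distances could then be passed to a standard tree-building method, which is one plausible reading of how the trees mentioned in the abstract were obtained.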

  2. Comparison of Whole-Genome Sequences from Two Colony Morphovars of Burkholderia pseudomallei

    PubMed Central

    Hsueh, Pei-Tan; Chen, Yao-Shen; Lin, Hsi-Hsu; Liu, Pei-Ju; Ni, Wen-Fan; Liu, Mei-Chun

    2015-01-01

    The entire genomes of two isogenic morphovars (vgh16W and vgh16R) of Burkholderia pseudomallei were sequenced. A comparison of the sequences from both strains indicates that they show 99.99% identity, are composed of 22 tandem repeated sequences with <100 bp of indels, and have 199 single-base variants. PMID:26472836

  3. Nucleotide and derived amino acid sequences of the major porin of Comamonas acidovorans and comparison of porin primary structures.

    PubMed Central

    Gerbl-Rieger, S; Peters, J; Kellermann, J; Lottspeich, F; Baumeister, W

    1991-01-01

    The DNA sequence of the gene which codes for the major outer membrane porin (Omp32) of Comamonas acidovorans has been determined. The structural gene encodes a precursor consisting of 351 amino acid residues with a signal peptide of 19 amino acid residues. Comparisons with amino acid sequences of outer membrane proteins and porins from several other members of the class Proteobacteria and of the Chlamydia trachomatis porin and the Neurospora crassa mitochondrial porin revealed a motif of eight regions of local homology. The results of this analysis are discussed with regard to common structural features of porins. PMID:1848840

  4. The amino acid sequence of protein CM-3 from Dendroaspis polylepis polylepis (black mamba) venom.

    PubMed

    Joubert, F J

    1985-01-01

    Protein CM-3 from Dendroaspis polylepis polylepis venom was purified by gel filtration and ion exchange chromatography. It comprises 65 amino acids including eight half-cystines. The complete amino acid sequence of protein CM-3 has been elucidated. The sequence (residues 1-50) resembles that of the N-terminal sequence of the subunits of a synergistic type protein and residues 51-65 that of the C-terminal sequence of an angusticeps type protein. Mixtures of protein CM-3 and angusticeps type proteins showed no apparent synergistic effect, in that their toxicity in combination was no greater than the sum of their individual toxicities. PMID:4029488

  5. Protein multiple sequence alignment by hybrid bio-inspired algorithms.

    PubMed

    Cutello, Vincenzo; Nicosia, Giuseppe; Pavone, Mario; Prizzi, Igor

    2011-03-01

    This article presents an immune-inspired algorithm to tackle the Multiple Sequence Alignment (MSA) problem. MSA is one of the most important tasks in biological sequence analysis. Although this paper focuses on protein alignments, most of the discussion and methodology may also be applied to DNA alignments. The problem of finding the multiple alignment was investigated in the studies by Bonizzoni and Vedova and by Wang and Jiang, and proved to be an NP-hard (non-deterministic polynomial-time hard) problem. The presented algorithm, called Immunological Multiple Sequence Alignment Algorithm (IMSA), incorporates two new strategies to create the initial population and specific ad hoc mutation operators. It is based on the 'weighted sum of pairs' as objective function, to evaluate a given candidate alignment. IMSA was tested using the classical benchmarks of BAliBASE (versions 1.0, 2.0 and 3.0), and experimental results indicate that it is comparable with state-of-the-art multiple alignment algorithms, in terms of quality of alignments, weighted Sums-of-Pairs (SP) and Column Score (CS) values. The main novelty of IMSA is its ability to generate more than a single suboptimal alignment for every MSA instance; this behaviour is due to the stochastic nature of the algorithm and of the populations evolved during the convergence process. This feature will help the decision maker to assess and select a biologically relevant multiple sequence alignment. Finally, the designed algorithm can be used as a local search procedure to properly explore promising alignments of the search space. PMID:21071394
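
    To make the 'weighted sum of pairs' objective concrete, the sketch below scores a candidate alignment by summing a column-wise score over every pair of rows, with an optional weight per pair. The match, mismatch and gap values and the function names are assumptions chosen for illustration; the abstract does not state the substitution matrix, gap model or weighting scheme that IMSA uses.

```python
from itertools import combinations


def pair_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Score one aligned column pair (toy scheme, not a real matrix)."""
    if a == "-" and b == "-":
        return 0  # gap against gap is conventionally ignored
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def weighted_sum_of_pairs(alignment, weights=None) -> float:
    """Sum pairwise column scores over all row pairs (i < j).

    `weights` maps (i, j) to a pair weight; missing pairs default to 1.0.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    weights = weights or {}
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        column_sum = sum(pair_score(a, b)
                         for a, b in zip(alignment[i], alignment[j]))
        total += weights.get((i, j), 1.0) * column_sum
    return total


if __name__ == "__main__":
    print(weighted_sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```

    A search procedure such as the one described in this record would call an objective of this kind on each candidate alignment and prefer higher scores.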

  6. A novel approach to sequence validating protein expression clones with automated decision making

    PubMed Central

    Taycher, Elena; Rolfs, Andreas; Hu, Yanhui; Zuo, Dongmei; Mohr, Stephanie E; Williamson, Janice; LaBaer, Joshua

    2007-01-01

    Background Whereas the molecular assembly of protein expression clones is readily automated and routinely accomplished in high throughput, sequence verification of these clones is still largely performed manually, an arduous and time consuming process. The ultimate goal of validation is to determine if a given plasmid clone matches its reference sequence sufficiently to be "acceptable" for use in protein expression experiments. Given the accelerating increase in availability of tens of thousands of unverified clones, there is a strong demand for rapid, efficient and accurate software that automates clone validation. Results We have developed an Automated Clone Evaluation (ACE) system – the first comprehensive, multi-platform, web-based plasmid sequence verification software package. ACE automates the clone verification process by defining each clone sequence as a list of multidimensional discrepancy objects, each describing a difference between the clone and its expected sequence including the resulting polypeptide consequences. To evaluate clones automatically, this list can be compared against user acceptance criteria that specify the allowable number of discrepancies of each type. This strategy allows users to re-evaluate the same set of clones against different acceptance criteria as needed for use in other experiments. ACE manages the entire sequence validation process including contig management, identifying and annotating discrepancies, determining if discrepancies correspond to polymorphisms and clone finishing. Designed to manage thousands of clones simultaneously, ACE maintains a relational database to store information about clones at various completion stages, project processing parameters and acceptance criteria. In a direct comparison, the automated analysis by ACE took less time and was more accurate than a manual analysis of a 93 gene clone set. Conclusion ACE was designed to facilitate high throughput clone sequence verification projects. The

  7. Detecting protein candidate fragments using a structural alphabet profile comparison approach.

    PubMed

    Shen, Yimin; Picord, Géraldine; Guyon, Frédéric; Tuffery, Pierre

    2013-01-01

    Predicting accurate fragments from sequence has recently become a critical step for protein structure modeling, as protein fragment assembly techniques are presently among the most efficient approaches for de novo prediction. A key step in these approaches is, given the sequence of a protein to model, the identification of relevant fragments - candidate fragments - from a collection of the available 3D structures. These fragments can then be assembled to produce a model of the complete structure of the protein of interest. The search for candidate fragments is classically achieved by considering local sequence similarity using profile comparison, or threading approaches. In the present study, we introduce a new profile comparison approach that, instead of using amino acid profiles, is based on the use of predicted structural alphabet profiles, where structural alphabet profiles contain information related to the 3D local shapes associated with the sequences. We show that structural alphabet profile-profile comparison can be used efficiently to retrieve accurate structural fragments, and we introduce a fully new protocol for the detection of candidate fragments. It identifies fragments specific of each position of the sequence and of size varying between 6 and 27 amino-acids. We find it outperforms present state of the art approaches in terms (i) of the accuracy of the fragments identified, (ii) the rate of true positives identified, while having a high coverage score. We illustrate the relevance of the approach on complete target sets of the two previous Critical Assessment of Techniques for Protein Structure Prediction (CASP) rounds 9 and 10. A web server for the approach is freely available at http://bioserv.rpbs.univ-paris-diderot.fr/SAFrag. PMID:24303019

  8. Detecting Protein Candidate Fragments Using a Structural Alphabet Profile Comparison Approach

    PubMed Central

    Shen, Yimin; Picord, Géraldine; Guyon, Frédéric; Tuffery, Pierre

    2013-01-01

    Predicting accurate fragments from sequence has recently become a critical step for protein structure modeling, as protein fragment assembly techniques are presently among the most efficient approaches for de novo prediction. A key step in these approaches is, given the sequence of a protein to model, the identification of relevant fragments - candidate fragments - from a collection of the available 3D structures. These fragments can then be assembled to produce a model of the complete structure of the protein of interest. The search for candidate fragments is classically achieved by considering local sequence similarity using profile comparison, or threading approaches. In the present study, we introduce a new profile comparison approach that, instead of using amino acid profiles, is based on the use of predicted structural alphabet profiles, where structural alphabet profiles contain information related to the 3D local shapes associated with the sequences. We show that structural alphabet profile-profile comparison can be used efficiently to retrieve accurate structural fragments, and we introduce a fully new protocol for the detection of candidate fragments. It identifies fragments specific of each position of the sequence and of size varying between 6 and 27 amino-acids. We find it outperforms present state of the art approaches in terms (i) of the accuracy of the fragments identified, (ii) the rate of true positives identified, while having a high coverage score. We illustrate the relevance of the approach on complete target sets of the two previous Critical Assessment of Techniques for Protein Structure Prediction (CASP) rounds 9 and 10. A web server for the approach is freely available at http://bioserv.rpbs.univ-paris-diderot.fr/SAFrag. PMID:24303019

  9. Identification of a 35-kilodalton serovar-cross-reactive flagellar protein, FlaB, from Leptospira interrogans by N-terminal sequencing, gene cloning, and sequence analysis.

    PubMed Central

    Lin, M; Surujballi, O; Nielsen, K; Nadin-Davis, S; Randall, G

    1997-01-01

    During the screening of antibodies to pathogenic leptospires, a murine monoclonal antibody (designated M138) was found to react with various serovars. An antigen of approximately 35 kDa from Leptospira interrogans serovar pomona, which reacted strongly with M138, was characterized by N-terminal amino acid sequencing and identified as a flagellin, a class B polypeptide subunit (FlaB) of the periplasmic flagella. The gene encoding the FlaB protein, flaB, was amplified from the genomic DNA of several pathogenic serovars by PCR with a single pair of oligonucleotide primers, suggesting that FlaB is highly conserved among these serovars. Cloning and sequence analysis of flaB from serovar pomona revealed that it contains an 849-bp open reading frame with a G + C content of 46.88% which encodes a 283-amino-acid protein with a calculated molecular mass of 31.297 kDa and a predicted pI of 9.065. A sequence comparison of flagellin proteins revealed that the amino acid sequence is most variable in the central portion of the serovar pomona FlaB, which is believed to contain specific sequence information and which may thus be useful in the design of DNA or synthetic peptide probes suitable for the detection of infection with pathogenic leptospires. PMID:9317049
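
    The reported figures can be cross-checked with a few lines of arithmetic. The sketch below is an illustration only: `gc_content_percent` is a hypothetical helper, and the count of G + C bases is inferred from the reported percentage rather than stated in the abstract. The check also assumes the 849 bp are all coding triplets; the abstract does not say whether a stop codon is counted.

```python
def gc_content_percent(dna):
    """Return the G + C content of a DNA string as a percentage."""
    dna = dna.upper()
    gc = sum(1 for base in dna if base in "GC")
    return 100.0 * gc / len(dna)

# Values reported in the abstract.
orf_length_bp = 849
reported_gc_percent = 46.88

# An ORF is read in triplets, so its length should be a multiple of 3.
codons = orf_length_bp // 3

# Number of G + C bases implied by the reported percentage (inferred, not reported).
implied_gc_bases = round(reported_gc_percent / 100 * orf_length_bp)

print(codons, implied_gc_bases)
print(gc_content_percent("GGCCAATT"))  # toy sequence, not the flaB gene
```

    Under those assumptions, 849 bp corresponds to 283 codons, which agrees with the reported 283-amino-acid protein.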

  10. Comparison of seven techniques for typing international epidemic strains of Clostridium difficile: restriction endonuclease analysis, pulsed-field gel electrophoresis, PCR-ribotyping, multilocus sequence typing, multilocus variable-number tandem-repeat analysis, amplified fragment length polymorphism, and surface layer protein A gene sequence typing.

    PubMed

    Killgore, George; Thompson, Angela; Johnson, Stuart; Brazier, Jon; Kuijper, Ed; Pepin, Jacques; Frost, Eric H; Savelkoul, Paul; Nicholson, Brad; van den Berg, Renate J; Kato, Haru; Sambol, Susan P; Zukowski, Walter; Woods, Christopher; Limbago, Brandi; Gerding, Dale N; McDonald, L Clifford

    2008-02-01

    Using 42 isolates contributed by laboratories in Canada, The Netherlands, the United Kingdom, and the United States, we compared the results of analyses done with seven Clostridium difficile typing techniques: multilocus variable-number tandem-repeat analysis (MLVA), amplified fragment length polymorphism (AFLP), surface layer protein A gene sequence typing (slpAST), PCR-ribotyping, restriction endonuclease analysis (REA), multilocus sequence typing (MLST), and pulsed-field gel electrophoresis (PFGE). We assessed the discriminating ability and typeability of each technique as well as the agreement among techniques in grouping isolates by allele profile A (AP-A) through AP-F, which are defined by toxinotype, the presence of the binary toxin gene, and deletion in the tcdC gene. We found that all isolates were typeable by all techniques and that discrimination index scores for the techniques tested ranged from 0.964 to 0.631 in the following order: MLVA, REA, PFGE, slpAST, PCR-ribotyping, MLST, and AFLP. All the techniques were able to distinguish the current epidemic strain of C. difficile (BI/027/NAP1) from other strains. All of the techniques showed multiple types for AP-A (toxinotype 0, binary toxin negative, and no tcdC gene deletion). REA, slpAST, MLST, and PCR-ribotyping all included AP-B (toxinotype III, binary toxin positive, and an 18-bp deletion in tcdC) in a single group that excluded other APs. PFGE, AFLP, and MLVA grouped two, one, and two different non-AP-B isolates, respectively, with their AP-B isolates. All techniques appear to be capable of detecting outbreak strains, but only REA and MLVA showed sufficient discrimination to distinguish strains from different outbreaks. PMID:18039796
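
    The abstract does not state how the discrimination index was calculated. A common choice for typing methods is Simpson's index of diversity, D = 1 - Σ n_j(n_j - 1) / (N(N - 1)), where N is the number of isolates and n_j the number assigned to type j. The sketch below assumes that formula; the function name and the example type labels are hypothetical and are not data from the study.

```python
from collections import Counter

def discrimination_index(type_labels):
    """Simpson's index of diversity for a list of type assignments.

    Returns the probability that two isolates drawn at random
    (without replacement) are assigned to different types.
    """
    n_total = len(type_labels)
    if n_total < 2:
        raise ValueError("need at least two isolates")
    counts = Counter(type_labels)
    # Number of ordered pairs of isolates that share a type.
    same_type_pairs = sum(n * (n - 1) for n in counts.values())
    return 1.0 - same_type_pairs / (n_total * (n_total - 1))

# Hypothetical type assignments for six isolates.
example_types = ["A", "A", "B", "C", "C", "C"]
print(round(discrimination_index(example_types), 3))
```

    An index of 1.0 means every isolate received its own type, and 0.0 means all isolates were grouped together.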

  11. Design of Protein Multi-specificity Using an Independent Sequence Search Reduces the Barrier to Low Energy Sequences.

    PubMed

    Sevy, Alexander M; Jacobs, Tim M; Crowe, James E; Meiler, Jens

    2015-07-01

    Computational protein design has found great success in engineering proteins for thermodynamic stability, binding specificity, or enzymatic activity in a 'single state' design (SSD) paradigm. Multi-specificity design (MSD), on the other hand, involves considering the stability of multiple protein states simultaneously. We have developed a novel MSD algorithm, which we refer to as REstrained CONvergence in multi-specificity design (RECON). The algorithm allows each state to adopt its own sequence throughout the design process rather than enforcing a single sequence on all states. Convergence to a single sequence is encouraged through an incrementally increasing convergence restraint for corresponding positions. Compared to MSD algorithms that enforce (constrain) an identical sequence on all states, the energy landscape is simplified, which accelerates the search drastically. As a result, RECON can readily be used in simulations with a flexible protein backbone. We have benchmarked RECON on two design tasks. First, we designed antibodies derived from a common germline gene against their diverse targets to assess recovery of the germline, polyspecific sequence. Second, we designed "promiscuous", polyspecific proteins against all binding partners and measured recovery of the native sequence. We show that RECON is able to efficiently recover native-like, biologically relevant sequences in this diverse set of protein complexes. PMID:26147100

  12. Sequence analysis and expression of the M1 and M2 matrix protein genes of hirame rhabdovirus (HIRRV)

    USGS Publications Warehouse

    Nishizawa, T.; Kurath, G.; Winton, J.R.

    1997-01-01

    We have cloned and sequenced a 2318 nucleotide region of the genomic RNA of hirame rhabdovirus (HIRRV), an important viral pathogen of Japanese flounder Paralichthys olivaceus. This region comprises approximately two-thirds of the 3' end of the nucleocapsid protein (N) gene and the complete matrix protein (M1 and M2) genes with the associated intergenic regions. The partial N gene sequence was 812 nucleotides in length with an open reading frame (ORF) that encoded the carboxyl-terminal 250 amino acids of the N protein. The M1 and M2 genes were 771 and 700 nucleotides in length, respectively, with ORFs encoding proteins of 227 and 193 amino acids. The M1 gene sequence contained an additional small ORF that could encode a highly basic, arginine-rich protein of 25 amino acids. Comparisons of the N, M1, and M2 gene sequences of HIRRV with the corresponding sequences of the fish rhabdoviruses, infectious hematopoietic necrosis virus (IHNV) or viral hemorrhagic septicemia virus (VHSV) indicated that HIRRV was more closely related to IHNV than to VHSV, but was clearly distinct from either. The putative consensus gene termination sequence for IHNV and VHSV, AGAYAG(A)(7), was present in the N-M1, M1-M2, and M2-G intergenic regions of HIRRV as were the putative transcription initiation sequences YGGCAC and AACA. An Escherichia coli expression system was used to produce recombinant proteins from the M1 and M2 genes of HIRRV. These were the same size as the authentic M1 and M2 proteins and reacted with anti-HIRRV rabbit serum in western blots. These reagents can be used for further study of the fish immune response and to test novel control methods.

  13. Sequence-Specific Protein Aggregation Generates Defined Protein Knockdowns in Plants

    PubMed Central

    Vuylsteke, Marnik; Aesaert, Stijn; Rombaut, Debbie; De Smet, Frederik; Xu, Jie; Van Lijsebettens, Mieke; Rousseau, Frederic

    2016-01-01

    Protein aggregation is determined by short (5–15 amino acids) aggregation-prone regions (APRs) of the polypeptide sequence that self-associate in a specific manner to form β-structured inclusions. Here, we demonstrate that the sequence specificity of APRs can be exploited to selectively knock down proteins with different localization and function in plants. Synthetic aggregation-prone peptides derived from the APRs of either the negative regulators of the brassinosteroid (BR) signaling, the glycogen synthase kinase 3/Arabidopsis SHAGGY-like kinases (GSK3/ASKs), or the starch-degrading enzyme α-glucan water dikinase were designed. Stable expression of the APRs in Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) induced aggregation of the target proteins, giving rise to plants displaying constitutive BR responses and increased starch content, respectively. Overall, we show that the sequence specificity of APRs can be harnessed to generate aggregation-associated phenotypes in a targeted manner in different subcellular compartments. This study points toward the potential application of induced targeted aggregation as a useful tool to knock down protein functions in plants and, especially, to generate beneficial traits in crops. PMID:27208282

  14. Sequence-Specific Protein Aggregation Generates Defined Protein Knockdowns in Plants.

    PubMed

    Betti, Camilla; Vanhoutte, Isabelle; Coutuer, Silvie; De Rycke, Riet; Mishev, Kiril; Vuylsteke, Marnik; Aesaert, Stijn; Rombaut, Debbie; Gallardo, Rodrigo; De Smet, Frederik; Xu, Jie; Van Lijsebettens, Mieke; Van Breusegem, Frank; Inzé, Dirk; Rousseau, Frederic; Schymkowitz, Joost; Russinova, Eugenia

    2016-06-01

    Protein aggregation is determined by short (5-15 amino acids) aggregation-prone regions (APRs) of the polypeptide sequence that self-associate in a specific manner to form β-structured inclusions. Here, we demonstrate that the sequence specificity of APRs can be exploited to selectively knock down proteins with different localization and function in plants. Synthetic aggregation-prone peptides derived from the APRs of either the negative regulators of the brassinosteroid (BR) signaling, the glycogen synthase kinase 3/Arabidopsis SHAGGY-like kinases (GSK3/ASKs), or the starch-degrading enzyme α-glucan water dikinase were designed. Stable expression of the APRs in Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) induced aggregation of the target proteins, giving rise to plants displaying constitutive BR responses and increased starch content, respectively. Overall, we show that the sequence specificity of APRs can be harnessed to generate aggregation-associated phenotypes in a targeted manner in different subcellular compartments. This study points toward the potential application of induced targeted aggregation as a useful tool to knock down protein functions in plants and, especially, to generate beneficial traits in crops. PMID:27208282

  15. Direct Chloroplast Sequencing: Comparison of Sequencing Platforms and Analysis Tools for Whole Chloroplast Barcoding

    PubMed Central

    Brozynska, Marta; Furtado, Agnelo; Henry, Robert James

    2014-01-01

    Direct sequencing of total plant DNA using next generation sequencing technologies generates a whole chloroplast genome sequence that has the potential to provide a barcode for use in plant and food identification. Advances in DNA sequencing platforms may make this an attractive approach for routine plant identification. The HiSeq (Illumina) and Ion Torrent (Life Technology) sequencing platforms were used to sequence total DNA from rice to identify polymorphisms in the whole chloroplast genome sequence of a wild rice plant relative to cultivated rice (cv. Nipponbare). Consensus chloroplast sequences were produced by mapping sequence reads to the reference rice chloroplast genome or by de novo assembly and mapping of the resulting contigs to the reference sequence. A total of 122 polymorphisms (SNPs and indels) between the wild and cultivated rice chloroplasts were predicted by these different sequencing and analysis methods. Of these, a total of 102 polymorphisms including 90 SNPs were predicted by both platforms. Indels were more variable with different sequencing methods, with almost all discrepancies found in homopolymers. The Ion Torrent platform gave no apparent false SNP but was less reliable for indels. The methods should be suitable for routine barcoding using appropriate combinations of sequencing platform and data analysis. PMID:25329378

  16. Accurate prediction of protein–protein interactions from sequence alignments using a Bayesian method

    PubMed Central

    Burger, Lukas; van Nimwegen, Erik

    2008-01-01

    Accurate and large-scale prediction of protein–protein interactions directly from amino-acid sequences is one of the great challenges in computational biology. Here we present a new Bayesian network method that predicts interaction partners using only multiple alignments of amino-acid sequences of interacting protein domains, without tunable parameters, and without the need for any training examples. We first apply the method to bacterial two-component systems and comprehensively reconstruct two-component signaling networks across all sequenced bacteria. Comparisons of our predictions with known interactions show that our method infers interaction partners genome-wide with high accuracy. To demonstrate the general applicability of our method we show that it also accurately predicts interaction partners in a recent dataset of polyketide synthases. Analysis of the predicted genome-wide two-component signaling networks shows that cognates (interacting kinase/regulator pairs, which lie adjacent on the genome) and orphans (which lie isolated) form two relatively independent components of the signaling network in each genome. In addition, while most genes are predicted to have only a small number of interaction partners, we find that 10% of orphans form a separate class of ‘hub' nodes that distribute and integrate signals to and from up to tens of different interaction partners. PMID:18277381

  17. Next-Generation Sequencing for Binary Protein–Protein Interactions

    PubMed Central

    Suter, Bernhard; Zhang, Xinmin; Pesce, C. Gustavo; Mendelsohn, Andrew R.; Dinesh-Kumar, Savithramma P.; Mao, Jian-Hua

    2015-01-01

    The yeast two-hybrid (Y2H) system exploits host cell genetics in order to display binary protein–protein interactions (PPIs) via defined and selectable phenotypes. Numerous improvements have been made to this method, adapting the screening principle for diverse applications, including drug discovery and the scale-up for proteome wide interaction screens in human and other organisms. Here we discuss a systematic workflow and analysis scheme for screening data generated by Y2H and related assays that includes high-throughput selection procedures, readout of comprehensive results via next-generation sequencing (NGS), and the interpretation of interaction data via quantitative statistics. The novel assays and tools will serve the broader scientific community to harness the power of NGS technology to address PPI networks in health and disease. We discuss examples of how this next-generation platform can be applied to address specific questions in diverse fields of biology and medicine. PMID:26734059

  18. Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines.

    PubMed

    You, Zhu-Hong; Li, Jianqiang; Gao, Xin; He, Zhou; Zhu, Lin; Lei, Ying-Ke; Ji, Zhiwei

    2015-01-01

    Proteins and their interactions lie at the heart of most biological processes. Consequently, correct detection of protein-protein interactions (PPIs) is of fundamental importance for understanding the molecular mechanisms in biological systems. Although advances in high-throughput experimental technology make it possible to detect a large number of PPIs, the data generated by these methods are unreliable and may not include all possible PPIs. To address this problem, this study develops a novel computational approach to detect protein interactions effectively. The approach combines a novel matrix-based representation of protein sequence with a support vector machine (SVM), and fully considers the sequence order and dipeptide information of the protein primary sequence. When applied to yeast PPI datasets, the proposed method reaches 90.06% prediction accuracy with 94.37% specificity at a sensitivity of 85.74%, indicating that this predictor is a useful tool for predicting PPIs. The results also demonstrate that our approach can be a helpful supplement to the interactions that have been detected experimentally. PMID:26000305
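
    The three reported figures are related through the confusion matrix. The sketch below shows the standard definitions and a consistency check. The check assumes a balanced test set (equal numbers of interacting and non-interacting pairs), which the abstract does not state; the function name and the toy counts are hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of true interactions recovered
    specificity = tn / (tn + fp)   # fraction of non-interactions correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Toy counts, not data from the study.
print(classification_metrics(tp=8, fn=2, tn=9, fp=1))

# With equal numbers of positives and negatives (an assumption), accuracy is
# the mean of sensitivity and specificity.
reported_sensitivity = 85.74
reported_specificity = 94.37
balanced_accuracy = (reported_sensitivity + reported_specificity) / 2
print(balanced_accuracy)
```

    Under the balanced-set assumption the mean comes to about 90.06%, in line with the reported accuracy.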

  19. Species-specific protein sequence and fold optimizations

    PubMed Central

    Dumontier, Michel; Michalickova, Katerina; Hogue, Christopher WV

    2002-01-01

    Background: An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes. Results: Environmental niche was determined to be a significant factor in variability from correspondence analysis using the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes. Additionally, we found clusters of phylogenetically unrelated archaea and bacteria that share similar environments by amino acid composition clustering. Composition analyses of conservative, domain-based homology modeling suggested an enrichment of small hydrophobic residues Ala, Gly, Val and charged residues Asp, Glu, His and Arg across all genomes. However, larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 ± 8% whereas the CG detected 73 ± 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available at . Conclusion: Our analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events. PMID:12487631
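
    The abstract does not give the form of the CG and CF scoring functions. As an illustration of the general idea, the sketch below scores a sequence by summing, over its residues, the log ratio of a residue's frequency in a target genome to its frequency in a background set. The function names, the pseudocount, and the toy sequences are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequences, pseudocount=1.0):
    """Estimate amino acid frequencies from sequences, with a pseudocount."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def log_odds_score(sequence, target_freqs, background_freqs):
    """Sum of log(target frequency / background frequency) over residues."""
    return sum(
        math.log(target_freqs[a] / background_freqs[a])
        for a in sequence
        if a in AMINO_ACIDS
    )

# Toy "genomes": one rich in small residues, one rich in aromatics.
target = composition(["AAAAGGGV"])
background = composition(["WWWFFYYY"])
print(log_odds_score("AAG", target, background) > 0)
print(log_odds_score("WWF", target, background) < 0)
```

    A positive score means the sequence composition resembles the target genome more than the background; assigning a sequence to the genome with the highest score is one way such functions could be made to compete.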

  20. Sequence-based prediction of protein-protein interaction sites with L1-logreg classifier.

    PubMed

    Dhole, Kaustubh; Singh, Gurdeep; Pai, Priyadarshini P; Mondal, Sukanta

    2014-05-01

    Protein-protein interactions are of central importance for virtually every process in a living cell. Information about the interaction sites in proteins improves our understanding of disease mechanisms and can provide the basis for new therapeutic approaches. Since a multitude of unique residue-residue contacts facilitate the interactions, prediction of protein-protein interaction sites has become one of the most important and challenging problems of computational biology. Although much progress in this field has been reported, this problem is yet to be satisfactorily solved. Here, a novel method (LORIS: L1-regularized LOgistic Regression based protein-protein Interaction Sites predictor) is proposed that identifies interaction residues using sequence features and is implemented via the L1-logreg classifier. Results show that LORIS is not only quite effective but also performs better than existing state-of-the-art methods. LORIS, available as a standalone package, can be useful for facilitating drug design and targeted-mutation studies, which require a deeper knowledge of protein interaction sites. PMID:24486250

  1. An optimized approach to the rapid assessment and detection of sequence variants in recombinant protein products.

    PubMed

    Brady, Lowell J; Scott, Rebecca A; Balland, Alain

    2015-05-01

    The development of sensitive techniques to detect sequence variants (SVs), which naturally arise due to DNA mutations and errors in transcription/translation (amino acid misincorporations), has resulted in increased attention to their potential presence in protein-based biologic drugs in recent years. Often, these SVs may be below 0.1%, adding challenges for consistent and accurate detection. Furthermore, the presence of false-positive (FP) signals, a hallmark of SV analysis, requires time-consuming analyst inspection of the data to sort true from erroneous signal. Consequently, gaps in information about the prevalence, type, and impact of SVs in marketed and in-development products are significant. Here, we report the results of a simple, straightforward, and sensitive approach to sequence variant analysis. This strategy employs mixing of two samples of an antibody or protein with the same amino acid sequence in a dilution series followed by subsequent sequence variant analysis. Using automated peptide map analysis software, a quantitative assessment of the levels of SVs in each sample can be made based on the signal derived from the mass spectrometric data. We used this strategy to rapidly detect differences in sequence variants in a monoclonal antibody after a change in process scale, and in a comparison of three mAbs as part of a biosimilar program. This approach is powerful, as true signals can be readily distinguished from FP signal, even at a level well below 0.1%, by using a simple linear regression analysis across the data set with none to minimal inspection of the MS/MS data. Additionally, the data produced from these studies can also be used to make a quantitative assessment of relative levels of product quality attributes. The information provided here extends the published knowledge about SVs and provides context for the discussion around the potential impact of these SVs on product heterogeneity and immunogenicity. PMID:25795027
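
    The reasoning behind the dilution series can be sketched as follows: if a variant is truly present in one sample, its signal should scale linearly with the fraction of that sample in each mixture, whereas a false-positive signal should show no such trend. The code below fits an ordinary least-squares line and reports R². The mixing fractions and signal values are invented for illustration, and the abstract does not specify the regression details or any acceptance threshold.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r_squared

# Fraction of sample B in each mixture (hypothetical dilution series).
fraction_b = [0.0, 0.25, 0.5, 0.75, 1.0]

# Hypothetical variant signals (percent of peptide signal) in each mixture.
true_variant = [0.00, 0.02, 0.04, 0.06, 0.08]   # scales with the mixing fraction
false_positive = [0.03, 0.00, 0.05, 0.01, 0.02]  # no relationship to mixing

for name, signal in [("true", true_variant), ("false positive", false_positive)]:
    slope, intercept, r2 = linear_fit(fraction_b, signal)
    print(name, round(slope, 4), round(r2, 3))
```

    In this toy data the true variant gives a slope equal to its level in sample B and an R² near 1, while the false positive gives an R² near 0.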

  2. Evolutionary sequence comparisons using high-density oligonucleotide arrays.

    PubMed

    Hacia, J G; Makalowski, W; Edgemon, K; Erdos, M R; Robbins, C M; Fodor, S P; Brody, L C; Collins, F S

    1998-02-01

    We explored the utility of high-density oligonucleotide arrays (DNA chips) for obtaining sequence information from homologous genes in closely related species. Orthologues of the human BRCA1 exon 11, all approximately 3.4 kb in length and ranging from 98.2% to 83.5% nucleotide identity, were subjected to hybridization-based and conventional dideoxysequencing analysis. Retrospective guidelines for identifying high-fidelity hybridization-based sequence calls were formulated based upon dideoxysequencing results. Prospective application of these rules yielded base-calling with at least 98.8% accuracy over orthologous sequence tracts shown to have approximately 99% identity. For higher primate sequences with greater than 97% nucleotide identity, base-calling was made with at least 99.91% accuracy covering a minimum of 97% of the sequence. Using a second-tier confirmatory hybridization chip strategy, shown in several cases to confirm the identity of predicted sequence changes, the complete sequence of the chimpanzee, gorilla and orangutan orthologues should be deducible solely through hybridization-based methodologies. Analysis of less highly conserved orthologues can still identify conserved nucleotide tracts of at least 15 nucleotides and can provide useful information for designing primers. DNA-chip based assays can be a valuable new technology for obtaining high-throughput cost-effective sequence information from related genomes. PMID:9462745

  3. ClusCo: clustering and comparison of protein models

    PubMed Central

    2013-01-01

    Background: The development, optimization and validation of protein modeling methods require efficient tools for structural comparison. Frequently, a large number of models need to be compared with the target native structure. The main reason for the development of the ClusCo software was to create a high-throughput tool for all-versus-all comparison, because calculating the similarity matrix is one of the bottlenecks in the protein modeling pipeline. Results: ClusCo is fast and easy-to-use software for high-throughput comparison of protein models with different similarity measures (cRMSD, dRMSD, GDT_TS, TM-Score, MaxSub, Contact Map Overlap) and clustering of the comparison results with standard methods: K-means Clustering or Hierarchical Agglomerative Clustering. Conclusions: The application was highly optimized and written in C/C++, including the code for parallel execution on CPU and GPU, which resulted in a significant speedup over similar clustering and scoring computation programs. PMID:23433004
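
    Of the similarity measures listed, dRMSD (distance RMSD) is the simplest to illustrate because it compares intramolecular distances and needs no structural superposition. The sketch below uses the usual definition, the root mean square of the differences between corresponding pairwise distances in the two models. It is a plain Python illustration with made-up coordinates, not ClusCo's optimized C/C++ implementation.

```python
import math
from itertools import combinations

def drmsd(coords_a, coords_b):
    """Distance RMSD between two models with the same number of atoms.

    coords_a and coords_b are sequences of (x, y, z) tuples, for example
    C-alpha positions listed in the same residue order.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("models must have the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("need at least two atoms")
    squared_sum = 0.0
    n_pairs = 0
    for i, j in combinations(range(len(coords_a)), 2):
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        squared_sum += (d_a - d_b) ** 2
        n_pairs += 1
    return math.sqrt(squared_sum / n_pairs)

# Made-up three-atom models: a straight chain and a bent chain.
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(round(drmsd(straight, bent), 4))
```

    Because only internal distances are compared, translating or rotating one model leaves the value unchanged. An all-versus-all comparison of M models calls such a function M(M - 1)/2 times, which helps explain why the similarity matrix is a bottleneck.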

  4. Impaired nuclear import of mammalian Dlx4 proteins as a consequence of rapid sequence divergence

    SciTech Connect

    Coubrough, Melissa L.; Bendall, Andrew J. (E-mail: abendall@uoguelph.ca)

    2006-11-15

    Dlx genes encode a developmentally important family of transcription factors with a variety of functions and sites of action during vertebrate embryogenesis. The murine Dlx4 gene is an enigmatic member of the family; little is known about the normal developmental function(s) of Dlx4. Here, we show that Dlx4 is expressed in the murine placenta and in a trophoblast cell line where the protein localizes to both the nucleus and cytoplasm. Despite the presence of several leucine/valine-rich motifs that match known nuclear export sequences, cytoplasmic Dlx4 is not due to CRM-1-mediated nuclear export. Rather, nuclear import of Dlx4 is compromised by specific residues that flank the nuclear localization signal. One of these residues represents a novel conserved feature of the Dlx4 protein in placental mammals, and the second represents novel variation within mouse Dlx4 isoforms. Comparison of orthologous protein sequences reveals a particularly high rate of non-synonymous change in the coding regions of mammalian Dlx4 genes. Since impaired nuclear localization is unlikely to enhance the function of a nuclear transcription factor, these data point to reduced selection pressure as the basis for the rapid divergence of the Dlx4 gene within the mammalian clade.

  5. The Bioinformatics Report of Mutation Outcome on NADPH Flavin Oxidoreductase Protein Sequence in Clinical Isolates of H. pylori.

    PubMed

    Mirzaei, Nasrin; Poursina, Farkhondeh; Moghim, Sharareh; Ghaempanah, Abdol Majid; Safaei, Hajieh Ghasemian

    2016-05-01

    The frxA gene has been implicated in metronidazole nitroreduction by H. pylori. Alternatively, frxA is expected to contribute to the protection of urease and to the in vivo survival of H. pylori. The aim of the present study was to report the effects of mutations on the FrxA protein sequence in clinical isolates of H. pylori in our community. Metronidazole resistance was confirmed in 27 of 48 isolates. The glmM and frxA genes were used for molecular confirmation of H. pylori isolates. A primer set covering the whole frxA gene sequence was used to assess the effect of mutations on the protein sequence. DNA and protein sequences were evaluated and analyzed with the BLAST, Clustal Omega, and T-COFFEE programs. FrxA protein sequences from six metronidazole-resistant clinical isolates were then analyzed with web-based bioinformatics tools. Comparison of the six metronidazole-resistant clinical isolates with strain 26695 showed ten missense mutations. Analysis with the STRING program revealed no change after these sequence alterations. According to consensus data from four methods, residue substitutions at 40, 13, and 141 increase the stability of the protein after mutation, while the other alterations decrease it. Residue substitutions at 40, 43, 141, 138, 169, and 179 are deleterious, while the V7I, Q10R, V34I, and V96I alterations are neutral. Because FrxA contributes to the survival of the bacterium, and given the effect of mutations on protein function, these mutations might affect bacterial survival and phenotype, and this needs further study. Also, although none of the stability prediction tools is perfect, iStable was the best predictor among the methods used. PMID:26821239

  6. X-ray sequence and crystal structure of luffaculin 1, a novel type 1 ribosome-inactivating protein

    PubMed Central

    Hou, Xiaomin; Chen, Minghuang; Chen, Liqing; Meehan, Edward J; Xie, Jieming; Huang, Mingdong

    2007-01-01

    Background: Protein sequence can be obtained through Edman degradation, mass spectrometry, or cDNA sequencing. High-resolution X-ray crystallography can also be used to derive protein sequence information, but has difficulty distinguishing the Asp/Asn, Glu/Gln, and Val/Thr pairs. Luffaculin 1 is a new type 1 ribosome-inactivating protein (RIP) isolated from the seeds of Luffa acutangula. Besides rRNA N-glycosidase activity, luffaculin 1 also inhibits tumor cell proliferation and induces tumor cell differentiation. Results: The crystal structure of luffaculin 1 was determined at 1.4 Å resolution. Its amino-acid sequence was derived from this high-resolution structure using the following criteria: 1) high-resolution electron density; 2) comparison of electron density between two molecules that exist in the same crystal; 3) evaluation of the chemical environment of residues to resolve the sequence assignment ambiguity in residue pairs Glu/Gln, Asp/Asn, and Val/Thr; 4) comparison with sequences of the homologous proteins. Using criteria 1 and 2, 66% of the residues could be assigned. By adding criterion 3, 86% of the residues were assigned, suggesting the effectiveness of chemical environment evaluation in resolving residue ambiguity. In total, 94% of the luffaculin 1 sequence was assigned with high confidence using this improved X-ray sequencing strategy. Two N-acetylglucosamine moieties, linked respectively to residues Asn77 and Asn84, can be identified in the structure. Residues Tyr70, Tyr110, Glu159 and Arg162 define the active site of luffaculin 1 as an RNA N-glycosidase. Conclusion: The X-ray sequencing method can be effective for deriving sequence information of proteins. The evaluation of the chemical environment of residues is a useful method to resolve the assignment ambiguity in Glu/Gln, Asp/Asn, and Val/Thr pairs. The sequence and the crystal structure confirm that luffaculin 1 is a new

  7. Quantitative Comparison of Large-Scale DNA Enrichment Sequencing Data.

    PubMed

    Lienhard, Matthias; Chavez, Lukas

    2016-01-01

    DNA enrichment followed by sequencing (DNA-IP seq) is a versatile tool in molecular biology with a wide variety of applications. Computational analysis of differential DNA enrichment between conditions is important for identifying epigenetic alterations in disease compared to healthy controls and for revealing dynamic epigenetic modifications throughout normal and distorted cell differentiation and development. We present a protocol for genome-wide comparative analysis of DNA-IP sequencing data to identify statistically significant differential sequencing coverage between two conditions by considering variation across replicates. The protocol provides a detailed description for the comparative analysis of DNA-IP sequencing data including basic data processing, quality controls, and identification of differential enrichment using the Bioconductor package "MEDIPS". PMID:27008016

  8. Close sequence comparisons are sufficient to identify human cis-regulatory elements.

    PubMed

    Prabhakar, Shyam; Poulin, Francis; Shoukry, Malak; Afzal, Veena; Rubin, Edward M; Couronne, Olivier; Pennacchio, Len A

    2006-07-01

    Cross-species DNA sequence comparison is the primary method used to identify functional noncoding elements in human and other large genomes. However, little is known about the relative merits of evolutionarily close and distant sequence comparisons. To address this problem, we identified evolutionarily conserved noncoding regions in primate, mammalian, and more distant comparisons using a uniform approach (Gumby) that facilitates unbiased assessment of the impact of evolutionary distance on predictive power. We benchmarked computational predictions against previously identified cis-regulatory elements at diverse genomic loci and also tested numerous extremely conserved human-rodent sequences for transcriptional enhancer activity using an in vivo enhancer assay in transgenic mice. Human regulatory elements were identified with acceptable sensitivity (53%-80%) and true-positive rate (27%-67%) by comparison with one to five other eutherian mammals or six other simian primates. More distant comparisons (marsupial, avian, amphibian, and fish) failed to identify many of the empirically defined functional noncoding elements. Our results highlight the practical utility of close sequence comparisons, and the loss of sensitivity entailed by more distant comparisons. We derived an intuitive relationship between ancient and recent noncoding sequence conservation from whole-genome comparative analysis that explains most of the observations from empirical benchmarking. Lastly, we determined that, in addition to strength of conservation, genomic location and/or density of surrounding conserved elements must also be considered in selecting candidate enhancers for in vivo testing at embryonic time points. PMID:16769978

  9. A potent antimicrobial protein from onion seeds showing sequence homology to plant lipid transfer proteins.

    PubMed Central

    Cammue, B P; Thevissen, K; Hendriks, M; Eggermont, K; Goderis, I J; Proost, P; Van Damme, J; Osborn, R W; Guerbette, F; Kader, J C

    1995-01-01

    An antimicrobial protein of about 10 kD, called Ace-AMP1, was isolated from onion (Allium cepa L.) seeds. Based on the near-complete amino acid sequence of this protein, oligonucleotides were designed for polymerase chain reaction-based cloning of the corresponding cDNA. The mature protein is homologous to plant nonspecific lipid transfer proteins (nsLTPs), but it shares only 76% of the residues that are conserved among all known plant nsLTPs and is unusually rich in arginine. Ace-AMP1 inhibits all 12 tested plant pathogenic fungi at concentrations below 10 μg/mL. Its antifungal activity is either unaffected or only weakly affected by the presence of different cations at concentrations approximating physiological ionic strength conditions. Ace-AMP1 is also active on two Gram-positive bacteria but is apparently not toxic for Gram-negative bacteria and cultured human cells. In contrast to nsLTPs such as those isolated from radish or maize seeds, Ace-AMP1 was unable to transfer phospholipids from liposomes to mitochondria. On the other hand, lipid transfer proteins from wheat and maize seeds showed little or no antimicrobial activity, whereas the radish lipid transfer protein displayed antifungal activity only in media with low cation concentrations. The relevance of these findings with regard to the function of nsLTPs is discussed. PMID:7480341

  10. Comparison of simple sequence repeats in 19 Archaea.

    PubMed

    Trivedi, S

    2006-01-01

    All organisms that have been studied until now have been found to have differential distribution of simple sequence repeats (SSRs), with more SSRs in intergenic than in coding sequences. SSR distribution was investigated in Archaea genomes where complete chromosome sequences of 19 Archaea were analyzed with the program SPUTNIK to find di- to penta-nucleotide repeats. The number of repeats was determined for the complete chromosome sequences and for the coding and non-coding sequences. Different from what has been found for other groups of organisms, there is an abundance of SSRs in coding regions of the genome of some Archaea. Dinucleotide repeats were rare and CG repeats were found in only two Archaea. In general, trinucleotide repeats are the most abundant SSR motifs; however, pentanucleotide repeats are abundant in some Archaea. Some of the tetranucleotide and pentanucleotide repeat motifs are organism specific. In general, repeats are short and CG-rich repeats are present in Archaea having a CG-rich genome. Among the 19 Archaea, SSR density was not correlated with genome size or with optimum growth temperature. Pentanucleotide density had an inverse correlation with the CG content of the genome. PMID:17183484
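
    The repeat search described above can be illustrated with a minimal sketch. It does not reproduce SPUTNIK's scoring scheme; it only finds perfect tandem repeats of 2 to 5 nucleotide units with a regular expression. The function name `find_ssrs` and the `min_repeats` threshold are assumptions made for illustration and are not taken from the record.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=5, min_repeats=3):
    """Return sorted (start, unit, repeat_count) tuples for perfect tandem repeats.

    Hypothetical helper: perfect repeats only, no mismatch tolerance.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # A unit of unit_len bases followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves built from a shorter motif,
            # e.g. "CGCG" is really the dinucleotide "CG" repeated.
            is_compound = any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if is_compound:
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


example = "T" + "CG" * 4 + "A" + "ACGAT" * 3 + "G"
for start, unit, count in find_ssrs(example):
    print(start, unit, count)
```

    Counting such hits separately within annotated coding and non-coding coordinates would give the kind of per-region comparison the abstract reports; those coordinates would have to come from the genome annotation.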

  11. Use of gene sequence analyses and genome comparisons for yeast systematics

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Detection, identification, and classification of yeasts has undergone a major transformation in the past decade and a half following application of gene sequence analyses and genome comparisons. Development of a database (barcode) of easily determined gene sequences from domains 1 and 2 of large sub...

  12. Evolution of EF-hand calcium-modulated proteins. III. Exon sequences confirm most dendrograms based on protein sequences: calmodulin dendrograms show significant lack of parallelism

    NASA Technical Reports Server (NTRS)

    Nakayama, S.; Kretsinger, R. H.

    1993-01-01

    In the first report in this series we presented dendrograms based on 152 individual proteins of the EF-hand family. In the second we used sequences from 228 proteins, containing 835 domains, and showed that eight of the 29 subfamilies are congruent and that the EF-hand domains of the remaining 21 subfamilies have diverse evolutionary histories. In this study we have computed dendrograms within and among the EF-hand subfamilies using the encoding DNA sequences. In most instances the dendrograms based on protein and on DNA sequences are very similar. Significant differences between protein and DNA trees for calmodulin remain unexplained. In our fourth report we evaluate the sequences and the distribution of introns within the EF-hand family and conclude that exon shuffling did not play a significant role in its evolution.

  13. β-Glucosidase coding sequences and protein from Orpinomyces PC-2

    DOEpatents

    Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong; Ximenes, Eduardo A.

    2001-02-06

    Provided is a novel β-glucosidase from Orpinomyces sp. PC2, nucleotide sequences encoding the mature protein and the precursor protein, and methods for recombinant production of this β-glucosidase.

  14. Phylogenetic relationships of Cryptosporidium determined by ribosomal RNA sequence comparison.

    PubMed

    Johnson, A M; Fielke, R; Lumb, R; Baverstock, P R

    1990-04-01

    Reverse transcription of total cellular RNA was used to obtain a partial sequence of the small subunit ribosomal RNA of Cryptosporidium, a protist currently placed in the phylum Apicomplexa. The semi-conserved regions were aligned with homologous sequences in a range of other eukaryotes, and the evolutionary relationships of Cryptosporidium were determined by two different methods of phylogenetic analysis. The prokaryotes Escherichia coli and Halobacterium cuti were included as outgroups. The results do not show an especially close relationship of Cryptosporidium to other members of the phylum Apicomplexa. PMID:2332273

  15. A local average distance descriptor for flexible protein structure comparison

    PubMed Central

    2014-01-01

    Background Protein structures are flexible and often show conformational changes upon binding to other molecules to exert biological functions. As protein structures correlate with characteristic functions, structure comparison allows classification and prediction of proteins of undefined functions. However, most comparison methods treat proteins as rigid bodies and cannot retrieve similarities of proteins with large conformational changes effectively. Results In this paper, we propose a novel descriptor, local average distance (LAD), based on either the geodesic distances (GDs) or Euclidean distances (EDs) for pairwise flexible protein structure comparison. The proposed method was compared with 7 structural alignment methods and 7 shape descriptors on two datasets comprising hinge bending motions from the MolMovDB, and the results have shown that our method outperformed all other methods regarding retrieving similar structures in terms of precision-recall curve, retrieval success rate, R-precision, mean average precision and F1-measure. Conclusions Both ED- and GD-based LAD descriptors are effective to search deformed structures and overcome the problems of self-connection caused by a large bending motion. We have also demonstrated that the ED-based LAD is more robust than the GD-based descriptor. The proposed algorithm provides an alternative approach for blasting structure database, discovering previously unknown conformational relationships, and reorganizing protein structure classification. PMID:24694083

  16. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and 3-dimensional structural information

    PubMed Central

    Pei, Jimin; Grishin, Nick V.

    2015-01-01

    SUMMARY Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of 3-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D web server and package are available at http://prodata.swmed.edu/PROMALS3D. PMID:24170408

  17. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and three-dimensional structural information.

    PubMed

    Pei, Jimin; Grishin, Nick V

    2014-01-01

    Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of three-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D Web server and package are available at http://prodata.swmed.edu/PROMALS3D. PMID:24170408

  18. Proteomic Analysis of Lyme Disease: Global Protein Comparison of Three Strains of Borrelia burgdorferi

    SciTech Connect

    Jacobs, Jon M.; Yang, Xiaohua; Luft, Benjamin J.; Dunn, John J.; Camp, David G.; Smith, Richard D.

    2005-04-01

    The Borrelia burgdorferi spirochete is the causative agent of Lyme disease, the most common tick-borne disease in the United States. It has been studied extensively to help understand its pathogenicity of infection and how it can persist in different mammalian hosts. We report the proteomic analysis of the archetype B. burgdorferi B31 strain and two other strains (ND40 and JD-1) having different Borrelia pathotypes using strong cation exchange fractionation of proteolytic peptides followed by high-resolution, reversed phase capillary liquid chromatography coupled with ion trap tandem mass spectrometric (LC-MS/MS) analysis. Protein identification was facilitated by the availability of the complete B31 genome sequence. A total of 665 Borrelia proteins were identified, representing ~38% coverage of the theoretical B31 proteome. A significant overlap was observed between the identified proteins in direct comparisons between any two strains (>72%), but distinct differences were observed among identified hypothetical and outer membrane proteins of the three strains. Such a concurrent proteomic overview of three Borrelia strains based upon only the B31 genome sequence is shown to provide significant insights into the presence or absence of specific proteins and a broad overall comparison among strains.

  19. A COMPARISON OF FIXED SEQUENCE AND OPTIONAL BRANCHING AUTOINSTRUCTIONAL METHODS.

    ERIC Educational Resources Information Center

    MELARAGNO, RALPH J.; AND OTHERS

    HYPOTHESES RELATED TO PROCEDURES PERMITTING STUDENTS TO BRANCH AT THEIR OWN OPTION WERE TESTED. THE FIRST HYPOTHESIS WAS THAT A FIXED-SEQUENCE PROGRAM WOULD BE LESS EFFECTIVE THAN THE SAME ITEMS CAST AS STATEMENTS IN TEXTBOOK FORMAT THROUGH WHICH THE STUDENT COULD SKIP AT HIS OWN OPTION. THE SECOND HYPOTHESIS WAS THAT PERFORMANCE ON A PROGRAM…

  20. UNIT 11.10 N-Terminal Sequence Analysis of Proteins and Peptides

    PubMed Central

    Speicher, Kaye D.; Gorman, Nicole; Speicher, David W.

    2009-01-01

    Automated N-terminal sequence analysis involves a series of chemical reactions that derivatize and remove one amino acid at a time from the N-terminal of purified peptides or intact proteins. At least several pmoles of a purified protein or 10 to 20 pmoles of a purified peptide with an unmodified N-terminal is required in order to obtain useful sequence information. In recent years the demand for N-terminal sequencing has decreased substantially as some applications for protein identification and characterization can now be more effectively performed using mass spectrometry. However, N-terminal sequencing remains the method of choice for verifying the N-terminal boundary of recombinant proteins, determining the N-terminal of protease-resistant domains, identifying proteins isolated from species where most of the genome has not yet been sequenced, and mapping modified or crosslinked sites in proteins that prove to be refractory to analysis by mass spectrometry. PMID:18429102

  1. Studies on the high-sulphur proteins of reduced Merino wool. Amino acid sequence of protein SCMKB-IIIB4

    PubMed Central

    Swart, L. S.; Haylett, T.

    1971-01-01

    The complete amino acid sequence of protein SCMKB-IIIB4 is presented. It is closely related to the sequence of protein SCMKB-IIIB3 (Haylett, Swart & Parris, 1971) differing in only four positions. The peptic and thermolysin peptides of protein SCMKB-IIIB4 were analysed by the dansyl–Edman method (Gray, 1967) and by tritium-labelling of C-terminal residues (Matsuo, Fujimoto & Tatsuno, 1966). This protein is the third member of a group of high-sulphur wool proteins with molecular weight of about 11400. It consists of 98 residues and has acetylalanine and carboxymethylcysteine as N- and C-terminal residues respectively. PMID:4942536

  2. Conservation of Shannon's redundancy for proteins. [information theory applied to amino acid sequences]

    NASA Technical Reports Server (NTRS)

    Gatlin, L. L.

    1974-01-01

    Concepts of information theory are applied to examine various proteins in terms of their redundancy in natural originators such as animals and plants. The Monte Carlo method is used to derive information parameters for random protein sequences. Real protein sequence parameters are compared with the standard parameters of protein sequences having a specific length. The tendency of a chain to contain some amino acids more frequently than others and the tendency of a chain to contain certain amino acid pairs more frequently than other pairs are used as randomness measures of individual protein sequences. Non-periodic proteins are generally found to have random Shannon redundancies except in cases of constraints due to short chain length and genetic codes. Redundant characteristics of highly periodic proteins are discussed. A degree of periodicity parameter is derived.
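
    As a rough illustration of the two randomness measures mentioned above (unequal amino acid usage and unequal adjacent-pair usage), the sketch below computes Shannon entropies and the corresponding redundancies for a single sequence. It is a simplified reading of the idea, not the record's actual parameter definitions; the function names, the fixed 20-letter alphabet, and the choice of adjacent-pair conditional entropy are assumptions.

```python
import math
from collections import Counter

ALPHABET_SIZE = 20  # assumption: the standard 20 amino acids


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy_measures(seq):
    """Return composition and adjacent-pair entropies with their redundancies."""
    h_max = math.log2(ALPHABET_SIZE)
    # Entropy of single-residue usage.
    h_single = entropy(Counter(seq))
    # Conditional entropy of a residue given its predecessor:
    # joint entropy of adjacent pairs minus entropy of the first member.
    h_pairs = entropy(Counter(zip(seq, seq[1:])))
    h_first = entropy(Counter(seq[:-1]))
    h_cond = h_pairs - h_first
    return {
        "H_single": h_single,
        "H_cond": h_cond,
        "R_single": 1 - h_single / h_max,
        "R_cond": 1 - h_cond / h_max,
    }


print(redundancy_measures("ACACACAC"))
```

    A strictly periodic chain such as the example has zero conditional entropy, so its pair-based redundancy is maximal, which matches the abstract's separate treatment of highly periodic proteins.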

  3. Structuring temporal sequences: comparison of models and factors of complexity.

    PubMed

    Essens, P

    1995-05-01

    Two stages for structuring tone sequences have been distinguished by Povel and Essens (1985). In the first, a mental clock segments a sequence into equal time units (clock model); in the second, intervals are specified in terms of subdivisions of these units. The present findings support the clock model in that it predicts human performance better than three other algorithmic models. Two further experiments in which clock and subdivision characteristics were varied did not support the hypothesized effect of the nature of the subdivisions on complexity. A model focusing on the variations in the beat-anchored envelopes of the tone clusters was proposed. Errors in reproduction suggest a dual-code representation comprising temporal and figural characteristics. The temporal part of the representation is based on the clock model but specifies, in addition, the metric of the level below the clock. The beat-tone-cluster envelope concept was proposed to specify the figural part. PMID:7596749

  4. Complete mitochondrial genome DNA sequence for two ophiuroids and a holothuroid: the utility of protein gene sequence and gene maps in the analyses of deep deuterostome phylogeny.

    PubMed

    Scouras, Andrea; Beckenbach, Karen; Arndt, Allan; Smith, Michael J

    2004-04-01

    The complete mitochondrial genome sequences have been determined for the holothuroid Cucumaria miniata and two ophiuroid species Ophiopholis aculeata and Ophiura lütkeni. In addition, the nucleotide sequence of the mitochondrial protein-coding genes for the asteroid Pisaster ochraceus has been completed. Maximum-likelihood and LogDet distance analyses of concatenated protein-coding sequences produced a series of trees that did not conclusively support generally accepted models of echinoderm phylogeny. The ophiuroid data consistently demonstrated accelerated nucleotide divergence rates and lack of stationarity. This confounds the phylogenetic analyses. Molecular investigations using individual protein-coding gene alignments demonstrated that the cytochrome b gene exhibits the least deviation in rate and stationarity and generated some trees consistent with proposed echinoderm phylogenies. Phylogenies based on echinoderm mitochondrial gene rearrangements also proved problematic because of extensive variation in gene order between and within classes. A comparison of the two distinctive ophiuroid mitochondrial gene orders supports the hypothesis that O. lütkeni has a more derived mitochondrial gene order versus O. aculeata. The variation in the echinoderm mitochondrial gene maps reinforces the limitations of the application of mitochondrial gene rearrangements as a global phylogenetic tool. PMID:15019608

  5. Identification of Disulfide Bonds in Protein Proteolytic Degradation Products Using de Novo-Protein Unique Sequence Tags Approach

    SciTech Connect

    Shen, Yufeng; Tolic, Nikola; Purvine, Samuel O.; Smith, Richard D.

    2010-08-01

    Disulfide bonds are a form of posttranslational modification that often determines protein structure(s) and function(s). In this work, we report a mass spectrometry method for identification of disulfides in degradation products of proteins, and specifically endogenous peptides in the human blood plasma peptidome. LC-Fourier transform tandem mass spectrometry (FT MS/MS) was used for acquiring mass spectra that were de novo sequenced and then searched against the IPI human protein database. Through the use of unique sequence tags (UStags) we unambiguously correlated the spectra to specific database proteins. Examination of the UStags’ prefix and/or suffix sequences that contain cysteine(s) in conjunction with sequences of the UStags-specified database proteins is shown to enable the unambiguous determination of disulfide bonds. Using this method, we identified the intermolecular and intramolecular disulfides in human blood plasma peptidome peptides that have molecular weights of up to ~10 kDa.

  6. Identification of disulfide bonds in protein proteolytic degradation products using de novo-protein unique sequence tags approach.

    PubMed

    Shen, Yufeng; Tolić, Nikola; Purvine, Samuel O; Smith, Richard D

    2010-08-01

    Disulfide bonds are a form of post-translational modification that often determines protein structure(s) and function(s). In this work, we report a mass spectrometry method for identification of disulfides in degradation products of proteins, specifically endogenous peptides in the human blood plasma peptidome. LC-Fourier transform tandem mass spectrometry (FT MS/MS) was used for acquiring mass spectra that were de novo sequenced and then searched against the IPI human protein database. Through the use of unique sequence tags (UStags), we unambiguously correlated the spectra to specific database proteins. Examination of the UStags' prefix and/or suffix sequences that contain cysteine(s) in conjunction with sequences of the UStags-specified database proteins is shown to enable the unambiguous determination of disulfide bonds. Using this method, we identified the intermolecular and intramolecular disulfides in human blood plasma peptidome peptides that have molecular weights of up to approximately 10 kDa. PMID:20590115

  7. Natural vs. random protein sequences: Discovering combinatorics properties on amino acid words.

    PubMed

    Santoni, Daniele; Felici, Giovanni; Vergni, Davide

    2016-02-21

    Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe at present in nature. The question of which is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences while, in order to have the correct functionality, one expects that selection mechanisms impose rigid constraints on amino acid sequences. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that it could be reasonable for a well-tuned amino acid sequence to be indistinguishable from a random one. In order to study the possibility of discriminating between random and natural amino acid sequences, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones. PMID:26656109
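
    One simple way to build random controls that keep the length and amino acid composition of a natural protein is to shuffle its residues. The sketch below is only an assumption about how such controls could be produced; the record does not state the authors' generation procedure. The function name `shuffled_controls` and the default of ten controls per protein (chosen because 10,470 is ten times 1047) are hypothetical.

```python
import random
from collections import Counter


def shuffled_controls(seq, n_controls=10, seed=0):
    """Return n_controls random permutations of seq.

    Each control has exactly the same length and residue counts as seq;
    only the order of residues is randomized.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    controls = []
    for _ in range(n_controls):
        residues = list(seq)
        rng.shuffle(residues)
        controls.append("".join(residues))
    return controls


natural = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
controls = shuffled_controls(natural)
print(len(controls), all(Counter(c) == Counter(natural) for c in controls))
```

    Pairwise association measures computed on the natural sequence and on its shuffled controls could then be passed to a classifier, since shuffling removes residue order while leaving composition untouched.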

  8. Conformational selection underpins recognition of multiple DNA sequences by proteins and consequent functional actions.

    PubMed

    Naiya, Gitashri; Raha, Paromita; Mondal, Manas Kumar; Pal, Uttam; Saha, Rajesh; Chaudhuri, Susobhan; Batabyal, Subrata; Kumar Pal, Samir; Bhattacharyya, Dhananjay; Maiti, Nakul C; Roy, Siddhartha

    2016-08-21

    Recognition of multiple functional DNA sequences by a DNA-binding protein occurs widely in nature. The physico-chemical basis of this phenomenon is not well-understood. The E. coli gal repressor, a gene regulatory protein, binds two homologous but non-identical sixteen basepair sequences in the gal operon and interacts by protein-protein interaction to regulate gene expression. The two sites have nearly equal affinities for the Gal repressor. Spectroscopic studies of the Gal repressor bound to these two different DNA sequences detected significant conformational differences between them. Comprehensive single base-substitution and binding measurements were carried out on the two sequences to understand the nature of the two protein-DNA interfaces. Magnitudes of basepair-protein interaction energy show significant variation between homologous positions of the two DNA sequences. Magnitudes of variation are such that when summed over the whole sequence they largely cancel each other out, thus producing nearly equal net affinity. Modeling suggests significant alterations in the protein-DNA interface in the two complexes, which are consistent with conformational adaptation of the protein to different DNA sequences. The functional role of the two sequences was studied by substitution of one site by the other and vice versa. In both cases, substitution reduces repression in vivo. This suggests that naturally occurring DNA sequence variations play functional roles beyond merely acting as high-affinity anchoring points. We propose that two different pre-existing conformations in the conformational ensemble of the free protein are selected by two different DNA sequences for efficient sequence read-out and the conformational difference of the bound proteins leads to different functional roles. PMID:27426617

  9. 3D reconstruction software comparison for short sequences

    NASA Astrophysics Data System (ADS)

    Strupczewski, Adam; Czupryński, Błażej

    2014-11-01

    Large-scale multiview reconstruction has recently become a very popular area of research. There are many open source tools that can be downloaded and run on a personal computer. However, there are few, if any, comparisons of the available software in terms of accuracy on small datasets that a single user can create. The typical datasets used for testing the software are archaeological sites or cities, comprising thousands of images. This paper presents a comparison of currently available open source multiview reconstruction software on small datasets. It also compares the open source solutions with a simple structure-from-motion pipeline developed by the authors from scratch using the OpenCV and Eigen libraries.

  10. Quantitative Assessment of RNA-Protein Interactions with High Throughput Sequencing - RNA Affinity Profiling (HiTS-RAP)

    PubMed Central

    Ozer, Abdullah; Tome, Jacob M.; Friedman, Robin C.; Gheba, Dan; Schroth, Gary P.; Lis, John T.

    2016-01-01

    Because RNA-protein interactions play a central role in a wide array of biological processes, methods that enable a quantitative assessment of these interactions in a high-throughput manner are in great demand. Recently, we developed the High Throughput Sequencing-RNA Affinity Profiling (HiTS-RAP) assay, which couples sequencing on an Illumina GAIIx with the quantitative assessment of one or several proteins’ interactions with millions of different RNAs in a single experiment. We have successfully used HiTS-RAP to analyze interactions of EGFP and NELF-E proteins with their corresponding canonical and mutant RNA aptamers. Here, we provide a detailed protocol for HiTS-RAP, which can be completed in about a month (8 days hands-on time) including the preparation and testing of recombinant proteins and DNA templates, clustering DNA templates on a flowcell, high-throughput sequencing and protein binding with GAIIx, and finally data analysis. We also highlight aspects of HiTS-RAP that can be further improved and points of comparison between HiTS-RAP and two other recently developed methods, RNA-MaP and RBNS. A successful HiTS-RAP experiment provides the sequence and binding curves for approximately 200 million RNAs in a single experiment. PMID:26182240

  11. Protein identities from 'Graphocephala atropunctata' expressed sequence tags: Expanding leafhopper vector biology

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Heat shock proteins and 44 protein sequences from the blue-green sharpshooter, BGSS, were produced and identified. The sequences were submitted and published under accession numbers: DQ445499-DQ445542, in the National Center for Biotechnology Information, NCBI, Public Database. The blue-green sharps...

  12. Sequences of a FK-506 binding protein from Edwardsiella ictaluri isolates

    Technology Transfer Automated Retrieval System (TEKTRAN)

    A FK-506 binding protein, a member of the immunophilin superfamily of Edwardsiella ictaluri was partially identified by the in-vivo-induced antigen technology. We further cloned and sequenced this FK-506 binding protein gene using a Universal GenomeWalker kit. The complete sequence consisted of 612 ...

  13. Interferon Consensus Sequence Binding Protein Confers Resistance against Yersinia enterocolitica

    PubMed Central

    Hein, Joachim; Kempf, Volkhard A. J.; Diebold, Joachim; Bücheler, Nicole; Preger, Sonja; Horak, Ivan; Sing, Andreas; Kramer, Uwe; Autenrieth, Ingo B.

    2000-01-01

    Interferon consensus sequence binding protein (ICSBP)-deficient mice display enhanced susceptibility to intracellular pathogens. At least two distinct immunoregulatory defects are responsible for this phenotype. First, diminished production of reactive oxygen intermediates in macrophages results in impaired intracellular killing of microorganisms. Second, defective early interleukin-12 (IL-12) production upon microbial challenge leads to a failure in gamma interferon (IFN-γ) induction and subsequently in T helper 1 immune responses. Here, we investigated the role of ICSBP in resistance against the extracellular bacterium Yersinia enterocolitica. ICSBP−/− mice failed to produce IL-12 and IFN-γ, but also IL-4, after Yersinia challenge. In addition, granuloma formation was highly disturbed in infected ICSBP−/− mice, leading to multiple necrotic abscesses in affected organs. Consequently, ICSBP−/− mice rapidly succumbed to acute Yersinia infection. In vitro treatment of spleen cells from ICSBP−/− mice with recombinant IL-12 (rIL-12) or rIL-18 in combination with a second stimulus resulted in IFN-γ induction. In experimental therapy of infected ICSBP−/− mice, we observed that administration of rIL-12 induced IFN-γ production which was associated with improved resistance to Yersinia. In contrast, treatment with rIL-18 failed to enhance endogenous IFN-γ production but nevertheless reduced bacterial burden in ICSBP−/− mice. Although cytokine therapy with rIL-12 or rIL-18 ameliorated the course of Yersinia infection in ICSBP−/− mice, both cytokines failed to completely restore impaired immunity. Taken together, the results indicate that the transcription factor ICSBP is essential for efficient host immune defense against Yersinia. These results are important for understanding the complex host immune responses in bacterial infections. PMID:10678954

  14. Protein sequences from mastodon and Tyrannosaurus rex revealed by mass spectrometry.

    PubMed

    Asara, John M; Schweitzer, Mary H; Freimark, Lisa M; Phillips, Matthew; Cantley, Lewis C

    2007-04-13

    Fossilized bones from extinct taxa harbor the potential for obtaining protein or DNA sequences that could reveal evolutionary links to extant species. We used mass spectrometry to obtain protein sequences from bones of a 160,000- to 600,000-year-old extinct mastodon (Mammut americanum) and a 68-million-year-old dinosaur (Tyrannosaurus rex). The presence of T. rex sequences indicates that their peptide bonds were remarkably stable. Mass spectrometry can thus be used to determine unique sequences from ancient organisms from peptide fragmentation patterns, a valuable tool to study the evolution and adaptation of ancient taxa from which genomic sequences are unlikely to be obtained. PMID:17431180

  15. Seeking significance in three-dimensional protein structure comparisons.

    PubMed

    Mizuguchi, K; Go, N

    1995-06-01

    What is the significance of three-dimensional structural similarity? This fundamental question still remains unanswered in spite of advances in automatic structure comparison methods that have been made in the last few years. The answer to this question will give us a much deeper insight into the principles of protein architecture. PMID:7583636

  16. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    PubMed Central

    Sharma, Anuj; Manolakos, Elias S.

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
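
    The quoted figures can be related by the usual definitions: parallel efficiency is speedup divided by the number of cores, and the F-measure is commonly the harmonic mean of precision and recall. The sketch below assumes those standard definitions apply here; the record does not spell out how the authors computed either quantity, and the function names are illustrative.

```python
def parallel_efficiency(speedup, n_cores):
    """Fraction of ideal linear speedup achieved on n_cores."""
    return speedup / n_cores


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Speedup of 42 on the 48-core SCC, as quoted in the abstract.
print(round(parallel_efficiency(42, 48), 3))  # 0.875, reported as roughly 0.9
```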

  17. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    PubMed

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  18. PROMALS3D web server for accurate multiple protein sequence and structure alignments.

    PubMed

    Pei, Jimin; Tang, Ming; Grishin, Nick V

    2008-07-01

    Multiple sequence alignments are essential in computational sequence and structural analysis, with applications in homology detection, structure modeling, function prediction and phylogenetic analysis. We report PROMALS3D web server for constructing alignments for multiple protein sequences and/or structures using information from available 3D structures, database homologs and predicted secondary structures. PROMALS3D shows higher alignment accuracy than a number of other advanced methods. Input of PROMALS3D web server can be FASTA format protein sequences, PDB format protein structures and/or user-defined alignment constraints. The output page provides alignments with several formats, including a colored alignment augmented with useful information about sequence grouping, predicted secondary structures and consensus sequences. Intermediate results of sequence and structural database searches are also available. The PROMALS3D web server is available at: http://prodata.swmed.edu/promals3d/. PMID:18503087

  19. Rapid removal of unincorporated label and proteins from DNA sequencing reactions.

    PubMed

    Kaczorowski, T; Sektas, M

    1996-04-01

    This article presents a simple and rapid method for removal of unincorporated label and proteins from DNA sequencing reactions by using Wizard purification resin. This method can be successfully applied for preparation of end-labeled oligonucleotides free of unincorporated label, which is important in experiments (including DNA sequencing) when the level of background should be as low as possible. Also, this method is effective in removal of proteins from DNA sequencing reactions. PMID:8734430

  20. Different evolutionary patterns of SNPs between domains and unassigned regions in human protein-coding sequences.

    PubMed

    Pang, Erli; Wu, Xiaomei; Lin, Kui

    2016-06-01

    Protein evolution plays an important role in the evolution of each genome. Because of their functional nature, in general, most of their parts or sites are differently constrained selectively, particularly by purifying selection. Most previous studies on protein evolution considered individual proteins in their entirety or compared protein-coding sequences with non-coding sequences. Less attention has been paid to the evolution of different parts within each protein of a given genome. To this end, based on PfamA annotation of all human proteins, each protein sequence can be split into two parts: domains or unassigned regions. Using this rationale, single nucleotide polymorphisms (SNPs) in protein-coding sequences from the 1000 Genomes Project were mapped according to two classifications: SNPs occurring within protein domains and those within unassigned regions. With these classifications, we found: the density of synonymous SNPs within domains is significantly greater than that of synonymous SNPs within unassigned regions; however, the density of non-synonymous SNPs shows the opposite pattern. We also found there are signatures of purifying selection on both the domain and unassigned regions. Furthermore, the selective strength on domains is significantly greater than that on unassigned regions. In addition, among all of the human protein sequences, there are 117 PfamA domains in which no SNPs are found. Our results highlight an important aspect of protein domains and may contribute to our understanding of protein evolution. PMID:26833483

  1. Close Sequence Comparisons are Sufficient to Identify Human cis-Regulatory Elements

    SciTech Connect

    Prabhakar, Shyam; Poulin, Francis; Shoukry, Malak; Afzal, Veena; Rubin, Edward M.; Couronne, Olivier; Pennacchio, Len A.

    2005-12-01

    Cross-species DNA sequence comparison is the primary method used to identify functional noncoding elements in human and other large genomes. However, little is known about the relative merits of evolutionarily close and distant sequence comparisons, due to the lack of a universal metric for sequence conservation, and also the paucity of empirically defined benchmark sets of cis-regulatory elements. To address this problem, we developed a general-purpose algorithm (Gumby) that detects slowly-evolving regions in primate, mammalian and more distant comparisons without requiring adjustment of parameters, and ranks conserved elements by P-value using Karlin-Altschul statistics. We benchmarked Gumby predictions against previously identified cis-regulatory elements at diverse genomic loci, and also tested numerous extremely conserved human-rodent sequences for transcriptional enhancer activity using reporter-gene assays in transgenic mice. Human regulatory elements were identified with acceptable sensitivity and specificity by comparison with 1-5 other eutherian mammals or 6 other simian primates. More distant comparisons (marsupial, avian, amphibian and fish) failed to identify many of the empirically defined functional noncoding elements. We derived an intuitive relationship between ancient and recent noncoding sequence conservation from whole genome comparative analysis, which explains some of these findings. Lastly, we determined that, in addition to strength of conservation, genomic location and/or density of surrounding conserved elements must also be considered in selecting candidate enhancers for testing at embryonic time points.
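
    Karlin-Altschul statistics, mentioned above as the basis for ranking conserved elements by P-value, can be sketched as follows. This is a minimal illustration of the standard formulas E = K·m·n·exp(-λS) and P = 1 - exp(-E), not Gumby's implementation; the values of K and λ below are placeholder assumptions, since the abstract does not give the parameters actually used.

```python
import math


def karlin_altschul_pvalue(score: float, m: int, n: int,
                           lam: float, k: float) -> float:
    """P-value of a local alignment score under Karlin-Altschul statistics.

    m, n : lengths of the two compared sequences
    lam  : the scale parameter lambda (assumed value, depends on scoring)
    k    : the constant K (assumed value, depends on scoring)
    """
    expected_hits = k * m * n * math.exp(-lam * score)  # E-value
    # 1 - exp(-E), computed with expm1 for accuracy when E is small.
    return -math.expm1(-expected_hits)


# Hypothetical parameters, for illustration only.
for s in (30, 50, 70):
    p = karlin_altschul_pvalue(s, m=1000, n=1000, lam=0.3, k=0.1)
    print(s, p)
```

    Higher scores give smaller P-values, so sorting regions by P-value puts the most strongly conserved candidates first.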

  2. Reconstruction of an ancestral Yersinia pestis genome and comparison with an ancient sequence

    PubMed Central

    2015-01-01

    Background: We propose the computational reconstruction of a whole bacterial ancestral genome at the nucleotide scale, and its validation by a sequence of ancient DNA. This rare possibility is offered by an ancient sequence of the late middle ages plague agent. It has been hypothesized to be ancestral to extant Yersinia pestis strains based on the pattern of nucleotide substitutions. But the dynamics of indels, duplications, insertion sequences and rearrangements has impacted all genomes much more than the substitution process, which makes the ancestral reconstruction task challenging. Results: We use a set of gene families from 13 Yersinia species, construct reconciled phylogenies for all of them, and determine gene orders in ancestral species. Gene trees integrate information from the sequence, the species tree and gene order. We reconstruct ancestral sequences for ancestral genic and intergenic regions, providing nearly a complete genome sequence for the ancestor, containing a chromosome and three plasmids. Conclusion: The comparison of the ancestral and ancient sequences provides a unique opportunity to assess the quality of ancestral genome reconstruction methods. But the quality of the sequencing and assembly of the ancient sequence can also be questioned by this comparison. PMID:26450112

  3. Basal Murphy belt and Chilhowee Group -- Sequence stratigraphic comparison

    SciTech Connect

    Aylor, J.G. Jr. (Dept. of Geology)

    1994-03-01

    The lower Murphy belt in the central western Blue Ridge is interpreted to be correlative to the Early Cambrian Chilhowee Group of the westernmost Blue Ridge and Appalachian fold and thrust belt. Basal Murphy belt depositional sequence stratigraphy represents a second-order, type-2 transgressive systems tract initiated with deposition of lowstand turbidites of the Dean Formation. These transgressive deposits of the Nantahala and Brasstown Formations are interpreted as middle to outer continental shelf deposits. Cyclic and stacked third-order regressive, coarsening-upward sequences of the Nantahala Formation display an overall increase in feldspar content stratigraphically upsection. These transgressive siliciclastic deposits are interpreted to be conformably overlain by a carbonate highstand systems tract of the Murphy Marble. Palinspastic reconstruction indicates that the Nantahala and Brasstown Formations possibly represent a basinward extension of a siliciclastic wedge up to 3 km thick. The wedge tapers to the southwest along the strike of the Murphy belt at 10° and thins northwestward to 2 km in the Tennessee depocenter, where it is represented by the Chilhowee Group. The Murphy belt basin is believed to represent a transitional rift-to-drift facies deposited on the lower plate of the southern Blue Ridge rift zone.

  4. Molecular evolution of the Escherichia coli chromosome. IV. Sequence comparisons.

    PubMed

    Milkman, R; Bridges, M M

    1993-03-01

    DNA sequences have been compared in a 4,400-bp region for Escherichia coli K12 and 36 ECOR strains. Discontinuities in degree of similarity, previously inferred, are confirmed in detail. Three clonal frames are described on the basis of the present local high-resolution data, as well as previous analyses of restriction fragment length polymorphism (RFLP) and of multilocus enzyme electrophoresis (MLEE) covering small regions more widely dispersed on the chromosome. These three approaches show important consistency. The data illustrate the fact that, in the limited context of intraspecific genomic sequence variation, clonality and homology are synonymous. Two estimable quantitative properties are defined: recency of common ancestry (the reciprocal of the log10 of the number of generations since the most recent common ancestor), and the number of nucleotide pairs over which a given recency of common ancestry applies. In principle, these parameters are measures of the degree and physical extent of homology. The small size of apparent recombinational replacements, together with the observation that they occasionally occur in discontinuous series, raises the question of whether they result from the superimposition of replacements of much larger size (as expected from an elementary interpretation of conjugation and transduction in experimental E. coli systems) or via an alternative mechanism. Length polymorphisms of several sorts are described. PMID:8095913
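
    The "recency of common ancestry" defined above is the reciprocal of the log10 of the number of generations since the most recent common ancestor. A minimal sketch of that definition follows; the function name and the example generation counts are illustrative assumptions, not values from the paper.

```python
import math


def recency_of_common_ancestry(generations: float) -> float:
    """Reciprocal of log10(generations since the most recent common ancestor).

    The definition is only meaningful for more than one generation,
    because log10(1) = 0 would make the reciprocal undefined.
    """
    if generations <= 1:
        raise ValueError("generations must be greater than 1")
    return 1.0 / math.log10(generations)


# Hypothetical generation counts: a more recent ancestor gives a larger value.
print(recency_of_common_ancestry(1e6))
print(recency_of_common_ancestry(1e9))
```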

  5. Graph Theory In Protein Sequence Clustering And Tertiary Structural Matching

    NASA Astrophysics Data System (ADS)

    Abdullah, Rosni; Rashid, Nur'Aini Abdul; Othman, Fazilah

    2008-01-01

    The principle of graph theory which has been widely used in computer networks is now being adopted for work in protein clustering, protein structural matching, and protein folding and modeling. In this work, we present two case studies on the use of graph theory for protein clustering and tertiary structural matching. In protein clustering, we extended a clustering algorithm based on a maximal clique while in the protein tertiary structural matching we explored the bipartite graph matching algorithm. The results obtained in both the case studies will be presented.
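
    The abstract does not describe the authors' extended clustering algorithm, so the sketch below only illustrates the underlying idea of maximal cliques, using the basic Bron-Kerbosch enumeration. The similarity graph, the protein labels and the function name are hypothetical; an edge is assumed to mean that two sequences were judged similar.

```python
def maximal_cliques(adjacency: dict) -> list:
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion.

    adjacency maps each vertex to the set of its neighbours (undirected graph).
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)  # nothing can extend it: maximal clique
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


# Hypothetical similarity graph: P1, P2, P3 are mutually similar; P4 only to P3.
graph = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P3"},
    "P3": {"P1", "P2", "P4"},
    "P4": {"P3"},
}
for clique in maximal_cliques(graph):
    print(sorted(clique))
```

    Each maximal clique is a group in which every pair of sequences is connected, which is one natural candidate for a cluster.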

  6. Hydrophobic Blocks Facilitate Lipid Compatibility and Translocon Recognition of Transmembrane Protein Sequences

    PubMed Central

    2016-01-01

    Biophysical hydrophobicity scales suggest that partitioning of a protein segment from an aqueous phase into a membrane is governed by its perceived segmental hydrophobicity but do not establish specifically (i) how the segment is identified in vivo for translocon-mediated insertion or (ii) whether the destination lipid bilayer is biochemically receptive to the inserted sequence. To examine the congruence between these dual requirements, we designed and synthesized a library of Lys-tagged peptides of a core length sufficient to span a bilayer but with varying patterns of sequence, each composed of nine Leu residues, nine Ser residues, and one (central) Trp residue. We found that peptides containing contiguous Leu residues (Leu-block peptides, e.g., LLLLLLLLLWSSSSSSSSS), in comparison to those containing discontinuous stretches of Leu residues (non-Leu-block peptides, e.g., SLSLLSLSSWSLLSLSLLS), displayed greater helicity (circular dichroism spectroscopy), traveled slower during sodium dodecyl sulfate–polyacrylamide gel electrophoresis, had longer reverse phase high-performance liquid chromatography retention times on a C-18 column, and were helical when reconstituted into 1-palmitoyl-2-oleoylglycero-3-phosphocholine liposomes, each observation indicating superior lipid compatibility when a Leu-block is present. These parameters were largely paralleled in a biological membrane insertion assay using microsomal membranes from dog pancreas endoplasmic reticulum, where we found only the Leu-block sequences successfully inserted; intriguingly, an amphipathic peptide (SLLSSLLSSWLLSSLLSSL; Leu face, Ser face) with biophysical properties similar to those of Leu-block peptides failed to insert. Our overall results identify local sequence lipid compatibility rather than average hydrophobicity as a principal determinant of transmembrane segment potential, while demonstrating that further subtleties of hydrophobic and helical patterning, such as circumferential hydrophobicity

  7. Hydrophobic blocks facilitate lipid compatibility and translocon recognition of transmembrane protein sequences.

    PubMed

    Stone, Tracy A; Schiller, Nina; von Heijne, Gunnar; Deber, Charles M

    2015-02-24

    Biophysical hydrophobicity scales suggest that partitioning of a protein segment from an aqueous phase into a membrane is governed by its perceived segmental hydrophobicity but do not establish specifically (i) how the segment is identified in vivo for translocon-mediated insertion or (ii) whether the destination lipid bilayer is biochemically receptive to the inserted sequence. To examine the congruence between these dual requirements, we designed and synthesized a library of Lys-tagged peptides of a core length sufficient to span a bilayer but with varying patterns of sequence, each composed of nine Leu residues, nine Ser residues, and one (central) Trp residue. We found that peptides containing contiguous Leu residues (Leu-block peptides, e.g., LLLLLLLLLWSSSSSSSSS), in comparison to those containing discontinuous stretches of Leu residues (non-Leu-block peptides, e.g., SLSLLSLSSWSLLSLSLLS), displayed greater helicity (circular dichroism spectroscopy), traveled slower during sodium dodecyl sulfate-polyacrylamide gel electrophoresis, had longer reverse phase high-performance liquid chromatography retention times on a C-18 column, and were helical when reconstituted into 1-palmitoyl-2-oleoylglycero-3-phosphocholine liposomes, each observation indicating superior lipid compatibility when a Leu-block is present. These parameters were largely paralleled in a biological membrane insertion assay using microsomal membranes from dog pancreas endoplasmic reticulum, where we found only the Leu-block sequences successfully inserted; intriguingly, an amphipathic peptide (SLLSSLLSSWLLSSLLSSL; Leu face, Ser face) with biophysical properties similar to those of Leu-block peptides failed to insert. 
    Our overall results identify local sequence lipid compatibility rather than average hydrophobicity as a principal determinant of transmembrane segment potential, while demonstrating that further subtleties of hydrophobic and helical patterning, such as circumferential hydrophobicity in
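
    The three example peptides quoted in the abstract have identical composition but different patterning, which is the contrast between average hydrophobicity and local sequence that the abstract draws. The sketch below checks this for the quoted sequences; the peptide labels and the choice of "longest contiguous Leu run" as a summary are illustrative assumptions, not the authors' analysis.

```python
from collections import Counter
from itertools import groupby

# Sequences as quoted in the abstract; the labels are for illustration only.
PEPTIDES = {
    "Leu-block": "LLLLLLLLLWSSSSSSSSS",
    "non-Leu-block": "SLSLLSLSSWSLLSLSLLS",
    "amphipathic": "SLLSSLLSSWLLSSLLSSL",
}


def longest_run(sequence: str, residue: str = "L") -> int:
    """Length of the longest contiguous stretch of the given residue."""
    runs = (len(list(group)) for char, group in groupby(sequence) if char == residue)
    return max(runs, default=0)


for name, seq in PEPTIDES.items():
    composition = dict(sorted(Counter(seq).items()))
    print(name, composition, "longest Leu run:", longest_run(seq))
```

    All three peptides contain nine Leu, nine Ser and one Trp, yet only the first has a long contiguous Leu stretch. The amphipathic peptide shows that run length alone would not capture the finer patterning effects the abstract mentions.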

  8. Nucleotide sequence variation of the envelope protein gene identifies two distinct genotypes of yellow fever virus.

    PubMed Central

    Chang, G J; Cropp, B C; Kinney, R M; Trent, D W; Gubler, D J

    1995-01-01

    The evolution of yellow fever virus over 67 years was investigated by comparing the nucleotide sequences of the envelope (E) protein genes of 20 viruses isolated in Africa, the Caribbean, and South America. Uniformly weighted parsimony algorithm analysis defined two major evolutionary yellow fever virus lineages designated E genotypes I and II. E genotype I contained viruses isolated from East and Central Africa. E genotype II viruses were divided into two sublineages: IIA viruses from West Africa and IIB viruses from America, except for a 1979 virus isolated from Trinidad (TRINID79A). Unique signature patterns were identified at 111 nucleotide and 12 amino acid positions within the yellow fever virus E gene by signature pattern analysis. Yellow fever viruses from East and Central Africa contained unique signatures at 60 nucleotide and five amino acid positions, those from West Africa contained unique signatures at 25 nucleotide and two amino acid positions, and viruses from America contained such signatures at 30 nucleotide and five amino acid positions in the E gene. The dissemination of yellow fever viruses from Africa to the Americas is supported by the close genetic relatedness of genotype IIA and IIB viruses and genetic evidence of a possible second introduction of yellow fever virus from West Africa, as illustrated by the TRINID79A virus isolate. The E protein genes of American IIB yellow fever viruses had higher frequencies of amino acid substitutions than did genes of yellow fever viruses of genotypes I and IIA on the basis of comparisons with a consensus amino acid sequence for the yellow fever E gene. The great variation in the E proteins of American yellow fever virus probably results from positive selection imposed by virus interaction with different species of mosquitoes or nonhuman primates in the Americas. PMID:7637022

  9. Molecular cloning and sequence analysis of expansins--a highly conserved, multigene family of proteins that mediate cell wall extension in plants.

    PubMed Central

    Shcherban, T Y; Shi, J; Durachko, D M; Guiltinan, M J; McQueen-Mason, S J; Shieh, M; Cosgrove, D J

    1995-01-01

    Expansins are unusual proteins discovered by virtue of their ability to mediate cell wall extension in plants. We identified cDNA clones for two cucumber expansins on the basis of peptide sequences of proteins purified from cucumber hypocotyls. The expansin cDNAs encode related proteins with signal peptides predicted to direct protein secretion to the cell wall. Northern blot analysis showed moderate transcript abundance in the growing region of the hypocotyl and no detectable transcripts in the nongrowing region. Rice and Arabidopsis expansin cDNAs were identified from collections of anonymous cDNAs (expressed sequence tags). Sequence comparisons indicate at least four distinct expansin cDNAs in rice and at least six in Arabidopsis. Expansins are highly conserved in size and sequence (60-87% amino acid sequence identity and 75-95% similarity between any pairwise comparison), and phylogenetic trees indicate that this multigene family formed before the evolutionary divergence of monocotyledons and dicotyledons. Sequence and motif analyses show no similarities to known functional domains that might account for expansin action on wall extension. A series of highly conserved tryptophans may function in expansin binding to cellulose or other glycans. The high conservation of this multigene family indicates that the mechanism by which expansins promote wall extension tolerates little variation in protein structure. PMID:7568110

  10. Snake venom. The amino acid sequence of protein A from Dendroaspis polylepis polylepis (black mamba) venom.

    PubMed

    Joubert, F J; Strydom, D J

    1980-12-01

    Protein A from Dendroaspis polylepis polylepis venom comprises 81 amino acids, including ten half-cystine residues. The complete primary structures of protein A and its variant A' were elucidated. The sequences of proteins A and A', which differ in a single position, show no homology with various neurotoxins and non-neurotoxic proteins and represent a new type of elapid venom protein. PMID:7461607

  11. Investigation of the protein osteocalcin of Camelops hesternus: Sequence, structure and phylogenetic implications

    NASA Astrophysics Data System (ADS)

    Humpula, James F.; Ostrom, Peggy H.; Gandhi, Hasand; Strahler, John R.; Walker, Angela K.; Stafford, Thomas W.; Smith, James J.; Voorhies, Michael R.; George Corner, R.; Andrews, Phillip C.

    2007-12-01

    Ancient DNA sequences offer an extraordinary opportunity to unravel the evolutionary history of ancient organisms. Protein sequences offer another reservoir of genetic information that has recently become tractable through the application of mass spectrometric techniques. The extent to which ancient protein sequences resolve phylogenetic relationships, however, has not been explored. We determined the osteocalcin amino acid sequence from the bone of an extinct Camelid (21 ka, Camelops hesternus) excavated from Isleta Cave, New Mexico and three bones of extant camelids: bactrian camel (Camelus bactrianus); dromedary camel (Camelus dromedarius) and guanaco (Llama guanacoe) for a diagenetic and phylogenetic assessment. There was no difference in sequence among the four taxa. Structural attributes observed in both modern and ancient osteocalcin include a post-translation modification, Hyp 9, deamidation of Gln 35 and Gln 39, and oxidation of Met 36. Carbamylation of the N-terminus in ancient osteocalcin may result in blockage and explain previous difficulties in sequencing ancient proteins via Edman degradation. A phylogenetic analysis using osteocalcin sequences of 25 vertebrate taxa was conducted to explore osteocalcin protein evolution and the utility of osteocalcin sequences for delineating phylogenetic relationships. The maximum likelihood tree closely reflected generally recognized taxonomic relationships. For example, maximum likelihood analysis recovered rodents, birds and, within hominins, the Homo-Pan-Gorilla trichotomy. Within Artiodactyla, character state analysis showed that a substitution of Pro 4 for His 4 defines the Capra-Ovis clade within Artiodactyla. Homoplasy in our analysis indicated that osteocalcin evolution is not a perfect indicator of species evolution. Limited sequence availability prevented assigning functional significance to sequence changes. Our preliminary analysis of osteocalcin evolution represents an initial step towards a

  12. Comparison of Sequencing (Barcode Region) and Sequence-Tagged-Site PCR for Blastocystis Subtyping

    PubMed Central

    2013-01-01

    Blastocystis is the most common nonfungal microeukaryote of the human intestinal tract and comprises numerous subtypes (STs), nine of which have been found in humans (ST1 to ST9). While efforts continue to explore the relationship between human health status and subtypes, no consensus regarding subtyping methodology exists. It has been speculated that differences detected in subtype distribution in various cohorts may to some extent reflect different approaches. Blastocystis subtypes have been determined primarily in one of two ways: (i) sequencing of small subunit rRNA gene (SSU-rDNA) PCR products and (ii) PCR with subtype-specific sequence-tagged-site (STS) diagnostic primers. Here, STS primers were evaluated against a panel of samples (n = 58) already subtyped by SSU-rDNA sequencing (barcode region), including subtypes for which STS primers are not available, and a small panel of DNAs from four other eukaryotes often present in feces (n = 18). Although the STS primers appeared to be highly specific, their sensitivity was only moderate, and the results indicated that some infections may go undetected when this method is used. False-negative STS results were not linked exclusively to certain subtypes or alleles, and evidence of substantial genetic variation in STS loci was obtained. Since the majority of DNAs included here were extracted from feces, it is possible that STS primers may generally work better with DNAs extracted from Blastocystis cultures. In conclusion, due to its higher applicability and sensitivity, and since sequence information is useful for other forms of research, SSU-rDNA barcoding is recommended as the method of choice for Blastocystis subtyping. PMID:23115257

  13. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    NASA Astrophysics Data System (ADS)

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-12-01

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences.
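
    Boltzmann inversion, one of the ingredients named above, turns a sampled probability distribution into an effective potential via U = -kT ln P. The sketch below shows only that single step on a made-up histogram; it is not the CAMELOT pipeline, which according to the abstract also involves non-linear regression and Gaussian process Bayesian optimization. The function name, the bin probabilities and the choice of kT = 1 are assumptions for illustration.

```python
import math


def boltzmann_invert(probabilities, kT: float = 1.0):
    """Effective potential U_i = -kT * ln(p_i), shifted so that its minimum is 0.

    Bins that were never sampled (p = 0) are assigned an infinite energy.
    """
    energies = [-kT * math.log(p) if p > 0 else math.inf for p in probabilities]
    finite = [e for e in energies if math.isfinite(e)]
    if not finite:
        raise ValueError("at least one bin must have non-zero probability")
    shift = min(finite)
    return [e - shift for e in energies]


# Hypothetical normalized histogram of a coarse-grained coordinate.
histogram = [0.0, 0.25, 0.5, 0.25]
print(boltzmann_invert(histogram))
```

    The most populated bin maps to the minimum of the potential, and less populated bins map to higher energies.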

  14. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    SciTech Connect

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-12-28

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences.

  15. Full validation of therapeutic antibody sequences by middle-up mass measurements and middle-down protein sequencing

    PubMed Central

    Resemann, Anja; Jabs, Wolfgang; Wiechmann, Anja; Wagner, Elsa; Colas, Olivier; Evers, Waltraud; Belau, Eckhard; Vorwerg, Lars; Evans, Catherine; Beck, Alain; Suckau, Detlev

    2016-01-01

    The regulatory bodies request full sequence data assessment both for innovator and biosimilar monoclonal antibodies (mAbs). Full sequence coverage is typically used to verify the integrity of the analytical data obtained following the combination of multiple LC-MS/MS datasets from orthogonal protease digests (so called “bottom-up” approaches). Top-down or middle-down mass spectrometric approaches have the potential to minimize artifacts, reduce overall analysis time and provide orthogonality to this traditional approach. In this work we report a new combined approach involving middle-up LC-QTOF and middle-down LC-MALDI in-source decay (ISD) mass spectrometry. This was applied to cetuximab, panitumumab and natalizumab, selected as representative US Food and Drug Administration- and European Medicines Agency-approved mAbs. The goal was to unambiguously confirm their reference sequences and examine the general applicability of this approach. Furthermore, a new measure for assessing the integrity and validity of results from middle-down approaches is introduced – the “Sequence Validation Percentage.” Full sequence data assessment of the 3 antibodies was achieved enabling all 3 sequences to be fully validated by a combination of middle-up molecular weight determination and middle-down protein sequencing. Three errors in the reference amino acid sequence of natalizumab, causing a cumulative mass shift of only −2 Da in the natalizumab Fd domain, were corrected as a result of this work. PMID:26760197

  16. Full validation of therapeutic antibody sequences by middle-up mass measurements and middle-down protein sequencing.

    PubMed

    Resemann, Anja; Jabs, Wolfgang; Wiechmann, Anja; Wagner, Elsa; Colas, Olivier; Evers, Waltraud; Belau, Eckhard; Vorwerg, Lars; Evans, Catherine; Beck, Alain; Suckau, Detlev

    2016-01-01

    The regulatory bodies request full sequence data assessment both for innovator and biosimilar monoclonal antibodies (mAbs). Full sequence coverage is typically used to verify the integrity of the analytical data obtained following the combination of multiple LC-MS/MS datasets from orthogonal protease digests (so called "bottom-up" approaches). Top-down or middle-down mass spectrometric approaches have the potential to minimize artifacts, reduce overall analysis time and provide orthogonality to this traditional approach. In this work we report a new combined approach involving middle-up LC-QTOF and middle-down LC-MALDI in-source decay (ISD) mass spectrometry. This was applied to cetuximab, panitumumab and natalizumab, selected as representative US Food and Drug Administration- and European Medicines Agency-approved mAbs. The goal was to unambiguously confirm their reference sequences and examine the general applicability of this approach. Furthermore, a new measure for assessing the integrity and validity of results from middle-down approaches is introduced - the "Sequence Validation Percentage." Full sequence data assessment of the 3 antibodies was achieved enabling all 3 sequences to be fully validated by a combination of middle-up molecular weight determination and middle-down protein sequencing. Three errors in the reference amino acid sequence of natalizumab, causing a cumulative mass shift of only -2 Da in the natalizumab Fd domain, were corrected as a result of this work. PMID:26760197

  17. UET: a database of evolutionarily-predicted functional determinants of protein sequences that cluster as functional sites in protein structures.

    PubMed

    Lua, Rhonald C; Wilson, Stephen J; Konecki, Daniel M; Wilkins, Angela D; Venner, Eric; Morgan, Daniel H; Lichtarge, Olivier

    2016-01-01

    The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. PMID:26590254

  18. UET: a database of evolutionarily-predicted functional determinants of protein sequences that cluster as functional sites in protein structures

    PubMed Central

    Lua, Rhonald C.; Wilson, Stephen J.; Konecki, Daniel M.; Wilkins, Angela D.; Venner, Eric; Morgan, Daniel H.; Lichtarge, Olivier

    2016-01-01

    The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. PMID:26590254

  19. In Silico Characterization of Pectate Lyase Protein Sequences from Different Source Organisms

    PubMed Central

    Dubey, Amit Kumar; Yadav, Sangeeta; Kumar, Manish; Singh, Vinay Kumar; Sarangi, Bijaya Ketan; Yadav, Dinesh

    2010-01-01

    A total of 121 protein sequences of pectate lyases were subjected to homology search, multiple sequence alignment, phylogenetic tree construction, and motif analysis. The phylogenetic tree constructed revealed different clusters based on different source organisms representing bacterial, fungal, plant, and nematode pectate lyases. The multiple accessions of bacterial, fungal, nematode, and plant pectate lyase protein sequences were placed closely revealing a sequence level similarity. The multiple sequence alignment of these pectate lyase protein sequences from different source organisms showed conserved regions at different stretches with maximum homology from amino acid residues 439–467, 715–816, and 829–910 which could be used for designing degenerate primers or probes specific for pectate lyases. The motif analysis revealed a conserved Pec_Lyase_C domain uniformly observed in all pectate lyases irrespective of variable sources suggesting its possible role in structural and enzymatic functions. PMID:21048874

  20. Using evolutionary sequence variation to make inferences about protein structure and function

    NASA Astrophysics Data System (ADS)

    Colwell, Lucy

    2015-03-01

    The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. The explosive growth in the number of available protein sequences raises the possibility of using the natural variation present in homologous protein sequences to infer these constraints and thus identify residues that control different protein phenotypes. Because in many cases phenotypic changes are controlled by more than one amino acid, the mutations that separate one phenotype from another may not be independent, requiring us to understand the correlation structure of the data. To address this we build a maximum entropy probability model for the protein sequence. The parameters of the inferred model are constrained by the statistics of a large sequence alignment. Pairs of sequence positions with the strongest interactions accurately predict contacts in protein tertiary structure, enabling all atom structural models to be constructed. We describe development of a theoretical inference framework that enables the relationship between the amount of available input data and the reliability of structural predictions to be better understood.
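
    The idea of scoring pairs of alignment positions by how strongly they co-vary can be illustrated with a much simpler statistic than the maximum entropy model described in this record. The sketch below ranks column pairs by mutual information, which is a substitute technique and not the model from the abstract; the function names and the toy alignment are hypothetical. Unlike a maximum entropy model, mutual information does not separate direct couplings from indirect correlations.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


def rank_column_pairs(alignment):
    """Return ((i, j), score) tuples sorted from strongest to weakest covariation."""
    columns = list(zip(*alignment))  # transpose rows (sequences) into columns
    scores = {
        (i, j): mutual_information(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment: columns 0 and 1 change together, column 3 varies independently.
toy_alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]
for pair, score in rank_column_pairs(toy_alignment)[:3]:
    print(pair, round(score, 3))
```

    On real alignments the top-ranked pairs from a score like this would then be compared against residue contacts in a known structure.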

  1. Comparative sequence analysis of double stranded RNA binding protein encoding gene of parapoxviruses from Indian camels.

    PubMed

    Nagarajan, G; Swami, Shelesh Kumar; Dahiya, Shyam Singh; Sivakumar, G; Tuteja, F C; Narnaware, S D; Mehta, S C; Singh, Raghvendar; Patil, N V

    2014-03-01

    The dsRNA binding protein (RBP) encoding gene of parapoxviruses (PPVs) from Dromedary camels inhabiting different geographical regions of Rajasthan, India, was amplified by polymerase chain reaction using the primers of pseudocowpoxvirus (PCPV) from Finnish reindeer and cloned into pGEM-T for sequence analysis. Analysis of the RBP encoding gene revealed that PPV DNA from Bikaner shared 98.3% and 76.6% sequence identity at the amino acid level with Pali and Udaipur PPV DNA, respectively. Reference strains of Bovine papular stomatitis virus (BPSV) and PCPV (reindeer PCPV and human PCPV) shared 52.8% and 86.9% amino acid identity with the RBP gene of camel PPVs from Bikaner, respectively. In contrast, different strains of orf virus (ORFV) from different geographical areas of the world shared 69.5-71.7% amino acid identity with the RBP gene of camel PPVs from Bikaner. These findings indicate that the camel PPVs described are more closely related to bovine PPV (PCPV) than to caprine and ovine PPV (ORFV). PMID:25685494
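
    Percent identity figures like those quoted in this record are computed from an alignment. The following is a minimal sketch, assuming the two sequences are already aligned to equal length and that identity is counted over columns where neither sequence has a gap; the function name and the toy fragments are hypothetical and are not data from the study.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical aligned amino acid fragments, for illustration only.
isolate_1 = "MKTAYIAKQR-QISFV"
isolate_2 = "MKTAYIAKQRAQISFV"
isolate_3 = "MKSAYLAKHR-QLSFI"
print(round(percent_identity(isolate_1, isolate_2), 1))
print(round(percent_identity(isolate_1, isolate_3), 1))
```

    Other conventions divide by the full alignment length or by the length of the shorter sequence, so reported values are only comparable when the denominator is stated.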

  2. Secure distributed genome analysis for GWAS and sequence comparison computation

    PubMed Central

    2015-01-01

    Background The rapid increase in the availability and volume of genomic data makes significant advances in biomedical research possible, but sharing of genomic data poses challenges due to the highly sensitive nature of such data. To address the challenges, a competition for secure distributed processing of genomic data was organized by the iDASH research center. Methods In this work we propose techniques for securing computation with real-life genomic data for minor allele frequency and chi-squared statistics computation, as well as distance computation between two genomic sequences, as specified by the iDASH competition tasks. We put forward novel optimizations, including a generalization of a version of mergesort, which might be of independent interest. Results We provide implementation results of our techniques based on secret sharing that demonstrate practicality of the suggested protocols and also report on performance improvements due to our optimization techniques. Conclusions This work describes our techniques, findings, and experimental results developed and obtained as part of iDASH 2015 research competition to secure real-life genomic computations and shows feasibility of securely computing with genomic data in practice. PMID:26733307
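
    The record mentions secret sharing as the basis for computing minor allele frequencies without revealing individual inputs. The sketch below shows plain additive secret sharing used to add up private allele counts; it is not the authors' protocol, and the modulus, function names, and counts are all assumptions chosen for illustration.

```python
import secrets

PRIME = 2_147_483_647  # assumed modulus (2**31 - 1), large enough for these toy counts


def share(value: int, n_parties: int) -> list[int]:
    """Split value into n additive shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME


def secure_sum(private_values: list[int], n_parties: int = 3) -> int:
    """Each data owner shares its value; parties add shares locally; only the total is opened."""
    party_totals = [0] * n_parties
    for value in private_values:
        for i, s in enumerate(share(value, n_parties)):
            party_totals[i] = (party_totals[i] + s) % PRIME
    return reconstruct(party_totals)


def minor_allele_frequency(allele_count: int, total_alleles: int) -> float:
    """Frequency of the less common allele at a biallelic site."""
    freq = allele_count / total_alleles
    return min(freq, 1.0 - freq)


# Hypothetical per-site allele counts held by three data owners.
alt_counts = [12, 30, 7]
allele_totals = [100, 200, 50]
alt = secure_sum(alt_counts)
total = secure_sum(allele_totals)
print(alt, total, minor_allele_frequency(alt, total))
```

    Each individual share is a uniformly random value, so no single party learns an owner's count; a real deployment would also need secure channels and a defined threat model, which this sketch omits.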

  3. Characterization of DNA-protein interactions using high-throughput sequencing data from pulldown experiments

    NASA Astrophysics Data System (ADS)

    Moreland, Blythe; Oman, Kenji; Curfman, John; Yan, Pearlly; Bundschuh, Ralf

    Methyl-binding domain (MBD) protein pulldown experiments have been a valuable tool in measuring the levels of methylated CpG dinucleotides. Due to the frequent use of this technique, high-throughput sequencing data sets are available that allow a detailed quantitative characterization of the underlying interaction between methylated DNA and MBD proteins. Analyzing such data sets, we first found that two such proteins cannot bind closer to each other than 2 bp, consistent with structural models of the DNA-protein interaction. Second, the large amount of sequencing data allowed us to find rather weak but nevertheless clearly statistically significant sequence preferences for several bases around the required CpG. These results demonstrate that pulldown sequencing is a high-precision tool in characterizing DNA-protein interactions. This material is based upon work supported by the National Science Foundation under Grant No. DMR-1410172.

  4. Genome-Wide SNP Calling from Genotyping by Sequencing (GBS) Data: A Comparison of Seven Pipelines and Two Sequencing Technologies

    PubMed Central

    Torkamaneh, Davoud; Laroche, Jérôme; Belzile, François

    2016-01-01

    Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways including new methods of high throughput genotyping. Genotyping-by-sequencing (GBS) has been demonstrated to be a robust and cost-effective genotyping method capable of producing thousands to millions of SNPs across a wide range of species. Undoubtedly, the greatest barrier to its broader use is the challenge of data analysis. Herein we describe a comprehensive comparison of seven GBS bioinformatics pipelines developed to process raw GBS sequence data into SNP genotypes. We compared five pipelines requiring a reference genome (TASSEL-GBS v1 & v2, Stacks, IGST, and Fast-GBS) and two de novo pipelines that do not require a reference genome (UNEAK and Stacks). Using Illumina sequence data from a set of 24 re-sequenced soybean lines, we performed SNP calling with these pipelines and compared the GBS SNP calls with the re-sequencing data to assess their accuracy. The number of SNPs called without a reference genome was lower (13k to 24k) than with a reference genome (25k to 54k SNPs) while accuracy was high (92.3 to 98.7%) for all but one pipeline (TASSEL-GBSv1, 76.1%). Among pipelines offering a high accuracy (>95%), Fast-GBS called the greatest number of polymorphisms (close to 35,000 SNPs + Indels) and yielded the highest accuracy (98.7%). Using Ion Torrent sequence data for the same 24 lines, we compared the performance of Fast-GBS with that of TASSEL-GBSv2. It again called more polymorphisms (25.8K vs 22.9K) and these proved more accurate (95.2 vs 91.1%). Typically, SNP catalogues called from the same sequencing data using different pipelines resulted in highly overlapping SNP catalogues (79–92% overlap). In contrast, overlap between SNP catalogues obtained using the same pipeline but different sequencing technologies was less extensive (~50–70%). PMID:27547936
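
    The overlap between two SNP catalogues can be computed with set operations once each SNP is keyed by chromosome and position. The abstract does not say how overlap was defined, so the sketch below assumes shared SNPs divided by the union of both catalogues; the function name and the coordinates are hypothetical.

```python
def catalogue_overlap(snps_a, snps_b) -> float:
    """Percent overlap between two SNP catalogues, as shared / union (an assumed definition)."""
    a, b = set(snps_a), set(snps_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


# Hypothetical catalogues keyed by (chromosome, position).
pipeline_1 = [("Gm01", 100), ("Gm01", 250), ("Gm02", 40), ("Gm03", 7)]
pipeline_2 = [("Gm01", 100), ("Gm02", 40), ("Gm03", 7), ("Gm04", 9)]
print(catalogue_overlap(pipeline_1, pipeline_2))
```

    Dividing by the size of the smaller catalogue instead of the union gives a higher figure for the same data, so the choice of denominator should be reported alongside the percentage.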

  5. Genome-Wide SNP Calling from Genotyping by Sequencing (GBS) Data: A Comparison of Seven Pipelines and Two Sequencing Technologies.

    PubMed

    Torkamaneh, Davoud; Laroche, Jérôme; Belzile, François

    2016-01-01

    Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways including new methods of high throughput genotyping. Genotyping-by-sequencing (GBS) has been demonstrated to be a robust and cost-effective genotyping method capable of producing thousands to millions of SNPs across a wide range of species. Undoubtedly, the greatest barrier to its broader use is the challenge of data analysis. Herein we describe a comprehensive comparison of seven GBS bioinformatics pipelines developed to process raw GBS sequence data into SNP genotypes. We compared five pipelines requiring a reference genome (TASSEL-GBS v1 & v2, Stacks, IGST, and Fast-GBS) and two de novo pipelines that do not require a reference genome (UNEAK and Stacks). Using Illumina sequence data from a set of 24 re-sequenced soybean lines, we performed SNP calling with these pipelines and compared the GBS SNP calls with the re-sequencing data to assess their accuracy. The number of SNPs called without a reference genome was lower (13k to 24k) than with a reference genome (25k to 54k SNPs) while accuracy was high (92.3 to 98.7%) for all but one pipeline (TASSEL-GBSv1, 76.1%). Among pipelines offering a high accuracy (>95%), Fast-GBS called the greatest number of polymorphisms (close to 35,000 SNPs + Indels) and yielded the highest accuracy (98.7%). Using Ion Torrent sequence data for the same 24 lines, we compared the performance of Fast-GBS with that of TASSEL-GBSv2. It again called more polymorphisms (25.8K vs 22.9K) and these proved more accurate (95.2 vs 91.1%). Typically, SNP catalogues called from the same sequencing data using different pipelines resulted in highly overlapping SNP catalogues (79-92% overlap). In contrast, overlap between SNP catalogues obtained using the same pipeline but different sequencing technologies was less extensive (~50-70%). PMID:27547936

  6. Plus ça change – evolutionary sequence divergence predicts protein subcellular localization signals

    PubMed Central

    2014-01-01

    Background Protein subcellular localization is a central problem in understanding cell biology and has been the focus of intense research. In order to predict localization from amino acid sequence a myriad of features have been tried: including amino acid composition, sequence similarity, the presence of certain motifs or domains, and many others. Surprisingly, sequence conservation of sorting motifs has not yet been employed, despite its extensive use for tasks such as the prediction of transcription factor binding sites. Results Here, we flip the problem around, and present a proof of concept for the idea that the lack of sequence conservation can be a novel feature for localization prediction. We show that for yeast, mammal and plant datasets, evolutionary sequence divergence alone has significant power to identify sequences with N-terminal sorting sequences. Moreover sequence divergence is nearly as effective when computed on automatically defined ortholog sets as on hand curated ones. Unfortunately, sequence divergence did not necessarily increase classification performance when combined with some traditional sequence features such as amino acid composition. However a post-hoc analysis of the proteins in which sequence divergence changes the prediction yielded some proteins with atypical (i.e. not MPP-cleaved) matrix targeting signals as well as a few misannotations. Conclusion We report the results of the first quantitative study of the effectiveness of evolutionary sequence divergence as a feature for protein subcellular localization prediction. We show that divergence is indeed useful for prediction, but it is not trivial to improve overall accuracy simply by adding this feature to classical sequence features. Nevertheless we argue that sequence divergence is a promising feature and show anecdotal examples in which it succeeds where other features fail. PMID:24438075

  7. Fast computational methods for predicting protein structure from primary amino acid sequence

    DOEpatents

    Agarwal, Pratul Kumar

    2011-07-19

    The present invention provides a method utilizing primary amino acid sequence of a protein, energy minimization, molecular dynamics and protein vibrational modes to predict three-dimensional structure of a protein. The present invention also determines possible intermediates in the protein folding pathway. The present invention has important applications to the design of novel drugs as well as protein engineering. The present invention predicts the three-dimensional structure of a protein independent of size of the protein, overcoming a significant limitation in the prior art.

  8. Proteins comparison through probabilistic optimal structure local alignment

    PubMed Central

    Micale, Giovanni; Pulvirenti, Alfredo; Giugno, Rosalba; Ferro, Alfredo

    2014-01-01

    Multiple local structure comparison helps to identify common structural motifs or conserved binding sites in 3D structures in distantly related proteins. Since there is no best way to compare structures and evaluate the alignment, a wide variety of techniques and different similarity scoring schemes have been proposed. Existing algorithms usually compute the best superposition of two structures or attempt to solve it as an optimization problem in a simpler setting (e.g., considering contact maps or distance matrices). Here, we present PROPOSAL (PROteins comparison through Probabilistic Optimal Structure local ALignment), a stochastic algorithm based on iterative sampling for multiple local alignment of protein structures. Our method can efficiently find conserved motifs across a set of protein structures. Only the distances between all pairs of residues in the structures are computed. To show the accuracy and the effectiveness of PROPOSAL we tested it on a few families of protein structures. We also compared PROPOSAL with two state-of-the-art tools for pairwise local alignment on a dataset of manually annotated motifs. PROPOSAL is available as a Java 2D standalone application or a command line program at http://ferrolab.dmi.unict.it/proposal/proposal.html. PMID:25228906
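
    The abstract notes that only the distances between all pairs of residues are computed. A minimal sketch of that single step follows, assuming each residue is represented by one 3D coordinate (for example its C-alpha atom); it does not reproduce the PROPOSAL sampling algorithm, and the function name and coordinates are hypothetical.

```python
import math


def distance_matrix(coords):
    """Symmetric matrix of Euclidean distances between all pairs of residue coordinates."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])  # requires Python 3.8+
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Hypothetical residue coordinates in angstroms.
residues = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
for row in distance_matrix(residues):
    print(row)
```

    A distance matrix of this kind does not change when the structure is rotated or translated, which is why it can be compared across proteins without first superposing them.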

  9. Self-organizing fuzzy graphs for structure-based comparison of protein pockets.

    PubMed

    Reisen, Felix; Weisel, Martin; Kriegl, Jan M; Schneider, Gisbert

    2010-12-01

    Patterns of receptor-ligand interaction can be conserved in functionally equivalent proteins even in the absence of sequence homology. Therefore, structural comparison of ligand-binding pockets and their pharmacophoric features allow for the characterization of so-called "orphan" proteins with known three-dimensional structure but unknown function, and predict ligand promiscuity of binding pockets. We present an algorithm for rapid pocket comparison (PoLiMorph), in which protein pockets are represented by self-organizing graphs that fill the volume of the cavity. Vertices in these three-dimensional frameworks contain information about the local ligand-receptor interaction potential coded by fuzzy property labels. For framework matching, we developed a fast heuristic based on the maximum dispersion problem, as an alternative to techniques utilizing clique detection or geometric hashing algorithms. A sophisticated scoring function was applied that incorporates knowledge about property distributions and ligand-receptor interaction patterns. In an all-against-all virtual screening experiment with 207 pocket frameworks extracted from a subset of PDBbind, PoLiMorph correctly assigned 81% of 69 distinct structural classes and demonstrated sustained ability to group pockets accommodating the same ligand chemotype. We determined a score threshold that indicates "true" pocket similarity with high reliability, which not only supports structure-based drug design but also allows for sequence-independent studies of the proteome. PMID:20883038

  10. Circular Helix-Like Curve: An Effective Tool of Biological Sequence Analysis and Comparison.

    PubMed

    Li, Yushuang; Xiao, Wenli

    2016-01-01

    This paper constructed a novel injection from a DNA sequence to a 3D graph, named circular helix-like curve (CHC). The presented graphical representation is available for visualizing characterizations of a single DNA sequence and identifying similarities and differences among several DNAs. A 12-dimensional vector extracted from CHC, as a numerical characterization of CHC, was applied to analyze phylogenetic relationships of 11 species, 74 ribosomal RNAs, 48 Hepatitis E viruses, and 18 eutherian mammals, respectively. Successful experiments illustrated that CHC is an effective tool of biological sequence analysis and comparison. PMID:27403205

  11. Circular Helix-Like Curve: An Effective Tool of Biological Sequence Analysis and Comparison

    PubMed Central

    Li, Yushuang

    2016-01-01

    This paper constructed a novel injection from a DNA sequence to a 3D graph, named circular helix-like curve (CHC). The presented graphical representation is available for visualizing characterizations of a single DNA sequence and identifying similarities and differences among several DNAs. A 12-dimensional vector extracted from CHC, as a numerical characterization of CHC, was applied to analyze phylogenetic relationships of 11 species, 74 ribosomal RNAs, 48 Hepatitis E viruses, and 18 eutherian mammals, respectively. Successful experiments illustrated that CHC is an effective tool of biological sequence analysis and comparison. PMID:27403205

  12. Correlation between Protein Sequence Similarity and Crystallization Reagents in the Biological Macromolecule Crystallization Database

    PubMed Central

    Lu, Hui-Meng; Yin, Da-Chuan; Liu, Yong-Ming; Guo, Wei-Hong; Zhou, Ren-Bin

    2012-01-01

    The number of protein structure entries has grown far more slowly than the number of sequence entries. This is partly due to the bottleneck in obtaining diffraction-quality protein crystals for structure determination by X-ray crystallography. The first step in protein crystallization is to find suitable chemical reagents, which is not an easy task. Exhaustive trial-and-error tests of numerous combinations of different reagents mixed with the protein solution are usually necessary to screen for suitable crystallization conditions. Any attempt to help find suitable reagents for protein crystallization is therefore useful. In this paper, an analysis of the relationship between the protein sequence similarity and the crystallization reagents according to the information from the existing databases is presented. We extracted information of reagents and sequences from the Biological Macromolecule Crystallization Database (BMCD) and the Protein Data Bank (PDB) database, classified the proteins into different clusters according to the sequence similarity, and statistically analyzed the relationship between the sequence similarity and the crystallization reagents. The results showed that there is a pronounced positive correlation between them. Therefore, according to this correlation, it is possible to predict feasible chemical reagents for use in crystallization screens for a specific protein. PMID:22949812

  13. Ribosomal DNA ITS-1 and ITS-2 sequence comparisons as a tool for predicting genetic relatedness.

    PubMed

    Coleman, A W; Mai, J C

    1997-08-01

    The determination of the secondary structure of the internal transcribed spacer (ITS) regions separating nuclear ribosomal RNA genes of Chlorophytes has improved the fidelity of alignment of nuclear ribosomal ITS sequences from related organisms. Application of this information to sequences from green algae and plants suggested that a subset of the ITS-2 positions is relatively conserved. Organisms that can mate are identical at all of these 116 positions, or differ by at most, one nucleotide change. Here we sequenced and compared the ITS-1 and ITS-2 of 40 green flagellates in search of the nearest relative to Chlamydomonas reinhardtii. The analysis clearly revealed one unique candidate, C. incerta. Several ancillary benefits of the analysis included the identification of mislabelled cultures, the resolution of confusion concerning C. smithii, the discovery of misidentified sequences in GenBank derived from a green algal contaminant, and an overview of evolutionary relationships among the Volvocales, which is congruent with that derived from rDNA gene sequence comparisons but improves upon its resolution. The study further delineates the taxonomic level at which ITS sequences, in comparison to ribosomal gene sequences, are most useful in systematic and other studies. PMID:9236277

  14. The cleavable pre-sequence of an imported chloroplast protein directs attached polypeptides into yeast mitochondria

    PubMed Central

    Hurt, Eduard C.; Soltanifar, Nouchine; Goldschmidt-Clermont, Michel; Rochaix, Jean-David; Schatz, Gottfried

    1986-01-01

    The cleavable pre-sequences of imported chloroplast and mitochondrial proteins have several features in common. This structural similarity prompted us to test whether a chloroplast pre-sequence ('transit peptide') can also be decoded by the mitochondrial import machinery. In the green alga, Chlamydomonas reinhardtii, the small subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) (a chloroplast protein) is nuclear-encoded and synthesized in the cytosol with a transient pre-sequence of 45 residues. The 31 amino-terminal residues of this chloroplast pre-sequence were fused to mouse dihydrofolate reductase (a cytosolic protein) and to yeast cytochrome oxidase subunit IV (an imported mitochondrial protein) from which the authentic pre-sequence had been removed. The chloroplast pre-sequence transported both attached proteins into the yeast mitochondrial matrix or inner membrane, although it functioned less efficiently than an authentic mitochondrial pre-sequence. We conclude that mitochondrial and chloroplast pre-sequences perform their function by a similar mechanism. PMID:16453686

  15. Contributions of the Prion Protein Sequence, Strain, and Environment to the Species Barrier.

    PubMed

    Sharma, Aditi; Bruce, Kathryn L; Chen, Buxin; Gyoneva, Stefka; Behrens, Sven H; Bommarius, Andreas S; Chernoff, Yury O

    2016-01-15

    Amyloid propagation requires high levels of sequence specificity so that only molecules with very high sequence identity can form cross-β-sheet structures of sufficient stringency for incorporation into the amyloid fibril. This sequence specificity presents a barrier to the transmission of prions between two species with divergent sequences, termed a species barrier. Here we study the relative effects of protein sequence, seed conformation, and environment on the species barrier strength and specificity for the yeast prion protein Sup35p from three closely related species of the Saccharomyces sensu stricto group; namely, Saccharomyces cerevisiae, Saccharomyces bayanus, and Saccharomyces paradoxus. Through in vivo plasmid shuffle experiments, we show that the major characteristics of the transmission barrier and conformational fidelity are determined by the protein sequence rather than by the cellular environment. In vitro data confirm that the kinetics and structural preferences of aggregation of the S. paradoxus and S. bayanus proteins are influenced by anions in accordance with their positions in the Hofmeister series, as observed previously for S. cerevisiae. However, the specificity of the species barrier is primarily affected by the sequence and the type of anion present during the formation of the initial seed, whereas anions present during the seeded aggregation process typically influence kinetics rather than the specificity of prion conversion. Therefore, our work shows that the protein sequence and the conformation variant (strain) of the prion seed are the primary determinants of cross-species prion specificity both in vivo and in vitro. PMID:26565023

  16. Comparison of alignment software for genome-wide bisulphite sequence data

    PubMed Central

    Chatterjee, Aniruddha; Stockwell, Peter A.; Rodger, Euan J.; Morison, Ian M.

    2012-01-01

    Recent advances in next generation sequencing (NGS) technology now provide the opportunity to rapidly interrogate the methylation status of the genome. However, there are challenges in handling and interpretation of the methylation sequence data because of its large volume and the consequences of bisulphite modification. We sequenced reduced representation human genomes on the Illumina platform and efficiently mapped and visualized the data with different pipelines and software packages. We examined three pipelines for aligning bisulphite converted sequencing reads and compared their performance. We also comment on pre-processing and quality control of Illumina data. This comparison highlights differences in methods for NGS data processing and provides guidance to advance sequence-based methylation data analysis for molecular biologists. PMID:22344695

  17. A general function of noncoding polynucleotide sequences. Mass binding of transconformational proteins.

    PubMed

    Zuckerkandl, E

    1981-05-22

    It is proposed that a general function of noncoding DNA and RNA sequences in higher organisms (intergenic and intervening sequences) is to provide multiple binding sites over long stretches of polynucleotide for certain types of regulatory proteins. Through the building up or abolishing of high-order structures, these proteins either sequester sites for the control of, e.g., transcription or make the sites available to local molecular signals. If this is to take place, the existence of a 'c-value paradox' becomes a requirement. Multiple binding sites for a given protein may recur in the form of a sequence 'motif' that is variable within certain limits. Noncoding sequences of the chicken ovalbumin gene furnish an appropriate example of a sequence motif, GAAAATT. Its improbably high frequency and significant periodicity are both absent from the coding sequences of the same gene and from the noncoding sequences of a differently controlled gene in the same organism, the preproinsulin gene. This distribution of a sequence motif is in keeping with the concepts outlined. Low specificity of sequences that bind protein is likely to be compatible with highly specific conformational changes. PMID:6789141
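
    The frequency and periodicity of a motif such as GAAAATT can be examined by locating every occurrence and then looking at the spacing between consecutive hits. The sketch below does this for exact matches only, whereas the abstract describes a motif that varies within limits; the function names and the toy sequence are hypothetical.

```python
def motif_positions(sequence: str, motif: str) -> list[int]:
    """0-based start positions of every exact (possibly overlapping) motif occurrence."""
    sequence, motif = sequence.upper(), motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def spacings(positions: list[int]) -> list[int]:
    """Distances between consecutive occurrences; repeated values suggest periodicity."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Hypothetical toy sequence with the motif repeated every 10 bases.
toy_sequence = "GAAAATTCCC" * 2 + "GAAAATT"
hits = motif_positions(toy_sequence, "GAAAATT")
print(hits, spacings(hits))
```

    Judging whether a frequency is improbably high would additionally require an expected count under a background base composition, which is not modelled here.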

  18. Using the Relevance Vector Machine Model Combined with Local Phase Quantization to Predict Protein-Protein Interactions from Protein Sequences.

    PubMed

    An, Ji-Yong; Meng, Fan-Rong; You, Zhu-Hong; Fang, Yu-Hong; Zhao, Yu-Jun; Zhang, Ming

    2016-01-01

    We propose a novel computational method known as RVM-LPQ that combines the Relevance Vector Machine (RVM) model and Local Phase Quantization (LPQ) to predict PPIs from protein sequences. The main improvements are the results of representing protein sequences using the LPQ feature representation on a Position Specific Scoring Matrix (PSSM), reducing the influence of noise using a Principal Component Analysis (PCA), and using a Relevance Vector Machine (RVM) based classifier. We perform 5-fold cross-validation experiments on Yeast and Human datasets, and we achieve very high accuracies of 92.65% and 97.62%, respectively, which is significantly better than previous works. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the Yeast dataset. The experimental results demonstrate that our RVM-LPQ method is obviously better than the SVM-based method. The promising experimental results show the efficiency and simplicity of the proposed method, which can be an automatic decision support tool for future proteomics research. PMID:27314023

  19. Using the Relevance Vector Machine Model Combined with Local Phase Quantization to Predict Protein-Protein Interactions from Protein Sequences

    PubMed Central

    An, Ji-Yong; Meng, Fan-Rong; You, Zhu-Hong; Fang, Yu-Hong; Zhao, Yu-Jun; Zhang, Ming

    2016-01-01

    We propose a novel computational method known as RVM-LPQ that combines the Relevance Vector Machine (RVM) model and Local Phase Quantization (LPQ) to predict PPIs from protein sequences. The main improvements are the results of representing protein sequences using the LPQ feature representation on a Position Specific Scoring Matrix (PSSM), reducing the influence of noise using a Principal Component Analysis (PCA), and using a Relevance Vector Machine (RVM) based classifier. We perform 5-fold cross-validation experiments on Yeast and Human datasets, and we achieve very high accuracies of 92.65% and 97.62%, respectively, which is significantly better than previous works. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the Yeast dataset. The experimental results demonstrate that our RVM-LPQ method is obviously better than the SVM-based method. The promising experimental results show the efficiency and simplicity of the proposed method, which can be an automatic decision support tool for future proteomics research. PMID:27314023

  20. Collecting, comparing, and computing sequences: the making of Margaret O. Dayhoff's Atlas of Protein Sequence and Structure, 1954-1965.

    PubMed

    Strasser, Bruno J

    2010-01-01

    Collecting, comparing, and computing molecular sequences are among the most prevalent practices in contemporary biological research. They represent a specific way of producing knowledge. This paper explores the historical development of these practices, focusing on the work of Margaret O. Dayhoff, Richard V. Eck, and Robert S. Ledley, who produced the first computer-based collection of protein sequences, published in book format in 1965 as the Atlas of Protein Sequence and Structure. While these practices are generally associated with the rise of molecular evolution in the 1960s, this paper shows that they grew out of research agendas from the previous decade, including the biochemical investigation of the relations between the structures and function of proteins and the theoretical attempt to decipher the genetic code. It also shows how computers became essential for the handling and analysis of sequence data. Finally, this paper reflects on the relationships between experimenting and collecting as two distinct "ways of knowing" that were essential for the transformation of the life sciences in the twentieth century. PMID:20665074

  1. Purification of a Zn-binding phloem protein with sequence identity to chitin-binding proteins.

    PubMed Central

    Taylor, K C; Albrigo, L G; Chase, C D

    1996-01-01

    In citrus blight, a decline disorder of unknown etiology, the tree canopy exhibits symptoms of Zn deficiency while Zn accumulates in the trunk phloem. We have purified a Zn-binding protein (ZBP) from phloem tissue of healthy and blight-affected citrus (Citrus sinensis [L.] Osbeck on Citrus jambhiri [L.]). The molecular weight of the ZBP was estimated to be 5000 by size-exclusion chromatography and sodium dodecyl sulfate-polyacrylamide gel electrophoresis. Ion-exchange chromatography at pH 8.0 demonstrated the 5-kD ZBP to be anionic. A partial N-terminal amino acid sequence revealed a cysteine-, glycine-rich domain with 45 to 80% identity with the chitin-binding domain of hevein, wheat germ agglutinin, and several class I chitinases. That the abundance of this protein increased 2.5-fold in association with Zn accumulation in the phloem is characteristic of citrus blight. Tissue mass changes of the phloem suggest that altered tissue structure accompanies blight. Phloem accumulation of the 5-kD ZBP may be in response to wounding or other stress of blight-affected citrus. PMID:8742339

  2. Discovery of active proteins directly from combinatorial randomized protein libraries without display, purification or sequencing: identification of novel zinc finger proteins

    PubMed Central

    Hughes, Marcus D.; Zhang, Zhan-Ren; Sutherland, Andrew J.; Santos, Albert F.; Hine, Anna V.

    2005-01-01

    We have successfully linked protein library screening directly with the identification of active proteins, without the need for individual purification, display technologies or physical linkage between the protein and its encoding sequence. By using ‘MAX’ randomization we have rapidly constructed 60 overlapping gene libraries that encode zinc finger proteins, randomized variously at the three principal DNA-contacting residues. Expression and screening of the libraries against five possible target DNA sequences generated data points covering a potential 40 000 individual interactions. Comparative analysis of the resulting data enabled direct identification of active proteins. Accuracy of this library analysis methodology was confirmed by both in vitro and in vivo analyses of identified proteins to yield novel zinc finger proteins that bind to their target sequences with high affinity, as indicated by low nanomolar apparent dissociation constants. PMID:15722478

  3. Mining and comparison of haplotype-based expressed sequence tag single nucleotide polymorphisms among citrus cultivars

    Technology Transfer Automated Retrieval System (TEKTRAN)

    In this paper, haplotype-based SNPs were mined out of publicly available citrus expressed sequence tags (ESTs) from different citrus cultivars (genotypes) individually and collectively for comparison. There were a total of 567,297 ESTs belonging to 27 cultivars in varying numbers and consequentially...

  4. Genomic sequence comparison of eif(iso)4E between Arabidopsis and melon

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Eukaryotic initiation factors (eifs) bind to mRNA and initiate translation in plants. Mutations in eifs condition recessively inherited virus resistances. While coding regions among eifs have been compared both within and among species, comparisons among flanking genomic sequences are lacking. We ...

  5. Comparison and quantitative verification of mapping algorithms for whole genome bisulfite sequencing

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Coupling bisulfite conversion with next-generation sequencing (Bisulfite-seq) enables genome-wide measurement of DNA methylation, but poses unique challenges for mapping. However, despite a proliferation of Bisulfite-seq mapping tools, no systematic comparison of their genomic coverage and quantitat...

  6. Prediction of Spontaneous Protein Deamidation from Sequence-Derived Secondary Structure and Intrinsic Disorder

    PubMed Central

    Lorenzo, J. Ramiro; Alonso, Leonardo G.; Sánchez, Ignacio E.

    2015-01-01

    Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage “Protein and nucleic acid structure and sequence analysis”. PMID:26674530

  7. EvoTol: a protein-sequence based evolutionary intolerance framework for disease-gene prioritization

    PubMed Central

    Rackham, Owen J. L.; Shihab, Hashem A.; Johnson, Michael R.; Petretto, Enrico

    2015-01-01

    Methods to interpret personal genome sequences are increasingly required. Here, we report a novel framework (EvoTol) to identify disease-causing genes using patient sequence data from within protein coding-regions. EvoTol quantifies a gene's intolerance to mutation using evolutionary conservation of protein sequences and can incorporate tissue-specific gene expression data. We apply this framework to the analysis of whole-exome sequence data in epilepsy and congenital heart disease, and demonstrate EvoTol's ability to identify known disease-causing genes is unmatched by competing methods. Application of EvoTol to the human interactome revealed networks enriched for genes intolerant to protein sequence variation, informing novel polygenic contributions to human disease. PMID:25550428

  8. eMatchSite: Sequence Order-Independent Structure Alignments of Ligand Binding Pockets in Protein Models

    PubMed Central

    Brylinski, Michal

    2014-01-01

    Detecting similarities between ligand binding sites in the absence of global homology between target proteins has been recognized as one of the critical components of modern drug discovery. Local binding site alignments can be constructed using sequence order-independent techniques; however, to achieve a high accuracy, many current algorithms for binding site comparison require high-quality experimental protein structures, preferably in the bound conformational state. This, in turn, complicates proteome scale applications, where only structure models of varying quality are available for the majority of gene products. To improve the state-of-the-art, we developed eMatchSite, a new method for constructing sequence order-independent alignments of ligand binding sites in protein models. Large-scale benchmarking calculations using adenine-binding pockets in crystal structures demonstrate that eMatchSite generates accurate alignments for almost three times more protein pairs than SOIPPA. More importantly, eMatchSite offers a high tolerance to structural distortions in ligand binding regions in protein models. For example, the percentage of correctly aligned pairs of adenine-binding sites in weakly homologous protein models is only 4–9% lower than those aligned using crystal structures. This represents a significant improvement over other algorithms, e.g. the performance of eMatchSite in recognizing similar binding sites is 6% and 13% higher than that of SiteEngine using high- and moderate-quality protein models, respectively. Constructing biologically correct alignments using predicted ligand binding sites in protein models opens up the possibility to investigate drug-protein interaction networks for complete proteomes with prospective systems-level applications in polypharmacology and rational drug repositioning. eMatchSite is freely available to the academic community as a web-server and a stand-alone software distribution at http://www.brylinski.org/ematchsite. PMID

  9. [Recombinant proteins containing amino acid sequences of two ectatomin chains].

    PubMed

    Esipov, R S; Gurevich, A I; Kaiushin, A L; Korosteleva, M D; Miroshnikov, A I; Shevchenko, L V; Pluzhnikov, K A; Grishin, E V

    1997-12-01

    Artificial genes for chains A and B of ectatomin, an Ectatomma tuberculatum ant toxin, were obtained by chemical and enzymic synthesis and cloned into new plasmid vectors. Expression plasmids with the genes of hybrid proteins were constructed containing human interleukin-3 or its terminal 63-mer fragment as well as chains A and B of ectatomin, which are linked via a region containing the cleavage site of specific protease, enterokinase (hybrid proteins IL3ETOXA, IL3ETOXB, ILETOXA, and ILETOXB). Escherichia coli producer strains providing a high yield of IL3ETOXA and IL3ETOXB proteins as inclusion bodies were obtained. PMID:9499370

  10. Truly Absorbed Microbial Protein Synthesis, Rumen Bypass Protein, Endogenous Protein, and Total Metabolizable Protein from Starchy and Protein-Rich Raw Materials: Model Comparison and Predictions.

    PubMed

    Parand, Ehsan; Vakili, Alireza; Mesgaran, Mohsen Danesh; van Duinkerken, Gert; Yu, Peiqiang

    2015-07-29

    This study was carried out to measure truly absorbed microbial protein synthesis, rumen bypass protein, and endogenous protein loss, as well as total metabolizable protein, from starchy and protein-rich raw feed materials with model comparisons. Predictions by the DVE2010 system as a more mechanistic model were compared with those of two other models, DVE1994 and NRC-2001, that are frequently used in common international feeding practice. DVE1994 predictions for intestinally digestible rumen undegradable protein (ARUP) for starchy concentrates were higher (27 vs 18 g/kg DM, p < 0.05, SEM = 1.2) than predictions by the NRC-2001, whereas there was no difference in predictions for ARUP from protein concentrates among the three models. DVE2010 and NRC-2001 had highest estimations of intestinally digestible microbial protein for starchy (92 g/kg DM in DVE2010 vs 46 g/kg DM in NRC-2001 and 67 g/kg DM in DVE1994, p < 0.05 SEM = 4) and protein concentrates (69 g/kg DM in NRC-2001 vs 31 g/kg DM in DVE1994 and 49 g/kg DM in DVE2010, p < 0.05 SEM = 4), respectively. Potential protein supplies predicted by tested models from starchy and protein concentrates are widely different, and comparable direct measurements are needed to evaluate the actual ability of different models to predict the potential protein supply to dairy cows from different feedstuffs. PMID:26118653

  11. A second rhodopsin-like protein in Cyanophora paradoxa: gene sequence and protein expression in a cell-free system.

    PubMed

    Frassanito, Anna Maria; Barsanti, Laura; Passarelli, Vincenzo; Evangelista, Valtere; Gualtieri, Paolo

    2013-08-01

    Here we report the identification and expression of a second rhodopsin-like protein in the alga Cyanophora paradoxa (Glaucophyta), named Cyanophopsin_2. This new protein was identified serendipitously: the RACE reaction performed to complete the sequence of Cyanophopsin_1 (the first rhodopsin-like protein of C. paradoxa, identified in 2009 by our group) amplified a 619 bp sequence corresponding to a portion of a new gene of the same protein family. The full sequence consists of 1175 bp, comprising 849 bp of coding DNA sequence and 4 introns totaling 326 bp. The protein is characterized by an N-terminal region of 47 amino acids, followed by a region with 7 α-helices of 213 amino acids and a C-terminal region of 22 amino acids. This protein showed high identity with Cyanophopsin_1 and other rhodopsin-like proteins of Archaea, Bacteria, Fungi and Algae. Cyanophopsin_2 (CpR2) was expressed in a cell-free expression system, and characterized by means of absorption spectroscopy. PMID:23851421

  12. Sequence-related human proteins cluster by degree of evolutionary conservation

    NASA Astrophysics Data System (ADS)

    Mrowka, Ralf; Patzak, Andreas; Herzel, Hanspeter; Holste, Dirk

    2004-11-01

    Gene duplication followed by adaptive evolution is thought to be a central mechanism for the emergence of novel genes. To illuminate the contribution of duplicated protein-coding sequences to the complexity of the human genome, we study the connectivity of pairwise sequence-related human proteins and construct a network (N) of linked protein sequences with shared similarities. We find that (i) the connectivity distribution P(k) for k sequence-related proteins decays as a power law P(k) ~ k^(-γ) with γ ≈ 1.2, (ii) the top rank of N consists of a single large cluster of proteins (≈70%), while bottom ranks consist of multiple isolated clusters, and (iii) structural characteristics of N show both a high degree of clustering and an intermediate connectivity (“small-world” features). We gain further insight into structural properties of N by studying the relationship between the connectivity distribution and the phylogenetic conservation of proteins in bacteria, plants, invertebrates, and vertebrates. We find that (iv) the proportion of sequence-related proteins increases with increasing extent of evolutionary conservation. Our results support that small-world network properties constitute a footprint of an evolutionary mechanism and extend the traditional interpretation of protein families.
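
    As a rough illustration of what a power-law connectivity distribution means in practice, the sketch below recovers the exponent from synthetic data by least squares on log-log axes. The fitting method and the function name are assumptions made for illustration; the abstract does not say how γ was estimated, and maximum-likelihood estimators are generally preferred for real degree data.

```python
import math


def fit_power_law_exponent(ks, pk):
    """Estimate gamma in P(k) ~ k^(-gamma) by ordinary least squares on log-log axes."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in pk]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx  # the slope of log P against log k is -gamma


# Synthetic, noise-free data generated with gamma = 1.2 (not the paper's data).
ks = list(range(1, 51))
pk = [k ** -1.2 for k in ks]
gamma = fit_power_law_exponent(ks, pk)
print(round(gamma, 3))
```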

  13. Beyond Linear Sequence Comparisons: The use of genome-level characters for phylogenetic reconstruction

    SciTech Connect

    Boore, Jeffrey L.

    2004-11-27

    Although the phylogenetic relationships of many organisms have been convincingly resolved by the comparisons of nucleotide or amino acid sequences, others have remained equivocal despite great effort. Now that large-scale genome sequencing projects are sampling many lineages, it is becoming feasible to compare large data sets of genome-level features and to develop this as a tool for phylogenetic reconstruction that has advantages over conventional sequence comparisons. Although it is unlikely that these will address a large number of evolutionary branch points across the broad tree of life due to the infeasibility of such sampling, they have great potential for convincingly resolving many critical, contested relationships for which no other data seems promising. However, it is important that we recognize potential pitfalls, establish reasonable standards for acceptance, and employ rigorous methodology to guard against a return to earlier days of scenario-driven evolutionary reconstructions.

  14. Improved detection of helix-turn-helix DNA-binding motifs in protein sequences.

    PubMed Central

    Dodd, I B; Egan, J B

    1990-01-01

    We present an update of our method for systematic detection and evaluation of potential helix-turn-helix DNA-binding motifs in protein sequences [Dodd, I. and Egan, J. B. (1987) J. Mol. Biol. 194, 557-564]. The new method is considerably more powerful, detecting approximately 50% more likely helix-turn-helix sequences without an increase in false predictions. This improvement is due almost entirely to the use of a much larger reference set of 91 presumed helix-turn-helix sequences. The scoring matrix derived from this reference set has been calibrated against a large protein sequence database so that the score obtained by a sequence can be used to give a practical estimation of the probability that the sequence is a helix-turn-helix motif. PMID:2402433
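
    The idea of deriving a scoring matrix from a reference set of aligned motifs and scanning a protein with it can be sketched as follows. This is a generic log-odds weight matrix under assumed conditions (uniform background, add-one pseudocounts, an invented 4-residue reference set); it is not the Dodd and Egan matrix or their calibration procedure.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_log_odds_matrix(aligned, pseudocount=1.0):
    """Build one {residue: log-odds} column per position from equal-length sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for i in range(len(aligned[0])):
        counts = Counter(seq[i] for seq in aligned)
        matrix.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def best_window(matrix, protein):
    """Score every window of the matrix width and return (best score, start index)."""
    width = len(matrix)
    scored = [
        (sum(matrix[j][protein[i + j]] for j in range(width)), i)
        for i in range(len(protein) - width + 1)
    ]
    return max(scored)


# Invented reference set and query, for illustration only.
reference = ["QTEL", "QSEL", "QTDL"]
matrix = build_log_odds_matrix(reference)
score, position = best_window(matrix, "AAAQTELAAA")
print(position, round(score, 2))
```

    Turning a raw score into a probability, as the abstract describes, would additionally require calibrating the score distribution against a large sequence database.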

  15. CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction

    PubMed Central

    Cui, Xuefeng; Lu, Zhiwu; Wang, Sheng; Jing-Yan Wang, Jim; Gao, Xin

    2016-01-01

    Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information. Method: We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence–structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration. Results: We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM–HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods. Availability and implementation: Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx. Contact: xin.gao@kaust.edu.sa Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27307635

  16. Ribosomal proteins and expressed sequence tags from Lysiphlebus testaceipes(Hymenoptera: Aphidiidae)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    A dataset containing 101 putative ribosomal protein (RP) sequences is provided for the aphid parasitoid, Lysiphlebus testaceipes. These data were obtained as a subset from a cDNA library constructed from adult L. testaceipes, and represent one of the largest complete sets of cytoplasmic RP sequence...

  17. Dr. Sanger's Apprentice: A Computer-Aided Instruction to Protein Sequencing.

    ERIC Educational Resources Information Center

    Schmidt, Thomas G.; Place, Allen R.

    1985-01-01

    Modeled after the program "Mastermind," this program teaches students the art of protein sequencing. The program (written in Turbo Pascal for the IBM PC, requiring 128K, a graphics adapter, and an 8087 mathematics coprocessor) generates a polypeptide whose sequence and length can be user-defined (for practice) or computer-generated (for grading).…

  18. Detection of Weakly Conserved Ancestral Mammalian Regulatory Sequences by Primate Comparisons

    SciTech Connect

    Wang, Qian-fei; Prabhakar, Shyam; Chanan, Sumita; Cheng, Jan-Fang; Rubin, Edward M.; Boffelli, Dario

    2006-06-01

    Genomic comparisons between human and distant, non-primate mammals are commonly used to identify cis-regulatory elements based on constrained sequence evolution. However, these methods fail to detect cryptic functional elements, which are too weakly conserved among mammals to distinguish from nonfunctional DNA. To address this problem, we explored the potential of deep intra-primate sequence comparisons. We sequenced the orthologs of 558 kb of human genomic sequence, covering multiple loci involved in cholesterol homeostasis, in 6 nonhuman primates. Our analysis identified 6 noncoding DNA elements displaying significant conservation among primates, but undetectable in more distant comparisons. In vitro and in vivo tests revealed that at least three of these 6 elements have regulatory function. Notably, the mouse orthologs of these three functional human sequences had regulatory activity despite their lack of significant sequence conservation, indicating that they are cryptic ancestral cis-regulatory elements. These regulatory elements could still be detected in a smaller set of three primate species including human, rhesus and marmoset. Since the human and rhesus genome sequences are already available, and the marmoset genome is actively being sequenced, the primate-specific conservation analysis described here can be applied in the near future on a whole-genome scale, to complement the annotation provided by more distant species comparisons.

  19. Ab initio detection of fuzzy amino acid tandem repeats in protein sequences

    PubMed Central

    2012-01-01

    Background: Tandem repetitions within protein amino acid sequences often correspond to regular secondary structures and form multi-repeat 3D assemblies of varied size and function. Developing internal repetitions is one of the evolutionary mechanisms that proteins employ to adapt their structure and function under evolutionary pressure. While there is keen interest in understanding such phenomena, detection of repeating structures based only on sequence analysis is considered an arduous task, since structure and function are often preserved even under considerable sequence divergence (fuzzy tandem repeats). Results: In this paper we present PTRStalker, a new algorithm for ab-initio detection of fuzzy tandem repeats in protein amino acid sequences. In the reported results we show that by feeding PTRStalker with amino acid sequences from the UniProtKB/Swiss-Prot database we detect novel tandemly repeated structures not captured by other state-of-the-art tools. Experiments with membrane proteins indicate that PTRStalker can detect global symmetries in the primary structure which are then reflected in the tertiary structure. Conclusions: PTRStalker is able to detect fuzzy tandem repeating structures in protein sequences, with performance beyond the current state of the art. Such a tool may be a valuable support to investigating protein structural properties when tertiary X-ray data are not available. PMID:22536906

  20. Optimal sequence selection in proteins of known structure by simulated evolution.

    PubMed Central

    Hellinga, H W; Richards, F M

    1994-01-01

    Rational design of protein structure requires the identification of optimal sequences to carry out a particular function within a given backbone structure. A general solution to this problem requires that a potential function describing the energy of the system as a function of its atomic coordinates be minimized simultaneously over all available sequences and their three-dimensional atomic configurations. Here we present a method that explicitly minimizes a semiempirical potential function simultaneously in these two spaces, using a simulated annealing approach. The method takes the fixed three-dimensional coordinates of a protein backbone and stochastically generates possible sequences through the introduction of random mutations. The corresponding three-dimensional coordinates are constructed for each sequence by "redecorating" the backbone coordinates of the original structure with the corresponding side chains. These are then allowed to vary in their structure by random rotations around free torsional angles to generate a stochastic walk in configurational space. We have named this method protein simulated evolution, because, in loose analogy with natural selection, it randomly selects for allowed solutions in the sequence of a protein subject to the "selective pressure" of a potential function. Energies predicted by this method for sequences of a small group of residues in the hydrophobic core of the phage lambda cI repressor correlate well with experimentally determined biological activities. This "genetic selection by computer" approach has potential applications in protein engineering, rational protein design, and structure-based drug discovery. PMID:8016069
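
    The sequence half of the search described here (random mutations accepted or rejected under a cooling temperature) can be sketched with a generic simulated annealing loop. Everything below is an assumption for illustration: the toy energy simply counts non-hydrophobic residues, the cooling schedule is geometric, and the side-chain torsion sampling of the published method is omitted entirely.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AILMFVW")  # hypothetical "core-friendly" residues


def toy_core_energy(residues):
    """Toy stand-in for a potential function: one unit per non-hydrophobic residue."""
    return sum(1 for aa in residues if aa not in HYDROPHOBIC)


def simulated_evolution(start, alphabet, energy, steps=2000,
                        t_start=2.0, t_end=0.05, seed=0):
    """Anneal over sequence space with single-site mutations (Metropolis criterion)."""
    rng = random.Random(seed)
    current = list(start)
    e_cur = energy(current)
    best, e_best = current[:], e_cur
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = current[:]
        candidate[rng.randrange(len(candidate))] = rng.choice(alphabet)
        e_cand = energy(candidate)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_cand <= e_cur or rng.random() < math.exp((e_cur - e_cand) / temperature):
            current, e_cur = candidate, e_cand
            if e_cur < e_best:
                best, e_best = current[:], e_cur
    return "".join(best), e_best


best_seq, best_e = simulated_evolution("DEKRDEKR", AMINO_ACIDS, toy_core_energy)
print(best_seq, best_e)
```

    In the method summarized above, each candidate sequence would also have its side chains rebuilt on the fixed backbone and sampled over torsion angles before the energy is evaluated.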

  1. Use of synthetic signal sequences to explore the protein export machinery.

    PubMed

    Clérico, Eugenia M; Maki, Jenny L; Gierasch, Lila M

    2008-01-01

    The information for correct localization of newly synthesized proteins in both prokaryotes and eukaryotes resides in self-contained, often transportable targeting sequences. Of these, signal sequences specify that a protein should be secreted from a cell or incorporated into the cytoplasmic membrane. A central puzzle is presented by the lack of primary structural homology among signal sequences, although they share common features in their sequences. Synthetic signal peptides have enabled a wide range of studies of how these "zipcodes" for protein secretion are decoded and used to target proteins to the protein machinery that facilitates their translocation across and integration into membranes. We review research on how the information in signal sequences enables their passenger proteins to be correctly and efficiently localized. Synthetic signal peptides have made possible binding and crosslinking studies to explore how selectivity is achieved in recognition by the signal sequence-binding receptors, signal recognition particle, or SRP, which functions in all organisms, and SecA, which functions in prokaryotes and some organelles of prokaryotic origins. While progress has been made, the absence of atomic resolution structures for complexes of signal peptides and their receptors has definitely left many questions to be answered in the future. PMID:17918185

  2. Nucleotide sequence of the tcml gene (ribosomal protein L3) of Saccharomyces cerevisiae.

    PubMed Central

    Schultz, L D; Friesen, J D

    1983-01-01

    The yeast tcml gene, which codes for ribosomal protein L3, has been isolated by using recombinant DNA and genetic complementation. The DNA fragment carrying this gene has been subcloned and we have determined its DNA sequence. The 20 amino acid residues at the amino terminus as inferred from the nucleotide sequence agreed exactly with the amino acid sequence data. The amino acid composition of the encoded protein agreed with that determined for purified ribosomal protein L3. Codon usage in the tcml gene was strongly biased in the direction found for several other abundant Saccharomyces cerevisiae proteins. The tcml gene has no introns, which appears to be atypical of ribosomal protein structural genes. PMID:6305925

  3. Nonlinear signal analysis to understand the dynamics of the protein sequences

    NASA Astrophysics Data System (ADS)

    Angadi, S.; Kulkarni, A.

    2008-10-01

    Recurrence plots are a useful tool for qualitatively identifying structure in a data set in a time-resolved way. Recurrence plots and their quantification have become an important research tool in the analysis of nonlinear dynamical systems. In the present work, we use the recurrence property to study protein sequences. The sequences that we analyze belong to two distinct classes, viz., soluble proteins and proteins that form inclusion bodies when overexpressed in Escherichia coli. We use the Kyte-Doolittle hydrophobicity scale in the analysis. We study the underlying dynamics and extract the information that encodes the class of a protein, using simple features based on statistical and global characteristics as well as more advanced features based on recurrence quantification. The extracted features are used for probability estimation with the Gaussian Process Classification technique. The results give meaningful insight into protein sequence dynamics.
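
    A recurrence matrix over a hydrophobicity series can be sketched as below. The sketch assumes no time-delay embedding (each residue is compared directly by its Kyte-Doolittle value) and an arbitrary threshold `eps`; the abstract does not specify either choice, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_series(seq):
    """Map a protein sequence to its series of hydropathy values."""
    return [KYTE_DOOLITTLE[aa] for aa in seq]


def recurrence_matrix(series, eps):
    """R[i][j] = 1 when positions i and j have values within eps of each other."""
    return [[1 if abs(a - b) <= eps else 0 for b in series] for a in series]


def recurrence_rate(matrix):
    """Fraction of recurrent points, the simplest recurrence quantification measure."""
    n = len(matrix)
    return sum(map(sum, matrix)) / (n * n)


# Invented toy peptide, for illustration only.
series = hydrophobicity_series("MKVLAAGIV")
rate = recurrence_rate(recurrence_matrix(series, eps=1.0))
print(round(rate, 3))
```

    Features such as the recurrence rate could then be fed to a classifier; the Gaussian Process Classification step mentioned in the abstract is not shown.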

  4. Homology analyses of the protein sequences of fatty acid synthases from chicken liver, rat mammary gland, and yeast

    SciTech Connect

    Chang, Soo-Ik; Hammes, G.G.

    1989-11-01

    Homology analyses of the protein sequences of chicken liver and rat mammary gland fatty acid synthases were carried out. The amino acid sequences of the chicken and rat enzymes are 67% identical. If conservative substitutions are allowed, 78% of the amino acids are matched. A region of low homology exists between the functional domains, in particular around amino acid residues 1059-1264 of the chicken enzyme. Homologies between the active sites of chicken and rat and of chicken and yeast enzymes have been analyzed by an alignment method. A high degree of homology exists between the active sites of the chicken and rat enzymes. However, the chicken and yeast enzymes show a lower degree of homology. The NADPH-binding dinucleotide folds of the β-ketoacyl reductase and the enoyl reductase sites were identified by comparison with a known consensus sequence for the NADP- and FAD-binding dinucleotide folds. The active sites of all of the enzymes are primarily in hydrophobic regions of the protein. This study suggests that the genes for the functional domains of fatty acid synthase were originally separated, and these genes were connected to each other by using different connecting nucleotide sequences in different species. An alternative explanation for the differences in rat and chicken is a common ancestry and mutations in the joining regions during evolution.
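
    The two figures quoted in this abstract (percent identical, and percent matched when conservative substitutions are allowed) correspond to a simple calculation over an existing alignment. The sketch below assumes a pre-aligned pair with "-" for gaps, ignores gapped columns, and uses a hypothetical grouping of conservative residues; the grouping the authors used is not stated.

```python
# Hypothetical conservative-substitution groups, chosen for illustration only.
CONSERVATIVE_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def _group(aa):
    """Return the group containing aa, or aa itself if it belongs to no group."""
    for group in CONSERVATIVE_GROUPS:
        if aa in group:
            return group
    return aa


def identity_and_similarity(aligned_a, aligned_b):
    """Return (% identical, % identical or conservative) over ungapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    identical = sum(x == y for x, y in columns)
    similar = sum(_group(x) == _group(y) for x, y in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


# Invented toy alignment: three identical columns and one D/E substitution.
print(identity_and_similarity("ACDE", "ACEE"))
```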

  5. Draft versus finished sequence data for DNA and protein diagnostic signature development

    SciTech Connect

    Gardner, S N; Lam, M W; Smith, J R; Torres, C L; Slezak, T R

    2004-10-29

    Sequencing pathogen genomes is costly, demanding careful allocation of limited sequencing resources. We built a computational Sequencing Analysis Pipeline (SAP) to guide decisions regarding the amount of genomic sequencing necessary to develop high-quality diagnostic DNA and protein signatures. SAP uses simulations to estimate the number of target genomes and close phylogenetic relatives (near neighbors, or NNs) to sequence. We use SAP to assess whether draft data is sufficient or finished sequencing is required using Marburg and variola virus sequences. Simulations indicate that intermediate to high quality draft with error rates of 10^-3 to 10^-5 (≈8x coverage) of target organisms is suitable for DNA signature prediction. Low quality draft with error rates of ≈1% (3x to 6x coverage) of target isolates is inadequate for DNA signature prediction, although low quality draft of NNs is sufficient, as long as the target genomes are of high quality. For protein signature prediction, sequencing errors in target genomes substantially reduce the detection of amino acid sequence conservation, even if the draft is of high quality. In summary, high quality draft of target and low quality draft of NNs appears to be a cost-effective investment for DNA signature prediction, but may lead to underestimation of predicted protein signatures.

  6. Importance Sampling of Word Patterns in DNA and Protein Sequences

    PubMed Central

    Chan, Hock Peng; Chen, Louis H.Y.

    2010-01-01

    Monte Carlo methods can provide accurate p-value estimates of word counting test statistics and are easy to implement. They are especially attractive when an asymptotic theory is absent or when either the search sequence or the word pattern is too short for the application of asymptotic formulae. Naive direct Monte Carlo is undesirable for the estimation of small probabilities because the associated rare events of interest are seldom generated. We propose instead efficient importance sampling algorithms that use controlled insertion of the desired word patterns on randomly generated sequences. The implementation is illustrated on word patterns of biological interest: palindromes and inverted repeats, patterns arising from position-specific weight matrices (PSWMs), and co-occurrences of pairs of motifs. PMID:21128856

  7. A Primary Sequence Analysis of the ARGONAUTE Protein Family in Plants

    PubMed Central

    Rodríguez-Leal, Daniel; Castillo-Cobián, Amanda; Rodríguez-Arévalo, Isaac; Vielle-Calzada, Jean-Philippe

    2016-01-01

    Small RNA (sRNA)-mediated gene silencing represents a conserved regulatory mechanism controlling a wide diversity of developmental processes through interactions of sRNAs with proteins of the ARGONAUTE (AGO) family. On the basis of a large phylogenetic analysis that includes 206 AGO genes belonging to 23 plant species, AGO genes group into four clades corresponding to the phylogenetic distribution proposed for the ten family members of Arabidopsis thaliana. A primary analysis of the corresponding protein sequences resulted in 50 sequences of amino acids (blocks) conserved across their linear length. Protein members of the AGO4/6/8/9 and AGO1/10 clades are more conserved than members of the AGO5 and AGO2/3/7 clades. In addition to blocks containing components of the PIWI, PAZ, and DUF1785 domains, members of the AGO2/3/7 and AGO4/6/8/9 clades possess other consensus block sequences that are exclusive of members within these clades, suggesting unforeseen functional specialization revealed by their primary sequence. We also show that AGO proteins of animal and plant kingdoms share linear sequences of blocks that include motifs involved in posttranslational modifications such as those regulating AGO2 in humans and the PIWI protein AUBERGINE in Drosophila. Our results open possibilities for exploring new structural and functional aspects related to the evolution of AGO proteins within the plant kingdom, and their convergence with analogous proteins in mammals and invertebrates.

  8. Adhesive Proteins of Stalked and Acorn Barnacles Display Homology with Low Sequence Similarities

    PubMed Central

    Jonker, Jaimie-Leigh; Abram, Florence; Pires, Elisabete; Varela Coelho, Ana; Grunwald, Ingo; Power, Anne Marie

    2014-01-01

    Barnacle adhesion underwater is an important phenomenon to understand for the prevention of biofouling and potential biotechnological innovations, yet so far, identifying what makes barnacle glue proteins ‘sticky’ has proved elusive. Examination of a broad range of species within the barnacles may be instructive to identify conserved adhesive domains. We add to extensive information from the acorn barnacles (order Sessilia) by providing the first protein analysis of a stalked barnacle adhesive, Lepas anatifera (order Lepadiformes). It was possible to separate the L. anatifera adhesive into at least 10 protein bands using SDS-PAGE. Intense bands were present at approximately 30, 70, 90 and 110 kilodaltons (kDa). Mass spectrometry for protein identification was followed by de novo sequencing which detected 52 peptides of 7–16 amino acids in length. None of the peptides matched published or unpublished transcriptome sequences, but some amino acid sequence similarity was apparent between L. anatifera and closely-related Dosima fascicularis. Antibodies against two acorn barnacle proteins (ab-cp-52k and ab-cp-68k) showed cross-reactivity in the adhesive glands of L. anatifera. We also analysed the similarity of adhesive proteins across several barnacle taxa, including Pollicipes pollicipes (a stalked barnacle in the order Scalpelliformes). Sequence alignment of published expressed sequence tags clearly indicated that P. pollicipes possesses homologues for the 19 kDa and 100 kDa proteins in acorn barnacles. Homology aside, sequence similarity in amino acid and gene sequences tended to decline as taxonomic distance increased, with minimum similarities of 18–26%, depending on the gene. The results indicate that some adhesive proteins (e.g. 100 kDa) are more conserved within barnacles than others (20 kDa). PMID:25295513

  9. Comparison of Dixon Sequences for Estimation of Percent Breast Fibroglandular Tissue

    PubMed Central

    Ledger, Araminta E. W.; Scurr, Erica D.; Hughes, Julie; Macdonald, Alison; Wallace, Toni; Thomas, Karen; Wilson, Robin; Leach, Martin O.; Schmidt, Maria A.

    2016-01-01

    Objectives: To evaluate sources of error in the Magnetic Resonance Imaging (MRI) measurement of percent fibroglandular tissue (%FGT) using two-point Dixon sequences for fat-water separation. Methods: Ten female volunteers (median age: 31 yrs, range: 23–50 yrs) gave informed consent following Research Ethics Committee approval. Each volunteer was scanned twice following repositioning to enable an estimation of measurement repeatability from high-resolution gradient-echo (GRE) proton-density (PD)-weighted Dixon sequences. Differences in measures of %FGT attributable to resolution, T1 weighting and sequence type were assessed by comparison of this Dixon sequence with low-resolution GRE PD-weighted Dixon data, and against gradient-echo (GRE) or spin-echo (SE) based T1-weighted Dixon datasets, respectively. Results: %FGT measurement from high-resolution PD-weighted Dixon sequences had a coefficient of repeatability of ±4.3%. There was no significant difference in %FGT between high-resolution and low-resolution PD-weighted data. Values of %FGT from GRE and SE T1-weighted data were strongly correlated with that derived from PD-weighted data (r = 0.995 and 0.96, respectively). However, both sequences exhibited higher mean %FGT by 2.9% (p < 0.0001) and 12.6% (p < 0.0001), respectively, in comparison with PD-weighted data; the increase in %FGT from the SE T1-weighted sequence was significantly larger at lower breast densities. Conclusion: Although measurement of %FGT at low resolution is feasible, T1 weighting and sequence type impact on the accuracy of Dixon-based %FGT measurements; Dixon MRI protocols for %FGT measurement should be carefully considered, particularly for longitudinal or multi-centre studies. PMID:27011312

  10. Extraction of high quality k-words for alignment-free sequence comparison.

    PubMed

    Gunasinghe, Upuli; Alahakoon, Damminda; Bedingfield, Susan

    2014-10-01

    The weighted Euclidean distance (D(2)) is one of the earliest dissimilarity measures used for alignment free comparison of biological sequences. This distance measure and its variants have been used in numerous applications due to its fast computation, and many variants of it have been subsequently introduced. The D(2) distance measure is based on the count of k-words in the two sequences that are compared. Traditionally, all k-words are compared when computing the distance. In this paper we show that similar accuracy in sequence comparison can be achieved by using a selected subset of k-words. We introduce a term variance based quality measure for identifying the important k-words. We demonstrate the application of the proposed technique in phylogeny reconstruction and show that up to 99% of the k-words can be filtered out for certain datasets, resulting in faster sequence comparison. The paper also presents an exploratory analysis based evaluation of optimal k-word values and discusses the impact of using subsets of k-words in such optimal instances. PMID:24846728
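
    The k-word approach described in this abstract can be illustrated with a small sketch: count overlapping k-words in each sequence, take the Euclidean distance between the count vectors, and optionally restrict the comparison to a chosen subset of words. Several parts are assumptions made for illustration: all weights are set to 1, the function names are hypothetical, and ranking words by the variance of their counts across sequences is only a stand-in for the paper's "term variance based quality measure", which the abstract does not define.

```python
import math
from collections import Counter


def kword_counts(seq, k):
    """Count overlapping k-words in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kword_distance(seq_a, seq_b, k, words=None):
    """Euclidean distance between k-word count vectors (unit weights assumed).

    If `words` is given, only those k-words contribute to the distance.
    """
    counts_a, counts_b = kword_counts(seq_a, k), kword_counts(seq_b, k)
    vocab = set(counts_a) | set(counts_b) if words is None else set(words)
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocab))


def top_variance_words(seqs, k, keep):
    """Pick the `keep` k-words whose counts vary most across the sequences."""
    counts = [kword_counts(s, k) for s in seqs]
    vocab = sorted(set().union(*counts))

    def variance(word):
        values = [c[word] for c in counts]
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    return sorted(vocab, key=variance, reverse=True)[:keep]
```

    Restricting `words` shrinks the vocabulary over which the sum runs, which is the intuition behind the speed-up the abstract reports when most k-words are filtered out.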

  11. Sequence-based feature prediction and annotation of proteins

    PubMed Central

    Juncker, Agnieszka S; Jensen, Lars J; Pierleoni, Andrea; Bernsel, Andreas; Tress, Michael L; Bork, Peer; von Heijne, Gunnar; Valencia, Alfonso; Ouzounis, Christos A; Casadio, Rita; Brunak, Søren

    2009-01-01

    A recent trend in computational methods for annotation of protein function is that many prediction tools are combined in complex workflows and pipelines to facilitate the analysis of feature combinations, for example, the entire repertoire of kinase-binding motifs in the human proteome. PMID:19226438

  12. N-terminal sequence of some ribosome-inactivating proteins.

    PubMed

    Montecucchi, P C; Lazzarini, A M; Barbieri, L; Stirpe, F; Soria, M; Lappi, D

    1989-04-01

    The N-terminal sequences of some type 1 ribosome-inactivating proteins (RIPs) isolated from the seeds of Gelonium multiflorum, Momordica charantia, Bryonia dioica, Saponaria officinalis and from the leaves of Saponaria officinalis are reported in the present paper. Their relationship with other RIPs is discussed. PMID:2753596

  13. Relating sequence encoded information to form and function of intrinsically disordered proteins

    PubMed Central

    Das, Rahul K.; Ruff, Kiersten M.; Pappu, Rohit V.

    2015-01-01

    Intrinsically disordered proteins (IDPs) showcase the importance of conformational plasticity and heterogeneity in protein function. We summarize recent advances that connect information encoded in IDP sequences to their conformational properties and functions. We focus on insights obtained through a combination of atomistic simulations and biophysical measurements that are synthesized into a coherent framework using polymer physics theories. PMID:25863585

  14. Protein evolution analysis of S-hydroxynitrile lyase by complete sequence design utilizing the INTMSAlign software

    PubMed Central

    Nakano, Shogo; Asano, Yasuhisa

    2015-01-01

    Development of software and methods for design of complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software, and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that accumulation of neutral mutations occurs during the process of S-HNLs evolution from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs. PMID:25645341

  15. Protein evolution analysis of S-hydroxynitrile lyase by complete sequence design utilizing the INTMSAlign software.

    PubMed

    Nakano, Shogo; Asano, Yasuhisa

    2015-01-01

    Development of software and methods for design of complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software, and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that accumulation of neutral mutations occurs during the process of S-HNLs evolution from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs. PMID:25645341

  16. Protein evolution analysis of S-hydroxynitrile lyase by complete sequence design utilizing the INTMSAlign software

    NASA Astrophysics Data System (ADS)

    Nakano, Shogo; Asano, Yasuhisa

    2015-02-01

    Development of software and methods for design of complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software, and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that accumulation of neutral mutations occurs during the process of S-HNLs evolution from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs.

  17. Preparative Protein Production from Inclusion Bodies and Crystallization: A Seven-Week Biochemistry Sequence

    ERIC Educational Resources Information Center

    Peterson, Megan J.; Snyder, W. Kalani; Westerman, Shelley; McFarland, Benjamin J.

    2011-01-01

    We describe how to produce and purify proteins from "Escherichia coli" inclusion bodies by adapting versatile, preparative-scale techniques to the undergraduate laboratory schedule. This 7-week sequence of experiments fits into an annual cycle of research activity in biochemistry courses. Recombinant proteins are expressed as inclusion bodies,…

  18. Elucidation of the sequence of canine (pro)-calcitonin. A molecular biological and protein chemical approach.

    PubMed

    Mol, J A; Kwant, M M; Arnold, I C; Hazewinkel, H A

    1991-09-01

    From the canine thyroid gland a calcitonin (CT) immunoreactive peptide was purified by successive aqueous acid acetone extraction, gel filtration and HPLC. Gas-phase sequencing of the purified peptide showed that the first 25 amino acids had 65% sequence homology with the amino-terminus of the human CT prohormone. A canine cDNA library was then made from the thyroid gland. A plasmid was isolated containing a sequence that is homologous to part of exon 3, and the complete sequence of exon 4 of the human mRNA encoding preproCT. From this cDNA the amino acid sequence of canine CT is predicted. In comparison with well-known CT sequences of other species, the strongest homology exists with bovine, porcine and ovine CT. PMID:1758974

  19. Effect of single-point sequence alterations on the aggregation propensity of a model protein

    SciTech Connect

    Bratko, Dusan; Cellmer, Troy; Prausnitz, John M.; Blanch, Harvey W.

    2005-10-07

    Sequences of contemporary proteins are believed to have evolved through a process that optimized their overall fitness, including their resistance to deleterious aggregation. Biotechnological processing may expose therapeutic proteins to conditions that are much more conducive to aggregation than those encountered in a cellular environment. An important task of protein engineering is to identify alternative sequences that would protect proteins when processed at high concentrations without altering their native structure associated with specific biological function. Our computational studies exploit parallel tempering simulations of coarse-grained model proteins to demonstrate that isolated amino-acid residue substitutions can result in significant changes in the aggregation resistance of the protein in a crowded environment while retaining protein structure in isolation. A thermodynamic analysis of protein clusters subject to competing processes of folding and association shows that moderate mutations can produce effects similar to those caused by changes in system conditions, including temperature, concentration, and solvent composition, that affect the aggregation propensity. The range of conditions where a protein can resist aggregation can therefore be tuned by sequence alterations, although the protein generally may retain its generic ability for aggregation.

  20. A Novel Cylindrical Representation for Characterizing Intrinsic Properties of Protein Sequences.

    PubMed

    Yu, Jia-Feng; Dou, Xiang-Hua; Wang, Hong-Bo; Sun, Xiao; Zhao, Hui-Ying; Wang, Ji-Hua

    2015-06-22

    The composition and sequence order of amino acid residues are the two most important characteristics to describe a protein sequence. Graphical representations facilitate visualization of biological sequences and produce biologically useful numerical descriptors. In this paper, we propose a novel cylindrical representation by placing the 20 amino acid residue types in a circle and sequence positions along the z axis. This representation allows visualization of the composition and sequence order of amino acids at the same time. Ten numerical descriptors and one weighted numerical descriptor have been developed to quantitatively describe intrinsic properties of protein sequences on the basis of the cylindrical model. Their applications to similarity/dissimilarity analysis of nine ND5 proteins indicated that these numerical descriptors are more effective than several classical numerical matrices. Thus, the cylindrical representation obtained here provides a new useful tool for visualizing and characterizing protein sequences. An online server is available at http://biophy.dzu.edu.cn:8080/CNumD/input.jsp. PMID:25945398
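
    The geometric construction in this abstract (20 residue types on a circle, sequence position along the z axis) can be sketched in a few lines. The ordering of residues around the circle, the unit radius, and the function name are assumptions for illustration; the abstract does not give the authors' ordering, and their ten numerical descriptors are not reproduced here.

```python
import math

# Assumed ordering: alphabetical by one-letter code (not taken from the paper).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cylindrical_coordinates(seq, radius=1.0, order=AMINO_ACIDS):
    """Map each residue to (x, y, z) on a cylinder.

    The angle is fixed by the residue type; z is the 1-based sequence position.
    """
    step = 2.0 * math.pi / len(order)
    angle = {aa: i * step for i, aa in enumerate(order)}
    points = []
    for z, aa in enumerate(seq.upper(), start=1):
        theta = angle[aa]  # KeyError for residues outside `order`
        points.append((radius * math.cos(theta), radius * math.sin(theta), float(z)))
    return points
```

    Two sequences with the same composition but different residue order visit the same set of angles at different heights, which is how a curve of this kind shows composition and order at the same time.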

  1. Cloning and sequence of the gene for heat shock protein 60 from Chlamydia trachomatis and immunological reactivity of the protein.

    PubMed Central

    Cerrone, M C; Ma, J J; Stephens, R S

    1991-01-01

    We isolated and sequenced the gene for the chlamydial heat shock protein 60 (HSP-60) from a Chlamydia trachomatis genomic library by molecular genetic methods. The DNA sequence derived revealed an operon-like gene structure with two open reading frames encoding an 11,122- and a 57,956-Da protein. The translated amino acid sequence of the larger open reading frame showed a high degree of homology with known sequences for HSP-60 from several bacterial species as well as with plant and human sequences. By using the determined nucleotide sequence, fragments of the gene were cloned into the plasmid vector pGEX for expression as fusion proteins consisting of glutathione S-transferase and peptide portions of the chlamydial HSP-60. HSP-60 antigenic identity was confirmed by an immunoblot with anti-HSP-60 rabbit serum. Sera from patients that exhibited both high antichlamydial titers and reactivity to chlamydial HSP-60 showed reactivity on immunoblots to two fusion proteins that represented portions of the carboxyl-terminal half of the molecule, whereas fusion proteins defining the amino-terminal half were nonreactive. No reactivity with the fusion proteins was seen with sera from patients that had been previously screened as nonreactive to native chlamydial HSP-60 but which had high antichlamydial titers. Sera from noninfected control subjects also exhibited no reactivity. Definition of recognized HSP-60 epitopes may provide a predictive screen for those patients with C. trachomatis infections who may develop damaging sequelae, as well as providing tools for the study of immunopathogenic mechanisms of Chlamydia-induced disease. PMID:1987066

  2. Expanding the nitrogen regulatory protein superfamily: Homology detection at below random sequence identity.

    PubMed

    Kinch, Lisa N; Grishin, Nick V

    2002-07-01

    Nitrogen regulatory (PII) proteins are signal transduction molecules involved in controlling nitrogen metabolism in prokaryotes. PII proteins integrate the signals of intracellular nitrogen and carbon status into the control of enzymes involved in nitrogen assimilation. Using elaborate sequence similarity detection schemes, we show that five clusters of orthologs (COGs) and several small divergent protein groups belong to the PII superfamily and predict their structure to be a (βαβ)2 ferredoxin-like fold. Proteins from the newly emerged PII superfamily are present in all major phylogenetic lineages. The PII homologs are quite diverse, with below random (as low as 1%) pairwise sequence identities between some members of distant groups. Despite this sequence diversity, evidence suggests that the different subfamilies retain the PII trimeric structure important for ligand-binding site formation and maintain a conservation of conservations at residue positions important for PII function. Because most of the orthologous groups within the PII superfamily are composed entirely of hypothetical proteins, our remote homology-based structure prediction provides the only information about them. Analogous to structural genomics efforts, such prediction gives clues to the biological roles of these proteins and allows us to hypothesize about locations of functional sites on model structures or rationalize about available experimental information. For instance, conserved residues in one of the families map in close proximity to each other on PII structure, allowing for a possible metal-binding site in the proteins coded by the locus known to affect sensitivity to divalent metal ions. Presented analysis pushes the limits of sequence similarity searches and exemplifies one of the extreme cases of reliable sequence-based structure prediction. In conjunction with structural genomics efforts to shed light on protein function, our strategies make it possible to detect

  3. Sequence and structural implications of a bovine corneal keratan sulfate proteoglycan core protein. Protein 37B represents bovine lumican and proteins 37A and 25 are unique

    NASA Technical Reports Server (NTRS)

    Funderburgh, J. L.; Funderburgh, M. L.; Brown, S. J.; Vergnes, J. P.; Hassell, J. R.; Mann, M. M.; Conrad, G. W.; Spooner, B. S. (Principal Investigator)

    1993-01-01

    Amino acid sequence from tryptic peptides of three different bovine corneal keratan sulfate proteoglycan (KSPG) core proteins (designated 37A, 37B, and 25) showed similarities to the sequence of a chicken KSPG core protein lumican. Bovine lumican cDNA was isolated from a bovine corneal expression library by screening with chicken lumican cDNA. The bovine cDNA codes for a 342-amino acid protein, Mr 38,712, containing amino acid sequences identified in the 37B KSPG core protein. The bovine lumican is 68% identical to chicken lumican, with an 83% identity excluding the N-terminal 40 amino acids. The locations of 6 cysteine and 4 consensus N-glycosylation sites in the bovine sequence were identical to those in chicken lumican. Bovine lumican had about 50% identity to bovine fibromodulin and 20% identity to bovine decorin and biglycan. About two-thirds of the lumican protein consists of a series of 10 amino acid leucine-rich repeats that occur in regions of calculated high beta-hydrophobic moment, suggesting that the leucine-rich repeats contribute to beta-sheet formation in these proteins. Sequences obtained from 37A and 25 core proteins were absent in bovine lumican, thus predicting a unique primary structure and separate mRNA for each of the three bovine KSPG core proteins.

  4. ESPript/ENDscript: extracting and rendering sequence and 3D information from atomic structures of proteins

    PubMed Central

    Gouet, Patrice; Robert, Xavier; Courcelle, Emmanuel

    2003-01-01

    The Fortran program ESPript was created in 1993 to display multiple sequence alignments adorned with secondary structure elements in a PostScript figure. A web server was made available in 1999 and ESPript has been linked to three major web tools: ProDom, which identifies protein domains; PredictProtein, which predicts secondary structure elements; and NPS@, which runs sequence alignment programs. A web server named ENDscript was created in 2002 to facilitate the generation of ESPript figures containing a large amount of information. ENDscript uses programs such as BLAST, Clustal and PHYLODENDRON to work on protein sequences, and programs such as DSSP, CNS and MOLSCRIPT to work on protein coordinates. It enables the creation, from a single Protein Data Bank identifier, of a multiple sequence alignment figure adorned with secondary structure elements of each sequence of known 3D structure. Similar 3D structures are superimposed in turn with the program PROFIT and a final figure is drawn with BOBSCRIPT, which shows sequence and structure conservation along the Cα trace of the query. ESPript and ENDscript are available at http://genopole.toulouse.inra.fr/ESPript. PMID:12824317

  5. M2SG: mapping human disease-related genetic variants to protein sequences and genomic loci

    PubMed Central

    Ji, Renkai; Cong, Qian; Li, Wenlin; Grishin, Nick V.

    2013-01-01

    Summary: Online Mendelian Inheritance in Man (OMIM) is a manually curated compendium of human genetic variants and the corresponding phenotypes, mostly human diseases. Instead of directly documenting the native sequences for gene entries, OMIM links its entries to protein and DNA sequences in other databases. However, because of the existence of gene isoforms and errors in OMIM records, mapping a specific OMIM mutation to its corresponding protein sequence is not trivial. Combining computer programs and extensive manual curation of OMIM full-text descriptions and original literature, we mapped 98% of OMIM amino acid substitutions (AASs) and all SwissProt Variant (SwissVar) disease-related AASs to reference sequences and confidently mapped 99.96% of all AASs to the genomic loci. Based on the results, we developed an online database and interactive web server (M2SG) to (i) retrieve the mapped OMIM and SwissVar variants for a given protein sequence; and (ii) obtain related proteins and mutations for an input disease phenotype. This database will be useful for analyzing sequences, understanding the effect of mutations, identifying important genetic variations and designing experiments on a protein of interest. Availability and implementation: The database and web server are freely available at http://prodata.swmed.edu/M2S/mut2seq.cgi. Contact: grishin@chop.swmed.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24002112

  6. A Multiple-Sequence Variant of the Multiple-Baseline Design: A Strategy for Analysis of Sequence Effects and Treatment Comparison.

    ERIC Educational Resources Information Center

    Noell, George H.; Gresham, Frank M.

    2001-01-01

    Describes design logic and potential uses of a variant of the multiple-baseline design. The multiple-baseline multiple-sequence (MBL-MS) consists of multiple-baseline designs that are interlaced with one another and include all possible sequences of treatments. The MBL-MS design appears to be primarily useful for comparison of treatments taking…

  7. Protein design by optimization of a sequence-structure quality function.

    PubMed

    Brenner, S E; Berry, A

    1994-01-01

    An automated procedure for protein design by optimization of a sequence-structure quality has been developed. The method selects a statistically optimal sequence for a particular structure, on the assumption that such a protein will adopt the desired structure. We present two optimization algorithms: one provides an exact optimization while the other uses a combinatorial technique for comparatively rapid results. Both are suitable for massively parallel computers. A prototype system was used to design sequences which should adopt the four-helix bundle conformation of myohemerythrin. These appear satisfactory to secondary structure and profile analysis. Detailed inspection reveals that the sequences are generally plausible but, as expected, lack some specific structural features. The design parameters provide some insight into the general determinants of protein structure. PMID:7584417

  8. Protein design by optimization of a sequence-structure quality function

    SciTech Connect

    Brenner, S.E.; Berry, A.

    1994-12-31

    An automated procedure for protein design by optimization of a sequence-structure quality has been developed. The method selects a statistically optimal sequence for a particular structure, on the assumption that such a protein will adopt the desired structure. We present two optimization algorithms: one provides an exact optimization while the other uses a combinatorial technique for comparatively rapid results. Both are suitable for massively parallel computers. A prototype system was used to design sequences which should adopt the four-helix bundle conformation of myohemerythrin. These appear satisfactory to secondary structure and profile analysis. Detailed inspection reveals that the sequences are generally plausible but, as expected, lack some specific structural features. The design parameters provide some insight into the general determinants of protein structure.

  9. Comparison between optimized GRE and RARE sequences for 19F MRI studies

    NASA Astrophysics Data System (ADS)

    Soffientini, Chiara D.; Mastropietro, Alfonso; Caffini, Matteo; Cocco, Sara; Zucca, Ileana; Scotti, Alessandro; Baselli, Giuseppe; Bruzzone, Maria Grazia

    2014-03-01

    In 19F-MRI studies limiting factors are the presence of a low signal due to the low concentration of 19F-nuclei, necessary for biological applications, and the inherent low sensitivity of MRI. Hence, acquiring images using the pulse sequence with the best signal to noise ratio (SNR) by optimizing the acquisition parameters specifically to a 19F compound is a core issue. In 19F-MRI, multiple-spin-echo (RARE) and gradient-echo (GRE) are the two most frequently used pulse sequence families; therefore we performed an optimization study of GRE pulse sequences based on numerical simulations and experimental acquisitions on fluorinated compounds. We compared GRE performance to an optimized RARE sequence. Images were acquired on a 7T MRI preclinical scanner on phantoms containing different fluorinated compounds. Actual relaxation times (T1, T2, T2*) were evaluated in order to predict SNR dependence on sequence parameters. Experimental comparisons between spoiled GRE and RARE, obtained at a fixed acquisition time and in steady state condition, showed RARE sequence outperforming the spoiled GRE (up to 406% higher). Conversely, the use of the unbalanced-SSFP showed a significant increase in SNR compared to RARE (up to 28% higher). Moreover, this sequence (as GRE in general) was confirmed to be virtually insensitive to T1 and T2 relaxation times, after proper optimization, thus improving marker independence from the biological environment. These results confirm the efficacy of the proposed optimization tool and foster further investigation addressing in-vivo applicability.

  10. Hydrogen Exchange Mass Spectrometry of Related Proteins with Divergent Sequences: A Comparative Study of HIV-1 Nef Allelic Variants

    NASA Astrophysics Data System (ADS)

    Wales, Thomas E.; Poe, Jerrod A.; Emert-Sedlak, Lori; Morgan, Christopher R.; Smithgall, Thomas E.; Engen, John R.

    2016-03-01

    Hydrogen exchange mass spectrometry can be used to compare the conformation and dynamics of proteins that are similar in tertiary structure. If relative deuterium levels are measured, differences in sequence, deuterium forward- and back-exchange, peptide retention time, and protease digestion patterns all complicate the data analysis. We illustrate what can be learned from such data sets by analyzing five variants (Consensus G2E, SF2, NL4-3, ELI, and LTNP4) of the HIV-1 Nef protein, both alone and when bound to the human Hck SH3 domain. Regions with similar sequence could be compared between variants. Although much of the hydrogen exchange features were preserved across the five proteins, the kinetics of Nef binding to Hck SH3 were not the same. These observations may be related to biological function, particularly for ELI Nef where we also observed an impaired ability to downregulate CD4 surface presentation. The data illustrate some of the caveats that must be considered for comparison experiments and provide a framework for investigations of other protein relatives, families, and superfamilies with HX MS.

  11. Strategies in protein sequencing and characterization: multi-enzyme digestion coupled with alternate CID/ETD tandem mass spectrometry.

    PubMed

    Nardiello, Donatella; Palermo, Carmen; Natale, Anna; Quinto, Maurizio; Centonze, Diego

    2015-01-01

    A strategy based on a simultaneous multi-enzyme digestion coupled with electron transfer dissociation (ETD) and collision-induced dissociation (CID) was developed for protein sequencing and characterization, as a valid alternative platform in ion-trap based proteomics. The effect of different proteolytic procedures using chymotrypsin, trypsin, a combination of both, and Lys-C, was carefully evaluated in terms of number of identified peptides, protein coverage, and score distribution. A systematic comparison between CID and ETD is shown for the analysis of peptides originating from the in-solution digestion of standard caseins. The best results were achieved with a trypsin/chymotrypsin mix combined with CID and ETD operating in alternating mode. A post-database search validation of the MS/MS dataset was performed; the matched peptides were then cross-checked by evaluating ion scores, rank, number of experimental product ions, and their relative abundances in the MS/MS spectrum. By integrated CID/ETD experiments, high-quality spectra have been obtained, thus allowing a confirmation of spectral information and an increase of accuracy in peptide sequence assignments. Overlapping peptides, produced throughout the proteins, reduce the ambiguity in mapping modifications between natural variants and animal species, and allow the characterization of post-translational modifications. The advantages of using the enzymatic mix trypsin/chymotrypsin were confirmed by the nanoLC and CID/ETD tandem mass spectrometry of goat milk proteins, previously separated by two-dimensional gel electrophoresis. PMID:25479873

  12. Hydrogen Exchange Mass Spectrometry of Related Proteins with Divergent Sequences: A Comparative Study of HIV-1 Nef Allelic Variants.

    PubMed

    Wales, Thomas E; Poe, Jerrod A; Emert-Sedlak, Lori; Morgan, Christopher R; Smithgall, Thomas E; Engen, John R

    2016-06-01

    Hydrogen exchange mass spectrometry can be used to compare the conformation and dynamics of proteins that are similar in tertiary structure. If relative deuterium levels are measured, differences in sequence, deuterium forward- and back-exchange, peptide retention time, and protease digestion patterns all complicate the data analysis. We illustrate what can be learned from such data sets by analyzing five variants (Consensus G2E, SF2, NL4-3, ELI, and LTNP4) of the HIV-1 Nef protein, both alone and when bound to the human Hck SH3 domain. Regions with similar sequence could be compared between variants. Although much of the hydrogen exchange features were preserved across the five proteins, the kinetics of Nef binding to Hck SH3 were not the same. These observations may be related to biological function, particularly for ELI Nef where we also observed an impaired ability to downregulate CD4 surface presentation. The data illustrate some of the caveats that must be considered for comparison experiments and provide a framework for investigations of other protein relatives, families, and superfamilies with HX MS. PMID:27032648

  13. Hydrogen Exchange Mass Spectrometry of Related Proteins with Divergent Sequences: A Comparative Study of HIV-1 Nef Allelic Variants

    NASA Astrophysics Data System (ADS)

    Wales, Thomas E.; Poe, Jerrod A.; Emert-Sedlak, Lori; Morgan, Christopher R.; Smithgall, Thomas E.; Engen, John R.

    2016-06-01

    Hydrogen exchange mass spectrometry can be used to compare the conformation and dynamics of proteins that are similar in tertiary structure. If relative deuterium levels are measured, differences in sequence, deuterium forward- and back-exchange, peptide retention time, and protease digestion patterns all complicate the data analysis. We illustrate what can be learned from such data sets by analyzing five variants (Consensus G2E, SF2, NL4-3, ELI, and LTNP4) of the HIV-1 Nef protein, both alone and when bound to the human Hck SH3 domain. Regions with similar sequence could be compared between variants. Although much of the hydrogen exchange features were preserved across the five proteins, the kinetics of Nef binding to Hck SH3 were not the same. These observations may be related to biological function, particularly for ELI Nef where we also observed an impaired ability to downregulate CD4 surface presentation. The data illustrate some of the caveats that must be considered for comparison experiments and provide a framework for investigations of other protein relatives, families, and superfamilies with HX MS.

  14. Sequence heuristics to encode phase behaviour in intrinsically disordered protein polymers

    NASA Astrophysics Data System (ADS)

    Quiroz, Felipe García; Chilkoti, Ashutosh

    2015-11-01

    Proteins and synthetic polymers that undergo aqueous phase transitions mediate self-assembly in nature and in man-made material systems. Yet little is known about how the phase behaviour of a protein is encoded in its amino acid sequence. Here, by synthesizing intrinsically disordered, repeat proteins to test motifs that we hypothesized would encode phase behaviour, we show that the proteins can be designed to exhibit tunable lower or upper critical solution temperature (LCST and UCST, respectively) transitions in physiological solutions. We also show that mutation of key residues at the repeat level abolishes phase behaviour or encodes an orthogonal transition. Furthermore, we provide heuristics to identify, at the proteome level, proteins that might exhibit phase behaviour and to design novel protein polymers consisting of biologically active peptide repeats that exhibit LCST or UCST transitions. These findings set the foundation for the prediction and encoding of phase behaviour at the sequence level.

  15. Sequence heuristics to encode phase behaviour in intrinsically disordered protein polymers

    PubMed Central

    Quiroz, Felipe García; Chilkoti, Ashutosh

    2015-01-01

    Proteins and synthetic polymers that undergo aqueous phase transitions mediate self-assembly in nature and in man-made material systems. Yet little is known about how the phase behaviour of a protein is encoded in its amino acid sequence. Here, by synthesizing intrinsically disordered, repeat proteins to test motifs that we hypothesized would encode phase behaviour, we show that the proteins can be designed to exhibit tunable lower or upper critical solution temperature (LCST and UCST, respectively) transitions in physiological solutions. We also show that mutation of key residues at the repeat level abolishes phase behaviour or encodes an orthogonal transition. Furthermore, we provide heuristics to identify, at the proteome level, proteins that might exhibit phase behaviour and to design novel protein polymers consisting of biologically active peptide repeats that exhibit LCST or UCST transitions. These findings set the foundation for the prediction and encoding of phase behaviour at the sequence level. PMID:26390327

  16. Sequence heuristics to encode phase behaviour in intrinsically disordered protein polymers.

    PubMed

    Quiroz, Felipe García; Chilkoti, Ashutosh

    2015-11-01

    Proteins and synthetic polymers that undergo aqueous phase transitions mediate self-assembly in nature and in man-made material systems. Yet little is known about how the phase behaviour of a protein is encoded in its amino acid sequence. Here, by synthesizing intrinsically disordered, repeat proteins to test motifs that we hypothesized would encode phase behaviour, we show that the proteins can be designed to exhibit tunable lower or upper critical solution temperature (LCST and UCST, respectively) transitions in physiological solutions. We also show that mutation of key residues at the repeat level abolishes phase behaviour or encodes an orthogonal transition. Furthermore, we provide heuristics to identify, at the proteome level, proteins that might exhibit phase behaviour and to design novel protein polymers consisting of biologically active peptide repeats that exhibit LCST or UCST transitions. These findings set the foundation for the prediction and encoding of phase behaviour at the sequence level. PMID:26390327

  17. Thermodynamic features characterizing good and bad folding sequences obtained using a simplified off-lattice protein model

    NASA Astrophysics Data System (ADS)

    Amatori, A.; Ferkinghoff-Borg, J.; Tiana, G.; Broglia, R. A.

    2006-06-01

    The thermodynamics of the small SH3 protein domain is studied by means of a simplified model where each beadlike amino acid interacts with the others through a contact potential controlled by a 20×20 random matrix. Good folding sequences, characterized by a low native energy, display three main thermodynamical ensembles, namely, a coil-like ensemble, an unfolded globule, and a folded ensemble (plus two other states, frozen and random coils, populated only at extreme temperatures). Interestingly, the unfolded globule has some regions already structured. Poorly designed sequences, on the other hand, display a wide transition from the random coil to a frozen state. The comparison with the analytic theory of heteropolymers is discussed.
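
    The kind of energy function described above can be sketched as follows: each bead carries one of 20 residue types, and every pair of non-bonded beads closer than a cutoff contributes one entry of a symmetric 20×20 matrix. The Gaussian matrix entries, the 6.5 Å cutoff, the exclusion of chain neighbours, and all function names are assumptions for illustration, not details given in the abstract.

```python
import random
from itertools import combinations
from math import dist


def random_contact_matrix(seed=0, n_types=20):
    """Symmetric n_types x n_types matrix with Gaussian entries (assumed distribution)."""
    rng = random.Random(seed)
    matrix = [[0.0] * n_types for _ in range(n_types)]
    for i in range(n_types):
        for j in range(i, n_types):
            matrix[i][j] = matrix[j][i] = rng.gauss(0.0, 1.0)
    return matrix


def contact_energy(types, coords, matrix, cutoff=6.5):
    """Sum matrix entries over non-bonded bead pairs closer than the cutoff."""
    energy = 0.0
    for i, j in combinations(range(len(types)), 2):
        # Skip beads that are direct neighbours along the chain.
        if j - i >= 2 and dist(coords[i], coords[j]) < cutoff:
            energy += matrix[types[i]][types[j]]
    return energy


# Hypothetical three-bead chain bent so that beads 0 and 2 are in contact.
matrix = random_contact_matrix(seed=1)
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
e_bent = contact_energy([0, 1, 2], bent, matrix)
e_straight = contact_energy([0, 1, 2], straight, matrix)
```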

  18. The iceLogo web server and SOAP service for determining protein consensus sequences.

    PubMed

    Maddelein, Davy; Colaert, Niklaas; Buchanan, Iain; Hulstaert, Niels; Gevaert, Kris; Martens, Lennart

    2015-07-01

    The iceLogo web server and SOAP service implement the previously published iceLogo algorithm. iceLogo builds on probability theory to visualize protein consensus sequences in a format resembling sequence logos. Peptide sequences are compared against a reference sequence set that can be tailored to the studied system and the used protocol. As such, not only over- but also underrepresented residues can be visualized in a statistically sound manner, which further allows the user to easily analyse and interpret conserved sequence patterns in proteins. The web application and SOAP service can be found free and open to all users without the need for a login on http://iomics.ugent.be/icelogoserver/main.html. PMID:25897125
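
    The comparison against a reference set can be pictured as a per-position frequency test. The sketch below scores each residue at each position by how far its frequency in the experimental set departs from its frequency in the reference set, using a normal approximation to the binomial. This is a simplified illustration under that assumption, not the iceLogo implementation; the function name and toy peptides are hypothetical.

```python
from collections import Counter
from math import sqrt


def position_z_scores(experimental, reference):
    """Per-position z-scores of residue frequencies versus a reference set.

    Both inputs are lists of equal-length peptide strings. Positive scores
    mean over-representation, negative scores mean under-representation.
    """
    n_exp = len(experimental)
    n_ref = len(reference)
    results = []
    for pos in range(len(experimental[0])):
        exp_counts = Counter(seq[pos] for seq in experimental)
        ref_counts = Counter(seq[pos] for seq in reference)
        scores = {}
        for aa in set(exp_counts) | set(ref_counts):
            p_ref = ref_counts[aa] / n_ref
            p_exp = exp_counts[aa] / n_exp
            sd = sqrt(p_ref * (1 - p_ref) / n_exp)
            if sd > 0:  # residues absent from or fixed in the reference are skipped
                scores[aa] = (p_exp - p_ref) / sd
        results.append(scores)
    return results


# Toy data, invented for the example.
scores = position_z_scores(["AK", "AK", "AR", "AK"], ["AK", "GR", "AR", "GK"])
```

    Because the reference frequencies enter the score directly, changing the reference set changes which residues appear enriched or depleted, which is the reason the abstract stresses tailoring the reference to the system under study.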

  19. Definition and Analysis of a System for the Automated Comparison of Curriculum Sequencing Algorithms in Adaptive Distance Learning

    ERIC Educational Resources Information Center

    Limongelli, Carla; Sciarrone, Filippo; Temperini, Marco; Vaste, Giulia

    2011-01-01

    LS-Lab provides automatic support to comparison/evaluation of the Learning Object Sequences produced by different Curriculum Sequencing Algorithms. Through this framework a teacher can verify the correspondence between the behaviour of different sequencing algorithms and her pedagogical preferences. In fact the teacher can compare algorithms…

  20. Enzyme sequence similarity improves the reaction alignment method for cross-species pathway comparison

    SciTech Connect

    Ovacik, Meric A.; Androulakis, Ioannis P.

    2013-09-15

    Pathway-based information has become an important source of information for both establishing evolutionary relationships and understanding the mode of action of a chemical or pharmaceutical among species. Cross-species comparison of pathways can address two broad questions: comparison in order to inform evolutionary relationships and to extrapolate species differences used in a number of different applications including drug and toxicity testing. Cross-species comparison of metabolic pathways is complex as there are multiple features of a pathway that can be modeled and compared. Among the various methods that have been proposed, reaction alignment has emerged as the most successful at predicting phylogenetic relationships based on NCBI taxonomy. We propose an improvement of the reaction alignment method by accounting for sequence similarity in addition to reaction alignment method. Using nine species, including human and some model organisms and test species, we evaluate the standard and improved comparison methods by analyzing glycolysis and citrate cycle pathways conservation. In addition, we demonstrate how organism comparison can be conducted by accounting for the cumulative information retrieved from nine pathways in central metabolism as well as a more complete study involving 36 pathways common in all nine species. Our results indicate that reaction alignment with enzyme sequence similarity results in a more accurate representation of pathway specific cross-species similarities and differences based on NCBI taxonomy.

  1. DNA linking number change induced by sequence-specific DNA-binding proteins

    PubMed Central

    Chen, Bo; Xiao, Yazhong; Liu, Chang; Li, Chenzhong; Leng, Fenfei

    2010-01-01

    Sequence-specific DNA-binding proteins play a key role in many fundamental biological processes, such as transcription, DNA replication and recombination. Very often, these DNA-binding proteins introduce structural changes to the target DNA-binding sites including DNA bending, twisting or untwisting and wrapping, which in many cases induce a linking number change (ΔLk) to the DNA-binding site. Due to the lack of a feasible approach, ΔLk induced by sequence-specific DNA-binding proteins has not been fully explored. In this paper we successfully constructed a series of DNA plasmids that carry many tandem copies of a DNA-binding site for one sequence-specific DNA-binding protein, such as λ O, LacI, GalR, CRP and AraC. In this case, the protein-induced ΔLk was greatly amplified and can be measured experimentally. Indeed, not only were we able to simultaneously determine the protein-induced ΔLk and the DNA-binding constant for λ O and GalR, but also we demonstrated that the protein-induced ΔLk is an intrinsic property for these sequence-specific DNA-binding proteins. Our results also showed that protein-mediated DNA looping by AraC and LacI can induce a ΔLk to the plasmid DNA templates. Furthermore, we demonstrated that the protein-induced ΔLk does not correlate with the protein-induced DNA bending by the DNA-binding proteins. PMID:20185570

  2. Combining phage display with de novo protein sequencing for reverse engineering of monoclonal antibodies.

    PubMed

    Rickert, Keith W; Grinberg, Luba; Woods, Robert M; Wilson, Susan; Bowen, Michael A; Baca, Manuel

    2016-04-01

    The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identify one which can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially-obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3-5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited identical antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material. PMID:26852694

  3. Sequence analysis and protein import studies of an outer chloroplast envelope polypeptide.

    PubMed Central

    Salomon, M; Fischer, K; Flügge, U I; Soll, J

    1990-01-01

    A chloroplast outer envelope membrane protein was cloned and sequenced and from the sequence it was possible to deduce a polypeptide of 6.7 kDa. It has only one membrane-spanning region; the C terminus extends into the cytosol, whereas the N terminus is exposed to the space between the two envelope membranes. The protein was synthesized in an in vitro transcription-translation system to study its routing into isolated chloroplasts. The import studies revealed that the 6.7-kDa protein followed a different and heretofore undescribed translocation pathway in the respect that (i) it does not have a cleavable transit sequence, (ii) it does not require ATP hydrolysis for import, and (iii) protease-sensitive components that are responsible for recognition of precursor proteins destined for the inside of the chloroplasts are not involved in routing the 6.7-kDa polypeptide to the outer chloroplast envelope. PMID:2377616

  4. A nuclear protein associated with human cancer cells binds preferentially to a human repetitive DNA sequence

    SciTech Connect

    Gao, J.; Law, M.L.; Puck, T.T. (Univ. of Colorado Health Sciences Center, Denver)

    1989-11-01

    A protein (Rp66) of 66 kDa was shown by DNA-binding protein blot assay to bind to a human repetitive DNA sequence (low-repeat sequences; LRS) in each of 10 transformed human cell lines examined. This protein-DNA interaction was not observed in 11 normal human cell cultures or in the Chinese hamster cell line CHO-K1. Gel retardation assay confirmed the specificity of the protein-DNA binding between Rp66 and LRS. In a histiocytic lymphoma human cell line, U937, that can be induced to differentiate in the presence of phorbol ester, this binding disappeared after cell differentiation. These together with other results cited suggest a regulatory role for these repetitive sequences in the human genome, with particular application to cancer.

  5. Hypothesis/review: the structural basis of sweetness perception of sweet-tasting plant proteins can be deduced from sequence analysis.

    PubMed

    Wintjens, René; Viet, Tran Melody Vu Ngoc; Mbosso, Emmanuel; Huet, Joëlle

    2011-10-01

    Human perception of sweetness, behind the felt pleasure, is thought to play a role as an indicator of energy density of foods. For humans, only a small number of plant proteins taste sweet. As non-caloric sweeteners, these plant proteins have attracted attention as candidates for the control of obesity, oral health and diabetic management. Significant advances have been made in the characterization of the sweet-tasting plant proteins, as well as their binding interactions with the appropriate receptors. The elucidation of sweet-taste receptor gene sequences represents an important step towards the understanding of sweet taste perception. However, many questions on the molecular basis of sweet-taste elicitation by plant proteins remain unanswered. In particular, why do homologues of these proteins not elicit similar responses? This question is discussed in this report, on the basis of available sequences and structures of sweet-tasting proteins, as well as of sweetness-sensing receptors. A simple procedure based on sequence comparisons between a sweet-tasting protein and its homologous counterparts was proposed to identify critical residues for sweetness elicitation. The open question on the physiological function of sweet-tasting plant proteins is also considered. In particular, this review leads us to suggest that sweet-tasting proteins may interact with the taste receptor in a serendipitous manner. PMID:21889040

  6. Prediction of antibiotic resistance proteins from sequence-derived properties irrespective of sequence similarity.

    PubMed

    Zhang, H L; Lin, H H; Tao, L; Ma, X H; Dai, J L; Jia, J; Cao, Z W

    2008-09-01

    Increasing antibiotic resistance has become a worldwide challenge to the clinical treatment of infectious diseases. The identification of antibiotic resistance proteins (ARPs) would be helpful in the discovery of new therapeutic targets and the design of novel drugs to control the potential spread of antibiotic resistance. In this work, a support vector machine (SVM)-based ARP prediction system was developed using 1308 ARPs and 15587 non-ARPs. Its performance was evaluated using 313 ARPs and 7156 non-ARPs. The computed prediction accuracy was 88.5% for ARPs and 99.2% for non-ARPs. A potential application of this method is the identification of ARPs non-homologous to proteins of known function. Further genome screening found that ca. 3.5% and 3.2% of proteins in Escherichia coli and Staphylococcus aureus, respectively, are potential ARPs. These results suggest the usefulness of SVMs for facilitating the identification of ARPs. The software can be accessed at SARPI (Server for Antibiotic Resistance Protein Identification). PMID:18583101
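
    The per-class accuracies quoted above can be turned into approximate confusion-matrix counts, which makes the effect of the class imbalance in the test set easier to see. The sketch below does that arithmetic; the rounding to whole proteins and the derived precision are approximations computed from the reported figures, not values stated in the abstract, and the function name is hypothetical.

```python
def confusion_from_rates(n_pos, n_neg, sensitivity, specificity):
    """Approximate (TP, FN, TN, FP) counts from class sizes and per-class accuracy."""
    tp = round(n_pos * sensitivity)
    tn = round(n_neg * specificity)
    return tp, n_pos - tp, tn, n_neg - tn


# Test-set sizes and per-class accuracies as reported in the abstract.
n_arp, n_non_arp = 313, 7156
tp, fn, tn, fp = confusion_from_rates(n_arp, n_non_arp, 0.885, 0.992)

# Derived quantities (approximate, because the counts are rounded).
precision = tp / (tp + fp)            # fraction of predicted ARPs that are true ARPs
overall_accuracy = (tp + tn) / (n_arp + n_non_arp)
```

    Because non-ARPs outnumber ARPs by more than twenty to one, even a sub-1% false-positive rate yields a noticeable number of false positives relative to true positives, so precision is lower than either per-class accuracy.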

  7. Approaching a complete repository of sequence-verified protein-encoding clones for Saccharomyces cerevisiae

    PubMed Central

    Hu, Yanhui; Rolfs, Andreas; Bhullar, Bhupinder; Murthy, Tellamraju V. S.; Zhu, Cong; Berger, Michael F.; Camargo, Anamaria A.; Kelley, Fontina; McCarron, Seamus; Jepson, Daniel; Richardson, Aaron; Raphael, Jacob; Moreira, Donna; Taycher, Elena; Zuo, Dongmei; Mohr, Stephanie; Kane, Michael F.; Williamson, Janice; Simpson, Andrew; Bulyk, Martha L.; Harlow, Edward; Marsischky, Gerald; Kolodner, Richard D.; LaBaer, Joshua

    2007-01-01

    The availability of an annotated genome sequence for the yeast Saccharomyces cerevisiae has made possible the proteome-scale study of protein function and protein–protein interactions. These studies rely on availability of cloned open reading frame (ORF) collections that can be used for cell-free or cell-based protein expression. Several yeast ORF collections are available, but their use and data interpretation can be hindered by reliance on now out-of-date annotations, the inflexible presence of N- or C-terminal tags, and/or the unknown presence of mutations introduced during the cloning process. High-throughput biochemical and genetic analyses would benefit from a “gold standard” (fully sequence-verified, high-quality) ORF collection, which allows for high confidence in and reproducibility of experimental results. Here, we describe Yeast FLEXGene, a S. cerevisiae protein-coding clone collection that covers over 5000 predicted protein-coding sequences. The clone set covers 87% of the current S. cerevisiae genome annotation and includes full sequencing of each ORF insert. Availability of this collection makes possible a wide variety of studies from purified proteins to mutation suppression analysis, which should contribute to a global understanding of yeast protein function. PMID:17322287

  8. N-Terminal Amino Acid Sequence Determination of Proteins by N-Terminal Dimethyl Labeling: Pitfalls and Advantages When Compared with Edman Degradation Sequence Analysis.

    PubMed

    Chang, Elizabeth; Pourmal, Sergei; Zhou, Chun; Kumar, Rupesh; Teplova, Marianna; Pavletich, Nikola P; Marians, Kenneth J; Erdjument-Bromage, Hediye

    2016-07-01

    In recent history, alternative approaches to Edman sequencing have been investigated, and to this end, the Association of Biomolecular Resource Facilities (ABRF) Protein Sequencing Research Group (PSRG) initiated studies in 2014 and 2015, looking into bottom-up and top-down N-terminal (Nt) dimethyl derivatization of standard quantities of intact proteins with the aim to determine Nt sequence information. We have expanded this initiative and used low picomole amounts of myoglobin to determine the efficiency of Nt-dimethylation. Application of this approach on protein domains, generated by limited proteolysis of overexpressed proteins, confirms that it is a universal labeling technique and is very sensitive when compared with Edman sequencing. Finally, we compared Edman sequencing and Nt-dimethylation of the same polypeptide fragments; results confirm that there is agreement in the identity of the Nt amino acid sequence between these 2 methods. PMID:27006647

  9. N-Terminal Amino Acid Sequence Determination of Proteins by N-Terminal Dimethyl Labeling: Pitfalls and Advantages When Compared with Edman Degradation Sequence Analysis

    PubMed Central

    Chang, Elizabeth; Pourmal, Sergei; Zhou, Chun; Kumar, Rupesh; Teplova, Marianna; Pavletich, Nikola P.; Marians, Kenneth J.

    2016-01-01

    In recent history, alternative approaches to Edman sequencing have been investigated, and to this end, the Association of Biomolecular Resource Facilities (ABRF) Protein Sequencing Research Group (PSRG) initiated studies in 2014 and 2015, looking into bottom-up and top-down N-terminal (Nt) dimethyl derivatization of standard quantities of intact proteins with the aim to determine Nt sequence information. We have expanded this initiative and used low picomole amounts of myoglobin to determine the efficiency of Nt-dimethylation. Application of this approach on protein domains, generated by limited proteolysis of overexpressed proteins, confirms that it is a universal labeling technique and is very sensitive when compared with Edman sequencing. Finally, we compared Edman sequencing and Nt-dimethylation of the same polypeptide fragments; results confirm that there is agreement in the identity of the Nt amino acid sequence between these 2 methods. PMID:27006647

  10. OrfPredictor: predicting protein-coding regions in EST-derived sequences.

    PubMed

    Min, Xiang Jia; Butler, Gregory; Storms, Reginald; Tsang, Adrian

    2005-07-01

    OrfPredictor is a web server designed for identifying protein-coding regions in expressed sequence tag (EST)-derived sequences. For query sequences with a hit in BLASTX, the program predicts the coding regions based on the translation reading frames identified in BLASTX alignments, otherwise, it predicts the most probable coding region based on the intrinsic signals of the query sequences. The output is the predicted peptide sequences in the FASTA format, and a definition line that includes the query ID, the translation reading frame and the nucleotide positions where the coding region begins and ends. OrfPredictor facilitates the annotation of EST-derived sequences, particularly, for large-scale EST projects. OrfPredictor is available at https://fungalgenome.concordia.ca/tools/OrfPredictor.html. PMID:15980561
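
    For queries without a BLASTX hit, one very simple stand-in for an intrinsic-signal prediction is to take the longest stop-free stretch across all six reading frames. The sketch below does only that, using the standard genetic code; it is an illustrative assumption, not OrfPredictor's actual algorithm, and the function names are hypothetical. No start codon is required, since EST-derived sequences are often partial.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_orf(nt):
    """Return (peptide, frame) for the longest stop-free stretch in six frames.

    Frames are numbered +1..+3 on the given strand and -1..-3 on the
    reverse complement. Ties keep the first frame examined.
    """
    nt = nt.upper()
    best = ("", 0)
    for strand, seq in ((1, nt), (-1, revcomp(nt))):
        for offset in range(3):
            for segment in translate(seq[offset:]).split("*"):
                if len(segment) > len(best[0]):
                    best = (segment, strand * (offset + 1))
    return best
```

    A real predictor would also report the nucleotide start and end positions of the coding region, as the abstract describes for the FASTA definition line; that bookkeeping is omitted here.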

  11. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega

    PubMed Central

    Sievers, Fabian; Wilm, Andreas; Dineen, David; Gibson, Toby J; Karplus, Kevin; Li, Weizhong; Lopez, Rodrigo; McWilliam, Hamish; Remmert, Michael; Söding, Johannes; Thompson, Julie D; Higgins, Desmond G

    2011-01-01

    Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam. PMID:21988835

  12. pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs

    SciTech Connect

    Wu, Changjun; Kalyanaraman, Anantharaman; Cannon, William R.

    2012-09-15

    Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet, there is currently no robust support to parallelize this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm that is suited for detecting homology on large data sets using distributed memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048 processor distributed memory cluster for a wide range of inputs ranging from as small as 20,000 sequences to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.

  13. A machine learning strategy to identify candidate binding sites in human protein-coding sequence

    PubMed Central

    Down, Thomas; Leong, Bernard; Hubbard, Tim JP

    2006-01-01

    Background: The splicing of RNA transcripts is thought to be partly promoted and regulated by sequences embedded within exons. Known sequences include binding sites for SR proteins, which are thought to mediate interactions between splicing factors bound to the 5' and 3' splice sites. It would be useful to identify further candidate sequences, however identifying them computationally is hard since exon sequences are also constrained by their functional role in coding for proteins. Results: This strategy identified a collection of motifs including several previously reported splice enhancer elements. Although only trained on coding exons, the model discriminates both coding and non-coding exons from intragenic sequence. Conclusion: We have trained a computational model able to detect signals in coding exons which seem to be orthogonal to the sequences' primary function of coding for proteins. We believe that many of the motifs detected here represent binding sites for both previously unrecognized proteins which influence RNA splicing as well as other regulatory elements. PMID:17002805

  14. A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins

    SciTech Connect

    Sawle, Lucas; Ghosh, Kingshuk

    2015-08-28

    A general formalism to compute configurational properties of proteins and other heteropolymers with an arbitrary sequence of charges and non-uniform excluded volume interaction is presented. A variational approach is utilized to predict average distance between any two monomers in the chain. The presented analytical model, for the first time, explicitly incorporates the role of sequence charge distribution to determine relative sizes between two sequences that vary not only in total charge composition but also in charge decoration (even when charge composition is fixed). Furthermore, the formalism is general enough to allow variation in excluded volume interactions between two monomers. Model predictions are benchmarked against the all-atom Monte Carlo studies of Das and Pappu [Proc. Natl. Acad. Sci. U. S. A. 110, 13392 (2013)] for 30 different synthetic sequences of polyampholytes. These sequences possess an equal number of glutamic acid (E) and lysine (K) residues but differ in the patterning within the sequence. Without any fit parameter, the model captures the strong sequence dependence of the simulated values of the radius of gyration with a correlation coefficient of R^2 = 0.9. The model is then applied to real proteins to compare the unfolded state dimensions of 540 orthologous pairs of thermophilic and mesophilic proteins. The excluded volume parameters are assumed similar under denatured conditions, and only electrostatic effects encoded in the sequence are accounted for. With these assumptions, thermophilic proteins are found, with high statistical significance, to have a more compact disordered ensemble compared to their mesophilic counterparts. The method presented here, due to its analytical nature, is capable of making such high throughput analysis of multiple proteins and will have broad applications in proteomic studies as well as in other heteropolymeric systems.

  15. A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins

    NASA Astrophysics Data System (ADS)

    Sawle, Lucas; Ghosh, Kingshuk

    2015-08-01

    A general formalism to compute configurational properties of proteins and other heteropolymers with an arbitrary sequence of charges and non-uniform excluded volume interaction is presented. A variational approach is utilized to predict average distance between any two monomers in the chain. The presented analytical model, for the first time, explicitly incorporates the role of sequence charge distribution to determine relative sizes between two sequences that vary not only in total charge composition but also in charge decoration (even when charge composition is fixed). Furthermore, the formalism is general enough to allow variation in excluded volume interactions between two monomers. Model predictions are benchmarked against the all-atom Monte Carlo studies of Das and Pappu [Proc. Natl. Acad. Sci. U. S. A. 110, 13392 (2013)] for 30 different synthetic sequences of polyampholytes. These sequences possess an equal number of glutamic acid (E) and lysine (K) residues but differ in the patterning within the sequence. Without any fit parameter, the model captures the strong sequence dependence of the simulated values of the radius of gyration with a correlation coefficient of R^2 = 0.9. The model is then applied to real proteins to compare the unfolded state dimensions of 540 orthologous pairs of thermophilic and mesophilic proteins. The excluded volume parameters are assumed similar under denatured conditions, and only electrostatic effects encoded in the sequence are accounted for. With these assumptions, thermophilic proteins are found, with high statistical significance, to have a more compact disordered ensemble compared to their mesophilic counterparts. The method presented here, due to its analytical nature, is capable of making such high throughput analysis of multiple proteins and will have broad applications in proteomic studies as well as in other heteropolymeric systems.

  16. Characterizing informative sequence descriptors and predicting binding affinities of heterodimeric protein complexes

    PubMed Central

    2015-01-01

    Background Protein-protein interactions (PPIs) are involved in various biological processes, and underlying mechanism of the interactions plays a crucial role in therapeutics and protein engineering. Most machine learning approaches have been developed for predicting the binding affinity of protein-protein complexes based on structure and functional information. This work aims to predict the binding affinity of heterodimeric protein complexes from sequences only. Results This work proposes a support vector machine (SVM) based binding affinity classifier, called SVM-BAC, to classify heterodimeric protein complexes based on the prediction of their binding affinity. SVM-BAC identified 14 of 580 sequence descriptors (physicochemical, energetic and conformational properties of the 20 amino acids) to classify 216 heterodimeric protein complexes into low and high binding affinity. SVM-BAC yielded the training accuracy, sensitivity, specificity, AUC and test accuracy of 85.80%, 0.89, 0.83, 0.86 and 83.33%, respectively, better than existing machine learning algorithms. The 14 features and support vector regression were further used to estimate the binding affinities (Pkd) of 200 heterodimeric protein complexes. Prediction performance of a Jackknife test was the correlation coefficient of 0.34 and mean absolute error of 1.4. We further analyze three informative physicochemical properties according to their contribution to prediction performance. Results reveal that the following properties are effective in predicting the binding affinity of heterodimeric protein complexes: apparent partition energy based on buried molar fractions, relations between chemical structure and biological activity in principal component analysis IV, and normalized frequency of beta turn. Conclusions The proposed sequence-based prediction method SVM-BAC uses an optimal feature selection method to identify 14 informative features to classify and predict binding affinity of heterodimeric protein

  17. Assessing a novel approach for predicting local 3D protein structures from sequence.

    PubMed

    Benros, Cristina; de Brevern, Alexandre G; Etchebest, Catherine; Hazout, Serge

    2006-03-01

    We developed a novel approach for predicting local protein structure from sequence. It relies on the Hybrid Protein Model (HPM), an unsupervised clustering method we previously developed. This model learns three-dimensional protein fragments encoded into a structural alphabet of 16 protein blocks (PBs). Here, we focused on 11-residue fragments encoded as a series of seven PBs and used HPM to cluster them according to their local similarities. We thus built a library of 120 overlapping prototypes (mean fragments from each cluster), with good three-dimensional local approximation, i.e., a mean accuracy of 1.61 Å Cα root-mean-square distance. Our prediction method is intended to optimize the exploitation of the sequence-structure relations deduced from this library of long protein fragments. This was achieved by setting up a system of 120 experts, each defined by logistic regression to optimize the discrimination from sequence of a given prototype relative to the others. For a target sequence window, the experts computed probabilities of sequence-structure compatibility for the prototypes and ranked them, proposing the top scorers as structural candidates. Predictions were defined as successful when a prototype <2.5 Å from the true local structure was found among those proposed. Our strategy yielded a prediction rate of 51.2% for an average of 4.2 candidates per sequence window. We also proposed a confidence index to estimate prediction quality. Our approach predicts from sequence alone and will thus provide valuable information for proteins without structural homologs. Candidates will also contribute to global structure prediction by fragment assembly. PMID:16385557
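
    The expert-ranking step described in this abstract (one logistic-regression expert per prototype, with the top scorers proposed as candidates) can be sketched as follows. This is an illustrative sketch only, not the authors' code: the feature vector, the weights, the three-expert library, and the names rank_prototypes and experts are all hypothetical, and the abstract does not say how the sequence window is encoded.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rank_prototypes(features, experts, top_k=4):
    """Score one sequence window with every expert and return the best top_k.

    experts maps a prototype id to (weights, bias), i.e. the parameters of a
    hypothetical logistic-regression model trained for that prototype.
    """
    scores = {}
    for proto_id, (weights, bias) in experts.items():
        z = bias + sum(w * x for w, x in zip(weights, features))
        scores[proto_id] = sigmoid(z)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(p, scores[p]) for p in ranked[:top_k]]


# Toy library of three experts instead of the 120 described in the abstract.
experts = {
    "P1": ([1.0, 0.0], 0.0),
    "P2": ([0.0, 1.0], 0.0),
    "P3": ([-1.0, -1.0], 0.0),
}
candidates = rank_prototypes([2.0, 1.0], experts, top_k=2)
```

    Returning several candidates rather than the single best one mirrors the abstract's evaluation, in which a prediction counts as successful when any proposed prototype is close to the true local structure.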

  18. Silkmoth chorion proteins: sequence analysis of the products of a multigene family.

    PubMed Central

    Regier, J C; Kafatos, F C; Goodfliesh, R; Hood, L

    1978-01-01

    Five polypeptide components have been isolated from the eggshell (chorions) of a silkmoth. Two are homogeneous on sodium dodecyl sulfate and isoelectric focusing gels, and three contain predominantly two proteins each. Amino acid analyses show that all five components are similar to each other. These proteins have been sequenced from the amino terminus. Homogeneous components yielded single sequences; heterogeneous components yielded two residues at some positions, consistent with their containing two major electrophoretic components. Striking similarities are apparent among all these sequences. These similarities can be increased dramatically by separating each of the three protein mixtures into two sequences and introducing a small number of gaps or insertions. This is due in part to bringing into register a portion that contains short repeating subunits found in all sequences. All proteins are also characterized by a region of high cysteine content near the amino terminus followed by a longer low-cysteine region. The data suggest that these proteins share a common evolutionary origin and are encoded by a multigene family. PMID:272655

  19. Protein location prediction using atomic composition and global features of the amino acid sequence

    SciTech Connect

    Cherian, Betsy Sheena; Nair, Achuthsankar S.

    2010-01-22

    The subcellular location of a protein is useful information for determining its function, screening for drug candidates, vaccine design, annotation of gene products and selecting relevant proteins for further studies. Computational prediction of subcellular localization deals with predicting the location of a protein from its amino acid sequence. For a computational localization prediction method to be more accurate, it should exploit all possible relevant biological features that contribute to the subcellular localization. In this work, we extracted the biological features from the full length protein sequence to incorporate more biological information. A new biological feature, the distribution of atomic composition, is used effectively together with multiple physicochemical properties, amino acid composition, three-part amino acid composition, and sequence similarity for predicting the subcellular location of the protein. Support Vector Machines are designed for four modules and prediction is made by a weighted voting system. Our system makes predictions with accuracies of 100, 82.47, and 88.81 for the self-consistency test, jackknife test and independent data test, respectively. Our results provide evidence that prediction based on the biological features derived from the full length amino acid sequence gives better accuracy than prediction based on features derived from the N-terminal alone. Considering the features as a distribution within the entire sequence will bring out the underlying property distribution in greater detail and enhance the prediction accuracy.
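
    The weighted voting step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the module names, the weights, or the tie-breaking rule, so the values below and the function name weighted_vote are assumptions for illustration, not the authors' system.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-module labels into one label by summing module weights."""
    totals = defaultdict(float)
    for module, label in predictions.items():
        totals[label] += weights[module]
    # The label with the largest accumulated weight wins.
    return max(totals, key=totals.get)


# Hypothetical outputs of four SVM modules for a single protein.
predictions = {
    "atomic": "nucleus",
    "physchem": "cytoplasm",
    "composition": "nucleus",
    "similarity": "cytoplasm",
}
# Hypothetical module weights, e.g. derived from validation accuracy.
weights = {"atomic": 0.2, "physchem": 0.3, "composition": 0.2, "similarity": 0.4}

winner = weighted_vote(predictions, weights)
```

    In this toy case the two modules voting for the cytoplasm carry more total weight than the two voting for the nucleus, so a plain majority vote would tie while the weighted vote does not.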

  20. Cloud Computing for Protein-Ligand Binding Site Comparison

    PubMed Central

    2013-01-01

    The proteome-wide analysis of protein-ligand binding sites and their interactions with ligands is important in structure-based drug design and in understanding ligand cross reactivity and toxicity. The well-known and commonly used software, SMAP, has been designed for 3D ligand binding site comparison and similarity searching of a structural proteome. SMAP can also predict drug side effects and reassign existing drugs to new indications. However, the computing scale of SMAP is limited. We have developed a high availability, high performance system that expands the comparison scale of SMAP. This cloud computing service, called Cloud-PLBS, combines the SMAP and Hadoop frameworks and is deployed on a virtual cloud computing platform. To handle the vast amount of experimental data on protein-ligand binding site pairs, Cloud-PLBS exploits the MapReduce paradigm as a management and parallelizing tool. Cloud-PLBS provides a web portal and scalability through which biologists can address a wide range of computer-intensive questions in biology and drug discovery. PMID:23762824

  1. Cloud computing for protein-ligand binding site comparison.

    PubMed

    Hung, Che-Lun; Hua, Guan-Jie

    2013-01-01

    The proteome-wide analysis of protein-ligand binding sites and their interactions with ligands is important in structure-based drug design and in understanding ligand cross reactivity and toxicity. The well-known and commonly used software, SMAP, has been designed for 3D ligand binding site comparison and similarity searching of a structural proteome. SMAP can also predict drug side effects and reassign existing drugs to new indications. However, the computing scale of SMAP is limited. We have developed a high availability, high performance system that expands the comparison scale of SMAP. This cloud computing service, called Cloud-PLBS, combines the SMAP and Hadoop frameworks and is deployed on a virtual cloud computing platform. To handle the vast amount of experimental data on protein-ligand binding site pairs, Cloud-PLBS exploits the MapReduce paradigm as a management and parallelizing tool. Cloud-PLBS provides a web portal and scalability through which biologists can address a wide range of computer-intensive questions in biology and drug discovery. PMID:23762824

  2. Using CATH-Gene3D to Analyze the Sequence, Structure, and Function of Proteins.

    PubMed

    Sillitoe, Ian; Lewis, Tony; Orengo, Christine

    2015-01-01

    The CATH database is a classification of protein structures found in the Protein Data Bank (PDB). Protein structures are chopped into individual units of structural domains, and these domains are grouped together into superfamilies if there is sufficient evidence that they have diverged from a common ancestor during the process of evolution. A sister resource, Gene3D, extends this information by scanning sequence profiles of these CATH domain superfamilies against many millions of known proteins to identify related sequences. Thus the combined CATH-Gene3D resource provides confident predictions of the likely structural fold, domain organisation, and evolutionary relatives of these proteins. In addition, this resource incorporates annotations from a large number of external databases such as known enzyme active sites, GO molecular functions, physical interactions, and mutations. This unit details how to access and understand the information contained within the CATH-Gene3D Web pages, the downloadable data files, and the remotely accessible Web services. PMID:26087950

  3. Computational Framework for Prediction of Peptide Sequences That May Mediate Multiple Protein Interactions in Cancer-Associated Hub Proteins

    PubMed Central

    Sarkar, Debasree; Patra, Piya; Ghosh, Abhirupa; Saha, Sudipto

    2016-01-01

    A considerable proportion of protein-protein interactions (PPIs) in the cell are estimated to be mediated by very short peptide segments that approximately conform to specific sequence patterns known as linear motifs (LMs), often present in the disordered regions in the eukaryotic proteins. These peptides have been found to interact with low affinity and are able to bind to multiple interactors, thus playing an important role in the PPI networks involving date hubs. In this work, PPI data and de novo motif identification based method (MEME) were used to identify such peptides in three cancer-associated hub proteins: MYC, APC and MDM2. The peptides corresponding to the significant LMs identified for each hub protein were aligned, the overlapping regions across these peptides being termed as overlapping linear peptides (OLPs). These OLPs were thus predicted to be responsible for multiple PPIs of the corresponding hub proteins and a scoring system was developed to rank them. We predicted six OLPs in MYC and five OLPs in MDM2 that scored higher than OLP predictions from randomly generated protein sets. Two OLP sequences from the C-terminal of MYC were predicted to bind with FBXW7, component of an E3 ubiquitin-protein ligase complex involved in proteasomal degradation of MYC. Similarly, we identified peptides in the C-terminal of MDM2 interacting with FKBP3, which has a specific role in auto-ubiquitinylation of MDM2. The peptide sequences predicted in MYC and MDM2 look promising for designing orthosteric inhibitors against possible disease-associated PPIs. Since these OLPs can interact with other proteins as well, these inhibitors should be specific to the targeted interactor to prevent undesired side-effects. This computational framework has been designed to predict and rank the peptide regions that may mediate multiple PPIs and can be applied to other disease-associated date hub proteins for prediction of novel therapeutic targets of small molecule PPI modulators. PMID

  4. Nucleotide sequence and characterization of peb4A encoding an antigenic protein in Campylobacter jejuni.

    PubMed

    Burucoa, C; Frémaux, C; Pei, Z; Tummuru, M; Blaser, M J; Cenatiempo, Y; Fauchère, J L

    1995-01-01

    The 29-kDa protein PEB4, a major antigen of Campylobacter jejuni, is present in all C. jejuni strains tested and elicits an antibody response in infected patients. By screening a lambda gt11 library of chromosomal DNA fragments of C. jejuni strain 81-176 in Escherichia coli Y1090 cells with antibody raised against purified PEB4, a recombinant phage with a 2-kb insert expressing an immunoreactive protein of 29 kDa was isolated. DNA sequence analysis revealed that the insert contains two complete open reading frames, ORF-A and ORF-B. ORF-A (peb4A) encodes a 273-residue protein with a calculated molecular mass of 30,460 daltons. The deduced amino acid sequence, composition and pI of the recombinant mature protein are similar to those determined for purified PEB4. The first 21 residues resemble a signal peptide. Gene bank searches indicated 33.7% identity with protein export protein PrsA of Bacillus subtilis and 23.8% identity with protease maturation protein precursor PrtM of Lactococcus lactis. PCR experiments indicate that peb4A is highly conserved among C. jejuni strains. ORF-B begins 2 bp after the last codon of peb4A and encodes a putative protein of 353 residues with 63.4% identity with E. coli fructose 1,6-biphosphate aldolase. The sequence arrangement suggests that these two genes form an operon. PMID:8525063
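
    Percent identity figures such as those quoted in this abstract are computed from a pairwise alignment. The sketch below shows one common convention, counting identical residues over aligned columns that contain no gap. The abstract does not state which convention or alignment program was used, so the denominator choice, the function name percent_identity, and the toy sequences are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over gap-free columns of two pre-aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment, not taken from PEB4 or PrsA.
pid = percent_identity("MKT-AYIA", "MKSQAYLA")
```

    Other conventions divide by the full alignment length or by the length of the shorter sequence, which gives lower values for the same alignment, so reported identities are only comparable when the convention is the same.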

  5. A Comparison of the First Two Sequenced Chloroplast Genomes in Asteraceae: Lettuce and Sunflower

    SciTech Connect

    Timme, Ruth E.; Kuehl, Jennifer V.; Boore, Jeffrey L.; Jansen, Robert K.

    2006-01-20

    Asteraceae is the second largest family of plants, with over 20,000 species. For the past few decades, numerous phylogenetic studies have contributed to our understanding of the evolutionary relationships within this family, including comparisons of the fast evolving chloroplast gene, ndhF, rbcL, as well as non-coding DNA from the trnL intron plus the trnLtrnF intergenic spacer, matK, and, with lesser resolution, psbA-trnH. This culminated in a study by Panero and Funk in 2002 that used over 13,000 bp per taxon for the largest taxonomic revision of Asteraceae in over a hundred years. Still, some uncertainties remain, and it would be very useful to have more information on the relative rates of sequence evolution among various genes and on genome structure as a potential set of phylogenetic characters to help guide future phylogenetic structures. By way of contributing to this, we report the first two complete chloroplast genome sequences from members of the Asteraceae, those of Helianthus annuus and Lactuca sativa. These plants belong to two distantly related subfamilies, Asteroideae and Cichorioideae, respectively. In addition to these, there is only one other published chloroplast genome sequence for any plant within the larger group called Eusterids II, that of Panax ginseng (Araliaceae, 156,318 bps, AY582139). Early chloroplast genome mapping studies demonstrated that H. annuus and L. sativa share a 22 kb inversion relative to members of the subfamily Barnadesioideae. By comparison to outgroups, this inversion was shown to be derived, indicating that the Asteroideae and Cichorioideae are more closely related than either is to the Barnadesioideae. Later sequencing study found that taxa that share this 22 kb inversion also contain within this region a second, smaller, 3.3 kb inversion. These sequences also enable an analysis of patterns of shared repeats in the genomes at fine level and of RNA editing by comparison to available EST sequences. 
In addition, since

  6. Orpinomyces cellulase CelE protein and coding sequences

    DOEpatents

    Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong

    2000-08-29

    A cDNA designated celE cloned from Orpinomyces PC-2 encodes a polypeptide (CelE) of 477 amino acids. CelE is highly homologous to CelB of Orpinomyces (72.3% identity) and Neocallimastix (67.9% identity), and like them, it has a non-catalytic repeated peptide domain (NCRPD) at the C-terminal end. The catalytic domain of CelE is homologous to glycosyl hydrolases of Family 5, found in several anaerobic bacteria. The celE gene is devoid of introns. The recombinant proteins CelE and CelB of Orpinomyces PC-2 randomly hydrolyze carboxymethylcellulose and cello-oligosaccharides in the pattern of endoglucanases.

  7. Myelin protein zero gene sequencing diagnoses Charcot-Marie-Tooth Type 1B disease

    SciTech Connect

    Su, Y.; Zhang, H.; Madrid, R.

    1994-09-01

    Charcot-Marie-Tooth disease (CMT), the most common genetic neuropathy, affects about 1 in 2600 people in Norway and is found worldwide. CMT Type 1 (CMT1) has slow nerve conduction with demyelinated Schwann cells. Autosomal dominant CMT Type 1B (CMT1B) results from mutations in the myelin protein zero gene which directs the synthesis of more than half of all Schwann cell protein. This gene was mapped to the chromosome 1q22-1q23.1 borderline by fluorescence in situ hybridization. The first 7 of 7 reported CMT1B mutations are unique. Thus the most effective means to identify CMT1B mutations in at-risk family members and fetuses is to sequence the entire coding sequence in dominant or sporadic CMT patients without the CMT1A duplication. Of the 19 primers used in 16 pairs to uniquely amplify the entire MPZ coding sequence, 6 primer pairs were used to amplify and sequence the 6 exons. The DyeDeoxy Terminator cycle sequencing method used with four different color fluorescent labels was superior to manual sequencing because it sequences more bases unambiguously from extracted genomic DNA samples within 24 hours. This protocol was used to test 28 CMT and Dejerine-Sottas patients without CMT1A gene duplication. Sequencing MPZ gene-specific amplified fragments identified 9 polymorphic sites within the 6 exons that encode the 248 amino acid MPZ protein. The large number of major CMT1B mutations identified by single strand sequencing are being verified by reverse strand sequencing and, when possible, by restriction enzyme analysis. This protocol can be used to distinguish CMT1B patients from other CMT phenotypes and to determine the CMT1B status of relatives both presymptomatically and prenatally.

  8. Integrating mRNA and protein sequencing enables the detection and quantitative profiling of natural protein sequence variants of Populus trichocarpa

    SciTech Connect

    Abraham, Paul E.; Wang, Xiaojing; Ranjan, Priya; Zhang, Bing; Tuskan, Gerald A.; Robert L. Hettich; Nookaew, Intawat

    2015-10-20

    The availability of next-generation sequencing technologies has rapidly transformed our ability to link genotypes to phenotypes, and as such, promises to facilitate the dissection of genetic contribution to complex traits. Although discoveries of genetic associations will further our understanding of biology, once candidate variants have been identified, investigators are faced with the challenge of characterizing the functional effects on proteins encoded by such genes. Here we show how next-generation RNA sequencing data can be exploited to construct genotype-specific protein sequence databases, which provide a clearer picture of the molecular toolbox underlying cellular and organismal processes and their variation in a natural population. For this study, we used two individual genotypes (DENA-17-3 and VNDL-27-4) from a recent genome wide association (GWA) study of Populus trichocarpa, an obligate outcrosser that exhibits tremendous phenotypic variation across the natural population. This strategy allowed us to comprehensively catalogue proteins containing single amino acid polymorphisms (SAAPs) and insertions and deletions (INDELS). Based on large-scale identification of SAAPs, we profiled the frequency of 128 types of naturally occurring amino acid substitutions, with a subset of SAAPs occurring in regions of the genome having strong polymorphism patterns consistent with recent positive and/or divergent selection. In addition, we were able to explore the diploid landscape of Populus at the proteome-level, allowing the characterization of heterozygous variants.

  9. Integrating mRNA and protein sequencing enables the detection and quantitative profiling of natural protein sequence variants of Populus trichocarpa

    DOE PAGESBeta

    Abraham, Paul E.; Wang, Xiaojing; Ranjan, Priya; Zhang, Bing; Tuskan, Gerald A.; Robert L. Hettich; Nookaew, Intawat

    2015-10-20

    The availability of next-generation sequencing technologies has rapidly transformed our ability to link genotypes to phenotypes, and as such, promises to facilitate the dissection of genetic contribution to complex traits. Although discoveries of genetic associations will further our understanding of biology, once candidate variants have been identified, investigators are faced with the challenge of characterizing the functional effects on proteins encoded by such genes. Here we show how next-generation RNA sequencing data can be exploited to construct genotype-specific protein sequence databases, which provide a clearer picture of the molecular toolbox underlying cellular and organismal processes and their variation in a natural population. For this study, we used two individual genotypes (DENA-17-3 and VNDL-27-4) from a recent genome wide association (GWA) study of Populus trichocarpa, an obligate outcrosser that exhibits tremendous phenotypic variation across the natural population. This strategy allowed us to comprehensively catalogue proteins containing single amino acid polymorphisms (SAAPs) and insertions and deletions (INDELS). Based on large-scale identification of SAAPs, we profiled the frequency of 128 types of naturally occurring amino acid substitutions, with a subset of SAAPs occurring in regions of the genome having strong polymorphism patterns consistent with recent positive and/or divergent selection. In addition, we were able to explore the diploid landscape of Populus at the proteome-level, allowing the characterization of heterozygous variants.

  10. HUGE: a database for human large proteins identified in the Kazusa cDNA sequencing project.

    PubMed

    Kikuno, R; Nagase, T; Suyama, M; Waki, M; Hirosawa, M; Ohara, O

    2000-01-01

    HUGE is a database for human large proteins newly identified in the Kazusa cDNA project, the aim of which is to predict the primary structure of proteins from the sequences of human large cDNAs (>4 kb). In particular, cDNA clones capable of coding for large proteins (>50 kDa) are the current targets of the project. HUGE contains >1100 cDNA sequences and detailed information obtained through analysis of the sequences of cDNAs and the predicted proteins. Besides an increase in the number of cDNA entries, the amount of experimental data for expression profiling has been largely increased and data on chromosomal locations have been newly added. All of the protein-coding regions were examined by GeneMark analysis, and the results of a motif/domain search of each predicted protein sequence against the Pfam database have been newly added. HUGE is available through the WWW at http://www.kazusa.or.jp/huge PMID:10592264

  11. RTA, a candidate G protein-coupled receptor: cloning, sequencing, and tissue distribution.

    PubMed Central

    Ross, P C; Figler, R A; Corjay, M H; Barber, C M; Adam, N; Harcus, D R; Lynch, K R

    1990-01-01

    Genomic and cDNA clones, encoding a protein that is a member of the guanine nucleotide-binding regulatory protein (G protein)-coupled receptor superfamily, were isolated by screening rat genomic and thoracic aorta cDNA libraries with an oligonucleotide encoding a highly conserved region of the M1 muscarinic acetylcholine receptor. Sequence analyses of these clones showed that they encode a 343-amino acid protein (named RTA). The RTA gene is single copy, as demonstrated by restriction mapping and Southern blotting of genomic clones and rat genomic DNA. Sequence analysis of the genomic clone further showed that the RTA gene has an intron interrupting the region encoding the amino terminus of the protein. RTA RNA sequences are relatively abundant throughout the gut, vas deferens, uterus, and aorta but are only barely detectable (on Northern blots) in liver, kidney, lung, and salivary gland. In the rat brain, RTA sequences are markedly abundant in the cerebellum. RTA is most closely related to the mas oncogene (34% identity), which has been suggested to be a forebrain angiotensin receptor. We cannot detect angiotensin binding to the RTA protein after introducing the cognate cDNA or mRNA into COS cells or Xenopus oocytes, respectively, nor can we detect an electrophysiologic response in the oocyte after application of angiotensin peptides. We conclude that RTA is not an angiotensin receptor; to date, we have been unable to identify its ligand. PMID:2109324

  12. Recognition sequences and structural elements contribute to shedding susceptibility of membrane proteins.

    PubMed Central

    Althoff, K; Müllberg, J; Aasland, D; Voltz, N; Kallen, K; Grötzinger, J; Rose-John, S

    2001-01-01

    Although regulated ectodomain shedding affects a large panel of structurally and functionally unrelated proteins, little is known about the mechanisms controlling this process. Despite a lack of sequence similarities around cleavage sites, most proteins are shed in response to the stimulation of protein kinase C by phorbol esters. The signal-transducing receptor subunit gp130 is not a substrate of the regulated shedding machinery. We generated several chimaeric proteins of gp130 and the proteins tumour necrosis factor alpha (TNF-alpha), transforming growth factor alpha (TGF-alpha) and interleukin 6 receptor (IL-6R), which are known to be subject to shedding. By exchanging small peptide sequences of gp130 for cleavage-site peptides of TNF-alpha, TGF-alpha and IL-6R we showed that these short sequences conferred susceptibility to spontaneous and phorbol-ester-induced shedding of gp130. Importantly, these chimaeric gp130 proteins were functional, as shown by the phosphorylation of gp130 and the activation of signal transduction and activators of transcription 3 ('STAT3') on stimulation with cytokine. To investigate minimal requirements for shedding, truncated cleavage-site peptides of IL-6R were inserted into gp130. The resulting chimaeras were susceptible to shedding and showed the same cleavage pattern as observed in the chimaeras containing the complete IL-6R cleavage site. Surprisingly, we could also generate cleavable chimaeras by exchanging the juxtamembrane sequence of gp130 for the corresponding region of leukaemia inhibitory factor ('LIF') receptor, a protein that like gp130 is not subject to regulated or spontaneous shedding. Thus it seems that there is no minimal consensus shedding sequence. We speculate that structural changes allow the access of the protease to a membrane-proximal region, leading to shedding of the protein. PMID:11171064

  13. Comparison of three tests for estimating gastroenteral protein loss

    SciTech Connect

    Glaubitti, D.; Marx, M.; Weller, H.

    1984-01-01

    A decisive step in the diagnosis of exudative gastroenteropathy which shows a pathologically increased transfer of plasma proteins into the stomach or intestine is the measurement of fecal radioactivity after intravenous administration of radionuclide-labeled large organic compounds or of small inorganic compounds attaching themselves to plasma proteins within the patient. In 24 patients (12 men and women each) aged 40 to 66 years, the gastroenteral protein loss was estimated after intravenous injection of Cr-51 chloride, Cr-51 human serum albumin, or Fe-59 iron dextran. Each test lasted 6 days. There was an interval of 2 weeks between 2 tests. The feces were collected completely within the test period for determination of radioactivity. External probe counting over liver, spleen, right kidney, and thyroid was performed daily up to 10 days. The results obtained with Cr-51 chloride presented the largest range whereas the test with Fe-59 iron dextran exhibited both the smallest deviation from the mean value and the lowest normal range. During the tests for gastroenteral protein loss external probe counting demonstrated no distinct tendency to a more rapid radionuclide loss from liver, spleen, and kidney in the patients suffering from exudative gastroenteropathy when compared with healthy subjects. The authors conclude that the most suitable test to estimate gastroenteral protein loss is the Fe-59 iron dextran test although Fe-59 iron dextran is not available commercially and causes a higher radiation burden than the other tests do. In second place, the Cr-51 chloride test should be used, the radiopharmaceutical of which is less expensive and has no significant disadvantage in comparison with Cr-51 human serum albumin.

  14. A comparison of protein extraction methods suitable for gel-based proteomic studies of aphid proteins.

    PubMed

    Cilia, M; Fish, T; Yang, X; McLaughlin, M; Thannhauser, T W; Gray, S

    2009-09-01

    Protein extraction methods can vary widely in reproducibility and in representation of the total proteome, yet there are limited data comparing protein isolation methods. The methodical comparison of protein isolation methods is the first critical step for proteomic studies. To address this, we compared three methods for isolation, purification, and solubilization of insect proteins. The aphid Schizaphis graminum, an agricultural pest, was the source of insect tissue. Proteins were extracted using TCA in acetone (TCA-acetone), phenol, or multi-detergents in a chaotrope solution. Extracted proteins were solubilized in a multiple chaotrope solution and examined using 1-D and 2-D electrophoresis and compared directly using 2-D Difference Gel Electrophoresis (2-D DIGE). Mass spectrometry was used to identify proteins from each extraction type. We were unable to ascribe the differences in the proteins extracted to particular physical characteristics, cell location, or biological function. The TCA-acetone extraction yielded the greatest amount of protein from aphid tissues. Each extraction method isolated a unique subset of the aphid proteome. The TCA-acetone method was explored further for its quantitative reliability using 2-D DIGE. Principal component analysis showed that little of the variation in the data was a result of technical issues, thus demonstrating that the TCA-acetone extraction is a reliable method for preparing aphid proteins for a quantitative proteomics experiment. These data suggest that although the TCA-acetone method is a suitable method for quantitative aphid proteomics, a combination of extraction approaches is recommended for increasing proteome coverage when using gel-based separation techniques. PMID:19721822

  15. Unique graphical representation of protein sequences based on nucleotide triplet codons

    NASA Astrophysics Data System (ADS)

    Randić, Milan; Zupan, Jure; Balaban, Alexandru T.

    2004-10-01

    We consider a graphical representation of proteins as an alternative to the usual representation of proteins as a sequence listing the natural amino acids. The approach is based on a graphical representation of triplets of DNA in which the interior of a square or the interior of a tetrahedron is used to accommodate 64 sites for the 64 codons. By associating a zigzag curve and various matrices with a protein, just as was the case with graphical representation of DNA, one can construct selected invariants to serve as protein descriptors. The approach is illustrated on the A-chain of human insulin.

  16. Modifications in the purification protocol of Celosia cristata antiviral proteins lead to protein that can be N-terminally sequenced.

    PubMed

    Gholizadeh, Ashraf; Kapoor, H C

    2004-12-01

    Plant antiviral proteins are being used as anticancer agents and as inhibitors of viral diseases in humans. We modified the purification protocol of the two N-terminally blocked antiviral glycoproteins, CCP-25 and CCP-27, purified from the leaves of Celosia cristata. This not only gave rise to single pure samples with few steps of purification but also resulted in N-terminally free proteins. The extra purity of the samples was analyzed by reverse phase HPLC. Deglycosylation studies of CCP-25 with PNGase F enzyme revealed that its asparagine or asparagine-linked glycon contents are negligible. Partial N-terminal sequencing of CCP-25 revealed the sequence ANDIS, which seems to be conserved among plant antiviral proteins. PMID:15579125

  17. Conversion of amino-acid sequence in proteins to classical music: search for auditory patterns

    PubMed Central

    2007-01-01

    We have converted genome-encoded protein sequences into musical notes to reveal auditory patterns without compromising musicality. We derived a reduced range of 13 base notes by pairing similar amino acids and distinguishing them using variations of three-note chords and codon distribution to dictate rhythm. The conversion will help make genomic coding sequences more approachable for the general public, young children, and vision-impaired scientists. PMID:17477882

  18. Cloning and sequencing of a cDNA encoding a taste-modifying protein, miraculin.

    PubMed

    Masuda, Y; Nirasawa, S; Nakaya, K; Kurihara, Y

    1995-08-19

    A cDNA clone encoding a taste-modifying protein, miraculin (MIR), was isolated and sequenced. The encoded precursor to MIR was composed of 220 amino acid (aa) residues, including a possible signal sequence of 29 aa. Northern blot analysis showed that the mRNA encoding MIR was already expressed in fruits of Richadella dulcifica at 3 weeks after pollination and was present specifically in the pulp. PMID:7665074

  19. Ancient origin for Hawaiian Drosophilinae inferred from protein comparisons.

    PubMed Central

    Beverley, S M; Wilson, A C

    1985-01-01

    Immunological comparisons of a larval hemolymph protein enabled us to build a tree relating major groups of drosophiline flies in Hawaii to one another and to continental flies. The tree agrees in topology with that based on internal anatomy. Relative rate tests suggest that evolution of hemolymph proteins has been about as fast in Hawaii as on continents. Since the absolute rate of evolution of hemolymph proteins in continental flies is known, one can erect an approximate time scale for Hawaiian fly evolution. According to this scale, the Hawaiian fly fauna stems from a colonist that landed on the archipelago about 42 million years ago, i.e., before any of the present islands harboring drosophilines formed. This date fits with the geological history of the archipelago, which has witnessed the sequential rise and erosion of many islands during the past 70 million years. We discuss the bearing of the molecular time scale on views about rates of organismal evolution in the Hawaiian flies. PMID:3860822

  20. α-Amylase of Clostridium thermosulfurogenes EM1: Nucleotide sequence of the gene, processing of the enzyme, and comparison to other α-amylases

    SciTech Connect

    Bahl, H.; Burchhardt, G.; Spreinat, A.; Haeckel, K.; Wienecke, A.; Antranikian, G.; Schmidt, B.

    1991-05-01

    The nucleotide sequence of the α-amylase gene (amyA) from Clostridium thermosulfurogenes EM1 cloned in Escherichia coli was determined. The reading frame of the gene consisted of 2,121 bp. Comparison of the DNA sequence data with the amino acid sequence of the N terminus of the purified secreted protein of C. thermosulfurogenes EM1 suggested that the α-amylase is translated from mRNA as a secretory precursor with a signal peptide of 27 amino acid residues. The deduced amino acid sequence of the mature α-amylase contained 679 residues, resulting in a protein with a molecular mass of 75,112 Da. In E. coli the enzyme was transported to the periplasmic space and the signal peptide was cleaved at exactly the same site between two alanine residues. Comparison of the amino acid sequence of the C. thermosulfurogenes EM1 α-amylase with those from other bacterial and eukaryotic α-amylases showed several homologous regions, probably in the enzymatically functioning regions. The tentative Ca(2+)-binding site (consensus region I) of this Ca(2+)-independent enzyme showed only limited homology. The deduced amino acid sequence of a second obviously truncated open reading frame showed significant homology to the malG gene product of E. coli. Comparison of the α-amylase gene region of C. thermosulfurogenes EM1 (DSM3896) with the β-amylase gene region of C. thermosulfurogenes (ATCC 33743) indicated that both genes have been exchanged with each other at identical sites in the chromosomes of these strains.
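
    The lengths quoted in this abstract can be cross-checked with a little arithmetic. The sketch below assumes that the 2,121 bp reading frame includes the stop codon (the abstract does not say so explicitly); the function name and parameters are illustrative only. Under that assumption, 2,121 bp gives 707 codons, 706 precursor residues, and 679 mature residues once the 27-residue signal peptide is removed, which agrees with the figure reported above.

```python
def mature_length(orf_bp: int, signal_peptide_aa: int, includes_stop: bool = True) -> int:
    """Number of residues in the mature protein after signal-peptide cleavage.

    Assumption (not stated in the abstract): orf_bp counts the stop codon
    when includes_stop is True.
    """
    if orf_bp % 3 != 0:
        raise ValueError("reading frame length must be a multiple of 3")
    codons = orf_bp // 3
    # The stop codon does not encode an amino acid.
    precursor_aa = codons - 1 if includes_stop else codons
    return precursor_aa - signal_peptide_aa


if __name__ == "__main__":
    # Values taken from the abstract: 2,121 bp reading frame, 27-residue signal peptide.
    print(mature_length(2121, 27))
```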

  1. SIMAP—the database of all-against-all protein sequence similarities and annotations with new interfaces and increased coverage

    PubMed Central

    Arnold, Roland; Goldenberg, Florian; Mewes, Hans-Werner; Rattei, Thomas

    2014-01-01

    The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith–Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating the interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow connecting SIMAP with any other bioinformatic tool and resource. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to cover the rapidly growing sequenced protein sequence universe. New Web-Service interfaces enhance the connectivity of SIMAP. Novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal; the portal also contains instructions and links for software access and flat file downloads. PMID:24165881
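
    For readers unfamiliar with the Smith–Waterman algorithm named above, a minimal sketch of its local-alignment score follows. It is not SIMAP's implementation: the simple match/mismatch scores and the linear gap penalty are assumptions chosen for brevity, whereas production tools typically use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local-alignment score between sequences a and b.

    Illustrative scoring only (assumed values): +2 match, -1 mismatch,
    -2 per gap position.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACDXX", "YYACDYY"))
```

    Keeping only two rows of the table is enough when only the score is needed; recovering the alignment itself would require the full table and a traceback.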

  2. Hfqs in Bacillus anthracis: Role of protein sequence variation in the structure and function of proteins in the Hfq family.

    PubMed

    Vrentas, Catherine; Ghirlando, Rodolfo; Keefer, Andrea; Hu, Zonglin; Tomczak, Aurelie; Gittis, Apostolos G; Murthi, Athulaprabha; Garboczi, David N; Gottesman, Susan; Leppla, Stephen H

    2015-11-01

    Hfq proteins in Gram-negative bacteria play important roles in bacterial physiology and virulence, mediated by binding of the Hfq hexamer to small RNAs and/or mRNAs to post-transcriptionally regulate gene expression. However, the physiological role of Hfqs in Gram-positive bacteria is less clear. Bacillus anthracis, the causative agent of anthrax, uniquely expresses three distinct Hfq proteins, two from the chromosome (Hfq1, Hfq2) and one from its pXO1 virulence plasmid (Hfq3). The protein sequences of Hfq1 and 3 are evolutionarily distinct from those of Hfq2 and of Hfqs found in other Bacilli. Here, the quaternary structure of each B. anthracis Hfq protein, as produced heterologously in Escherichia coli, was characterized. While Hfq2 adopts the expected hexamer structure, Hfq1 does not form similarly stable hexamers in vitro. The impact on the monomer-hexamer equilibrium of varying Hfq C-terminal tail length and other sequence differences among the Hfqs was examined, and a sequence region of the Hfq proteins that was involved in hexamer formation was identified. It was found that, in addition to the distinct higher-order structures of the Hfq homologs, they give rise to different phenotypes. Hfq1 has a disruptive effect on the function of E. coli Hfq in vivo, while Hfq3 expression at high levels is toxic to E. coli but also partially complements Hfq function in E. coli. These results set the stage for future studies of the roles of these proteins in B. anthracis physiology and for the identification of sequence determinants of phenotypic complementation. PMID:26271475

  3. PDP-CON: prediction of domain/linker residues in protein sequences using a consensus approach.

    PubMed

    Chatterjee, Piyali; Basu, Subhadip; Zubek, Julian; Kundu, Mahantapas; Nasipuri, Mita; Plewczynski, Dariusz

    2016-04-01

    The prediction of domain/linker residues in protein sequences is a crucial task in the functional classification of proteins, homology-based protein structure prediction, and high-throughput structural genomics. In this work, a novel consensus-based machine-learning technique was applied for residue-level prediction of the domain/linker annotations in protein sequences using ordered/disordered regions along protein chains and a set of physicochemical properties. Six different classifiers (decision tree, Gaussian naïve Bayes, linear discriminant analysis, support vector machine, random forest, and multilayer perceptron) were exhaustively explored for the residue-level prediction of domain/linker regions. The protein sequences from the curated CATH database were used for training and cross-validation experiments. Test results obtained by applying the developed PDP-CON tool to the mutually exclusive, independent proteins of the CASP-8, CASP-9, and CASP-10 databases are reported. An n-star quality consensus approach was used to combine the results yielded by different classifiers. The average PDP-CON accuracy and F-measure values for the CASP targets were found to be 0.86 and 0.91, respectively. The dataset, source code, and all supplementary materials for this work are available at https://cmaterju.org/cmaterbioinfo/ for noncommercial use. PMID:26969678
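
    One way to picture a consensus over several classifiers is a per-residue vote. The sketch below assumes that an "n-star" consensus labels a residue as linker when at least n classifiers agree; that reading, the 0/1 label encoding, and the function name are assumptions for illustration and are not taken from the PDP-CON source code.

```python
from typing import List, Sequence


def n_star_consensus(predictions: Sequence[Sequence[int]], n: int) -> List[int]:
    """Combine per-residue labels from several classifiers.

    predictions[k][i] is classifier k's label for residue i
    (1 = linker, 0 = domain; an assumed encoding).
    A residue is called linker when at least n classifiers say so.
    """
    if len({len(p) for p in predictions}) > 1:
        raise ValueError("all classifiers must label the same number of residues")
    return [1 if sum(votes) >= n else 0 for votes in zip(*predictions)]


if __name__ == "__main__":
    # Three hypothetical classifiers labelling a three-residue chain.
    votes = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
    print(n_star_consensus(votes, 2))
```

    Raising n trades sensitivity for precision: with n equal to the number of classifiers, only unanimous calls survive.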

  4. Comparative ribosomal protein sequence analyses of a phylogenetically defined genus, Pseudomonas, and its relatives.

    PubMed

    Ochi, K

    1995-04-01

    I analyzed various families of ribosomal proteins obtained from selected species belonging to the genus Pseudomonas sensu stricto and allied organisms which were previously classified in the genus Pseudomonas. Partial amino acid sequencing of L30 preparations revealed that the strains which I examined could be divided into three clusters. The first cluster, which was assigned to the genus Pseudomonas sensu stricto, included Pseudomonas aeruginosa, Pseudomonas putida, Pseudomonas mendocina, and Pseudomonas fluorescens. The second cluster included Burkholderia pickettii and Burkholderia plantarii. The third cluster, which was a deeply branching cluster in the stem of gram-negative bacteria, included Brevundimonas diminuta and Brevundimonas vesicularis. Despite the different levels of conservation of the N-terminal sequences of ribosomal protein families (the highest level of similarity was 74% for L27 proteins and the lowest level of similarity was 42% for L30 proteins), similar phylogenetic trees were constructed by using data obtained from sequence analyses of various ribosomal protein families, including the S20, S21, L27, L29, L31, L32, and L33 protein families. Thus, I demonstrated the efficacy of ribosomal protein analysis in bacterial taxonomy. PMID:7727274

  5. Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties.

    PubMed

    Neuwald, Andrew F; Altschul, Stephen F

    2016-05-01

    We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/. PMID:27192614

  6. Sequence analysis of the gene for the glucan-binding protein of Streptococcus mutans Ingbritt.

    PubMed Central

    Banas, J A; Russell, R R; Ferretti, J J

    1990-01-01

    The nucleotide sequence of the gbp gene, which encodes the glucan-binding protein (GBP) of Streptococcus mutans, was determined. The reading frame for gbp was 1,689 bases. A ribosome-binding site and putative promoter preceded the start codon, and potential stem-loop structures were identified downstream from the termination codon. The deduced amino acid sequence of the GBP revealed the presence of a signal peptide of 35 amino acids. The molecular weight of the processed protein was calculated to be 59,039. Two series of repeats spanned three-quarters of the carboxy-terminal end of the protein. The repeats were 32 to 34 and 17 to 20 amino acids in length and shared partial identity within each series. The repeats were found to be homologous to sequences hypothesized to be involved in glucan binding in the GTF-I of S. downei and to sequences within the protein products encoded by gtfB and gtfC of S. mutans. The repeated sequences may represent peptide segments that are important to glucan binding and may be distributed among GBPs from other bacterial inhabitants of plaque or the oral cavity. PMID:2307516

  7. False occurrences of functional motifs in protein sequences highlight evolutionary constraints

    PubMed Central

    Via, Allegra; Gherardini, Pier Federico; Ferraro, Enrico; Ausiello, Gabriele; Scalia Tomba, Gianpaolo; Helmer-Citterich, Manuela

    2007-01-01

    Background: False occurrences of functional motifs in protein sequences can be considered as random events due solely to the sequence composition of a proteome. Here we use a numerical approach to investigate the random appearance of functional motifs with the aim of addressing biological questions such as: How are organisms protected from undesirable occurrences of motifs otherwise selected for their functionality? Has the random appearance of functional motifs in protein sequences been affected during evolution? Results: Here we analyse the occurrence of functional motifs in random sequences and compare it to that observed in biological proteomes; the behaviour of random motifs is also studied. Most motifs exhibit a number of false positives significantly similar to the number of times they appear in randomized proteomes (=expected number of false positives). Interestingly, about 3% of the analysed motifs show a different kind of behaviour and appear in biological proteomes less than they do in random sequences. In some of these cases, a mechanism of evolutionary negative selection is apparent; this helps to prevent unwanted functionalities which could interfere with cellular mechanisms. Conclusion: Our thorough statistical and biological analysis showed that there are several mechanisms and evolutionary constraints both of which affect the appearance of functional motifs in protein sequences. PMID:17331242
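
    The comparison described in this abstract, observed motif counts against counts in composition-preserving random sequences, can be sketched in a few lines. This is a toy illustration rather than the authors' procedure: the shuffling null model, the number of shuffles, the example sequence, and the simplified regular expression for an N-glycosylation-like pattern are all assumptions.

```python
import random
import re


def count_motif(seq: str, pattern: str) -> int:
    """Count motif occurrences, including overlapping ones, via a lookahead."""
    return len(re.findall(f"(?=(?:{pattern}))", seq))


def expected_in_shuffles(seq: str, pattern: str,
                         n_shuffles: int = 200, seed: int = 0) -> float:
    """Mean motif count over shuffled copies of seq.

    Shuffling keeps the residue composition and destroys the order, so the
    result estimates how often the motif arises by composition alone.
    """
    rng = random.Random(seed)
    residues = list(seq)
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        total += count_motif("".join(residues), pattern)
    return total / n_shuffles


if __name__ == "__main__":
    sequence = "MNASKLNGTPLNPSAANVTQ"  # hypothetical sequence
    motif = "N[^P][ST]"                # simplified, illustrative pattern
    print(count_motif(sequence, motif), expected_in_shuffles(sequence, motif))
```

    A motif whose observed count falls well below the shuffled expectation would be a candidate for the negative selection discussed above, although a real analysis needs a proteome-scale null model and a significance test.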

  8. GeneSV - an Approach to Help Characterize Possible Variations in Genomic and Protein Sequences.

    PubMed

    Zemla, Adam; Kostova, Tanya; Gorchakov, Rodion; Volkova, Evgeniya; Beasley, David W C; Cardosa, Jane; Weaver, Scott C; Vasilakis, Nikos; Naraghi-Arani, Pejman

    2014-01-01

    A computational approach for identification and assessment of genomic sequence variability (GeneSV) is described. For a given nucleotide sequence, GeneSV collects information about the permissible nucleotide variability (changes that potentially preserve function) observed in corresponding regions in genomic sequences, and combines it with conservation/variability results from protein sequence and structure-based analyses of evaluated protein coding regions. GeneSV was used to predict effects (functional vs. non-functional) of 37 amino acid substitutions on the NS5 polymerase (RdRp) of dengue virus type 2 (DENV-2), 36 of which are not observed in any publicly available DENV-2 sequence. 32 novel mutants with single amino acid substitutions in the RdRp were generated using a DENV-2 reverse genetics system. In 81% (26 of 32) of predictions tested, GeneSV correctly predicted viability of introduced mutations. In 4 of 5 (80%) mutants with double amino acid substitutions proximal in structure to one another GeneSV was also correct in its predictions. Predictive capabilities of the developed system were illustrated on dengue RNA virus, but the general approach described in the manuscript for characterizing real or theoretically possible variations in genomic and protein sequences can be applied to any organism. PMID:24453480

  9. GeneSV – an Approach to Help Characterize Possible Variations in Genomic and Protein Sequences

    PubMed Central

    Zemla, Adam; Kostova, Tanya; Gorchakov, Rodion; Volkova, Evgeniya; Beasley, David W. C.; Cardosa, Jane; Weaver, Scott C.; Vasilakis, Nikos; Naraghi-Arani, Pejman

    2014-01-01

    A computational approach for identification and assessment of genomic sequence variability (GeneSV) is described. For a given nucleotide sequence, GeneSV collects information about the permissible nucleotide variability (changes that potentially preserve function) observed in corresponding regions in genomic sequences, and combines it with conservation/variability results from protein sequence and structure-based analyses of evaluated protein coding regions. GeneSV was used to predict effects (functional vs. non-functional) of 37 amino acid substitutions on the NS5 polymerase (RdRp) of dengue virus type 2 (DENV-2), 36 of which are not observed in any publicly available DENV-2 sequence. 32 novel mutants with single amino acid substitutions in the RdRp were generated using a DENV-2 reverse genetics system. In 81% (26 of 32) of predictions tested, GeneSV correctly predicted viability of introduced mutations. In 4 of 5 (80%) mutants with double amino acid substitutions proximal in structure to one another GeneSV was also correct in its predictions. Predictive capabilities of the developed system were illustrated on dengue RNA virus, but the general approach described in the manuscript for characterizing real or theoretically possible variations in genomic and protein sequences can be applied to any organism. PMID:24453480

  10. Efficient use of unlabeled data for protein sequence classification: a comparative study

    PubMed Central

    Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir

    2009-01-01

    Background: Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags (the unlabeled data). In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Results: Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. Conclusion: The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably. PMID:19426450

  11. Tough coating proteins: subtle sequence variation modulates cohesion.

    PubMed

    Das, Saurabh; Miller, Dusty R; Kaufman, Yair; Martinez Rodriguez, Nadine R; Pallaoro, Alessia; Harrington, Matthew J; Gylys, Maryte; Israelachvili, Jacob N; Waite, J Herbert

    2015-03-01

    Mussel foot protein-1 (mfp-1) is an essential constituent of the protective cuticle covering all exposed portions of the byssus (plaque and the thread) that marine mussels use to attach to intertidal rocks. The reversible complexation of Fe(3+) by the 3,4-dihydroxyphenylalanine (Dopa) side chains in mfp-1 in Mytilus californianus cuticle is responsible for its high extensibility (120%) as well as its stiffness (2 GPa) due to the formation of sacrificial bonds that help to dissipate energy and avoid accumulation of stresses in the material. We have investigated the interactions between Fe(3+) and mfp-1 from two mussel species, M. californianus (Mc) and M. edulis (Me), using both surface sensitive and solution phase techniques. Our results show that although mfp-1 homologues from both species bind Fe(3+), mfp-1 (Mc) contains Dopa with two distinct Fe(3+)-binding tendencies and prefers to form intramolecular complexes with Fe(3+). In contrast, mfp-1 (Me) is better adapted to intermolecular Fe(3+) binding by Dopa. Addition of Fe(3+) did not significantly increase the cohesion energy between the mfp-1 (Mc) films at pH 5.5. However, iron appears to stabilize the cohesive bridging of mfp-1 (Mc) films at the physiologically relevant pH of 7.5, where most other mfps lose their ability to adhere reversibly. Understanding the molecular mechanisms underpinning the capacity of M. californianus cuticle to withstand twice the strain of M. edulis cuticle is important for engineering of tunable strain tolerant composite coatings for biomedical applications. PMID:25692318

  12. Evol and ProDy for bridging protein sequence evolution and structural dynamics

    PubMed Central

    Mao, Wenzhi; Liu, Ying; Chennubhotla, Chakra; Lezon, Timothy R.; Bahar, Ivet

    2014-01-01

    Correlations between sequence evolution and structural dynamics are of utmost importance in understanding the molecular mechanisms of function and their evolution. We have integrated Evol, a new package for fast and efficient comparative analysis of evolutionary patterns and conformational dynamics, into ProDy, a computational toolbox designed for inferring protein dynamics from experimental and theoretical data. Using information-theoretic approaches, Evol coanalyzes conservation and coevolution profiles extracted from multiple sequence alignments of protein families with their inferred dynamics. Availability and implementation: ProDy and Evol are open-source and freely available under MIT License from http://prody.csb.pitt.edu/. Contact: bahar@pitt.edu PMID:24849577
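
    As a concrete example of an information-theoretic conservation measure computed from a multiple sequence alignment, the sketch below calculates per-column Shannon entropy in plain Python. It deliberately does not use the ProDy or Evol API; the function names, the gap handling, and the toy alignment are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of one alignment column, ignoring gaps ('-')."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def msa_entropy(msa: Sequence[str]) -> List[float]:
    """Per-column entropy for an alignment given as equal-length strings."""
    if len({len(s) for s in msa}) > 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_entropy("".join(col)) for col in zip(*msa)]


if __name__ == "__main__":
    toy_alignment = ["ACDE", "ACDF", "AC-G", "ATDH"]  # hypothetical alignment
    print(msa_entropy(toy_alignment))
```

    Low entropy marks conserved columns and high entropy marks variable ones; coevolution measures such as mutual information extend the same counting to pairs of columns.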

  13. Swfoldrate: predicting protein folding rates from amino acid sequence with sliding window method.

    PubMed

    Cheng, Xiang; Xiao, Xuan; Wu, Zhi-cheng; Wang, Pu; Lin, Wei-zhong

    2013-01-01

    Protein folding is the process by which a protein proceeds from its denatured state to its specific biologically active conformation. Understanding the relationship between sequences and the folding rates of proteins remains an important challenge. Most previous methods of predicting protein folding rate require the tertiary structure of a protein as an input. In this study, long-range and short-range contacts in proteins were used to derive an extended version of the pseudo amino acid composition based on a sliding window method. This method is capable of predicting protein folding rates from the amino acid sequence alone, without the aid of any structural class information. We systematically studied the contributions of individual features to folding rate prediction. The optimal features were selected by combining forward feature selection with sequential backward selection. The method was evaluated on a large dataset using the jackknife cross-validation test. The predictor was built from numerous physicochemical and statistical features of proteins using a nonlinear support vector machine (SVM) regression model, and it obtained excellent agreement between predicted and experimentally observed folding rates of proteins. The correlation coefficient is 0.9313 and the standard error is 2.2692. The prediction server is freely available at http://www.jci-bioinfo.cn/swfrate/input.jsp. PMID:22933332

  14. Characterization and amino acid sequence of a fatty acid-binding protein from human heart.

    PubMed

    Offner, G D; Brecher, P; Sawlivich, W B; Costello, C E; Troxler, R F

    1988-05-15

    The complete amino acid sequence of a fatty acid-binding protein from human heart was determined by automated Edman degradation of CNBr, BNPS-skatole [3'-bromo-3-methyl-2-(2-nitrobenzenesulphenyl)indolenine], hydroxylamine, Staphylococcus aureus V8 proteinase, tryptic and chymotryptic peptides, and by digestion of the protein with carboxypeptidase A. The sequence of the blocked N-terminal tryptic peptide from citraconylated protein was determined by collisionally induced decomposition mass spectrometry. The protein contains 132 amino acid residues, is enriched with respect to threonine and lysine, lacks cysteine, has an acetylated valine residue at the N-terminus, and has an Mr of 14768 and an isoelectric point of 5.25. This protein contains two short internal repeated sequences from residues 48-54 and from residues 114-119 located within regions of predicted beta-structure and decreasing hydrophobicity. These short repeats are contained within two longer repeated regions from residues 48-60 and residues 114-125, which display 62% sequence similarity. These regions could accommodate the charged and uncharged moieties of long-chain fatty acids and may represent fatty acid-binding domains consistent with the finding that human heart fatty acid-binding protein binds 2 mol of oleate or palmitate/mol of protein. Detailed evidence for the amino acid sequences of the peptides has been deposited as Supplementary Publication SUP 50143 (23 pages) at the British Library Lending Division, Boston Spa, Yorkshire LS23 7BQ, U.K., from whom copies may be obtained as indicated in Biochem. J. (1988) 249, 5. PMID:3421901

  15. Unraveling the sequence and structure of the protein osteocalcin from a 42 ka fossil horse

    NASA Astrophysics Data System (ADS)

    Ostrom, Peggy H.; Gandhi, Hasand; Strahler, John R.; Walker, Angela K.; Andrews, Philip C.; Leykam, Joseph; Stafford, Thomas W.; Kelly, Robert L.; Walker, Danny N.; Buckley, Mike; Humpula, James

    2006-04-01

    We report the first complete amino acid sequence and evidence of secondary structure for osteocalcin from a temperate fossil. The osteocalcin derives from a 42 ka equid bone excavated from Juniper Cave, Wyoming. Results were determined by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-MS) and Edman sequencing with independent confirmation of the sequence in two laboratories. The ancient sequence was compared to that of three modern taxa: horse (Equus caballus), zebra (Equus grevyi), and donkey (Equus asinus). Although there was no difference in sequence among modern taxa, MALDI-MS and Edman sequencing show that residues 48 and 49 of our modern horse are Thr, Ala rather than Pro, Val as previously reported (Carstanjen B., Wattiez, R., Armory, H., Lepage, O.M., Remy, B., 2002. Isolation and characterization of equine osteocalcin. Ann. Med. Vet. 146(1), 31-38). MALDI-MS and Edman sequencing data indicate that the osteocalcin sequence of the 42 ka fossil is similar to that of modern horse. Previously inaccessible structural attributes for ancient osteocalcin were observed. Glu 39 rather than Gln 39 is consistent with deamidation, a process known to occur during fossilization and aging. Two post-translational modifications were documented: Hyp 9 and a disulfide bridge. The latter suggests at least partial retention of secondary structure. As has been done for ancient DNA research, we recommend standards for preparation and criteria for authenticating results of ancient protein sequencing.

  16. Increased functional protein expression using nucleotide sequence features enriched in highly expressed genes in zebrafish

    PubMed Central

    Horstick, Eric J.; Jordan, Diana C.; Bergeron, Sadie A.; Tabor, Kathryn M.; Serpe, Mihaela; Feldman, Benjamin; Burgess, Harold A.

    2015-01-01

    Many genetic manipulations are limited by difficulty in obtaining adequate levels of protein expression. Bioinformatic and experimental studies have identified nucleotide sequence features that may increase expression; however, it is difficult to assess the relative influence of these features. Zebrafish embryos are rapidly injected with calibrated doses of mRNA, enabling the effects of multiple sequence changes to be compared in vivo. Using RNAseq and microarray data, we identified a set of genes that are highly expressed in zebrafish embryos and systematically analyzed them for enrichment of sequence features correlated with levels of protein expression. We then tested enriched features by embryo microinjection and functional tests of multiple protein reporters. Codon selection, releasing factor recognition sequence and specific introns and 3′ untranslated regions each increased protein expression between 1.5- and 3-fold. These results suggested principles for increasing protein yield in zebrafish through biomolecular engineering. We implemented these principles for rational gene design in software for codon selection (CodonZ) and plasmid vectors incorporating the most active non-coding elements. Rational gene design thus significantly boosts expression in zebrafish, and a similar approach will likely elevate expression in other animal models. PMID:25628360

  17. A Full-Genomic Sequence-Verified Protein-Coding Gene Collection for Francisella tularensis

    PubMed Central

    Murthy, Tal; Rolfs, Andreas; Hu, Yanhui; Shi, Zhenwei; Raphael, Jacob; Moreira, Donna; Kelley, Fontina; McCarron, Seamus; Jepson, Daniel; Taycher, Elena; Zuo, Dongmei; Mohr, Stephanie E.; Fernandez, Mauricio; Brizuela, Leonardo; LaBaer, Joshua

    2007-01-01

    The rapid development of new technologies for the high throughput (HT) study of proteins has increased the demand for comprehensive plasmid clone resources that support protein expression. These clones must be full-length, sequence-verified and in a flexible format. The generation of these resources requires automated pipelines supported by software management systems. Although the availability of clone resources is growing, current collections are either not complete or not fully sequence-verified. We report an automated pipeline, supported by several software applications that enabled the construction of the first comprehensive sequence-verified plasmid clone resource for more than 96% of protein coding sequences of the genome of F. tularensis, a highly virulent human pathogen and the causative agent of tularemia. This clone resource was applied to a HT protein purification pipeline successfully producing recombinant proteins for 72% of the genes. These methods and resources represent significant technological steps towards exploiting the genomic information of F. tularensis in discovery applications. PMID:17593976

  18. Isolation and characterization of a carrot nucleolar protein with structural and sequence similarity to the vertebrate PESCADILLO protein.

    PubMed

    Ueda, Kenji; Xu, Zheng-Jun; Miyagi, Nobuaki; Ono, Michiyuki; Wabiko, Hiroetsu; Masuda, Kiyoshi; Inoue, Masayasu

    2013-07-01

    The nuclear matrix is involved in many nuclear events, but its protein architecture in plants is still not fully understood. A cDNA clone was isolated by immunoscreening with a monoclonal antibody raised against nuclear matrix proteins of Daucus carota L. Its deduced amino acid sequence showed about 40% identity with the PESCADILLO protein of zebrafish and humans. Primary structure analysis of the protein revealed a Pescadillo N-terminus domain, a single breast cancer C-terminal domain, two nuclear localization signals, and a potential coiled-coil region as also found in animal PESCADILLO proteins. Therefore, we designated this gene DcPES1. Although DcPES1 mRNA was detected in all tissues examined, its levels were highest in tissues with proliferating cells. Immunofluorescence using specific antiserum against the recombinant protein revealed that DcPES1 localized exclusively in the nucleolus. Examination of fusion proteins with green fluorescent protein revealed that the N-terminal portion was important for localization to the nucleoli of tobacco and onion cells. Moreover, when the nuclear matrix of carrot cells was immunostained with an anti-DcPES1 serum, the signal was detected in the nucleolus. Therefore, the DcPES1 protein appears to be a component of or tightly bound to components of the nuclear matrix. PMID:23683933

  19. Identification and sequence analysis of grain softness protein in selected wheat, rye and triticale.

    PubMed

    Kharrazi, M A S; Bobojonov, V

    2012-01-01

    Grain softness protein (GSP) is an important protein for overcoming milling and grain defenses in the innate immunity systems of cereals. The objective of this study was to evaluate and understand GSP sequences in selected wheat, rye and triticale. Using sequences for this gene from a sequence database, we performed clustering analysis to compare the sequences obtained from 3 germplasms with other studied sequences for GSP. The maximum difference between the Hirmand GSP genotype in wheat and the database sequences was 23% in EF109396 and EF109399. Most amino acid variation between the GSP sequences involved the same amino acids. The Nikita rye GSP gene showed 64% identity with DQ269918 and AY667063. The isoelectric point in the GSP of wheat and Lasko triticale was significantly higher than that of rye GSP. In addition, parameters such as optical density, grand average of hydrophobicity, percentage of hydrophobicity and hydrophilic amino acids, and number of alpha helices and beta sheets in GSP were similar in wheat and triticale but not in wheat and rye. PMID:22869084

  20. Eimeria maxima phosphatidylinositol 4-phosphate 5-kinase: locus sequencing, characterization, and cross-phylum comparison.

    PubMed

    Goh, Mei-Yen; Pan, Mei-Zhen; Blake, Damer P; Wan, Kiew-Lian; Song, Beng-Kah

    2011-03-01

    Phosphatidylinositol 4-phosphate 5-kinase (PIP5K) may play an important role in host-cell invasion by the Eimeria species, protozoan parasites which can cause severe intestinal disease in livestock. Here, we report the structural organization of the PIP5K gene in Eimeria maxima (Weybridge strain). Two E. maxima BAC clones carrying the E. maxima PIP5K (EmPIP5K) coding sequences were selected for shotgun sequencing, yielding a 9.1-kb genomic segment. The EmPIP5K coding region was initially identified using in silico gene-prediction approaches and subsequently confirmed by mapping rapid amplification of cDNA ends and RT-PCR-generated cDNA sequence to its genomic segment. The putative EmPIP5K gene was located at position 710-8036 nt on the complementary strand and comprised 23 exons. Alignment of the 1147 amino acid sequence with previously annotated PIP5K proteins from other Apicomplexa species detected three conserved motifs encompassing the kinase core domain, which has been shown by previous protein deletion studies to be necessary for PIP5K protein function. Phylogenetic analysis provided further evidence that the putative EmPIP5K protein is orthologous to that of other Apicomplexa. Subsequent comparative gene structure characterization revealed events of intron loss/gain throughout the evolution of the apicomplexan PIP5K gene. Further scrutiny of the genomic structure revealed a possible trend towards "intron gain" between two of the motif regions. Our findings offer preliminary insights into the structural variations that have occurred during the evolution of the PIP5K locus and may aid in understanding the functional role of this gene in the cellular biology of apicomplexan parasites. PMID:20938684

  1. Effect of k-tuple length on sample-comparison with high-throughput sequencing data.

    PubMed

    Wang, Ying; Lei, Xiaoye; Wang, Shun; Wang, Zicheng; Song, Nianfeng; Zeng, Feng; Chen, Ting

    2016-01-22

    High-throughput metagenomic sequencing offers a powerful technique for comparing microbial communities. Without requiring extra reference sequences, alignment-free models with short k-tuples (k = 2-10 bp) have yielded promising results. Short k-tuples describe the overall statistical distribution, but can hardly capture the specific characteristics within one microbial community. Longer k-tuples contain more abundant information. However, because the frequency vector of long k-tuples (k ≥ 30 bp) is sparse, the statistical measures designed for short k-tuples are not applicable. In our study, we considered each tuple as a meaningful word and each sequencing dataset as a document composed of these words. The comparison between two sequencing datasets is therefore treated as "topic analysis of documents" in text mining. We designed a pipeline with long k-tuple features to compare metagenomic samples, combining algorithms from text mining and pattern recognition. The pipeline is available at http://culotuple.codeplex.com/. Experiments show that our pipeline with long k-tuple features: ① separates genomes with high similarity; ② outperforms short k-tuple models in all experiments. When k ≥ 12, the short k-tuple measures are not applicable anymore. When k is between 20 and 40, the long k-tuple pipeline obtains much better grouping results; ③ is free from the effect of sequencing platforms/protocols. We also obtained meaningful and supported biological results on the 40-tuples selected for comparison. PMID:26721429
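
    The sparsity argument in this abstract can be made concrete with a short sketch. The Python below is illustrative only and is not the authors' pipeline: the function names, the toy read, and the use of cosine similarity as the comparison measure are all assumptions introduced here. It counts overlapping k-tuples and reports what fraction of the 4^k possible tuples is actually observed, which shrinks rapidly as k grows.

```python
from collections import Counter
from math import sqrt


def ktuple_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-tuple (k-mer) in a nucleotide string."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def occupancy(counts: Counter, k: int, alphabet_size: int = 4) -> float:
    """Fraction of the alphabet_size**k possible tuples that were observed."""
    return len(counts) / alphabet_size ** k


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two count vectors (a stand-in measure, assumed here)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    read = "ACGTACGTGACG"  # hypothetical toy read, not real data
    for k in (2, 8):
        counts = ktuple_counts(read, k)
        print(k, len(counts), occupancy(counts, k))
```

    With k = 2 the toy read covers a sizeable share of the 16 possible tuples, whereas with k = 8 it touches only a handful of the 65,536 possibilities; count vectors of two samples then share few or no tuples, which is why measures built for dense short-tuple vectors stop being informative.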

  2. On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations.

    PubMed Central

    Kabsch, W; Sander, C

    1984-01-01

    The search for amino acid sequence homologies can be a powerful tool for predicting protein structure. Discovered sequence homologies are currently used in predicting the function of oncogene proteins. To sharpen this tool, we investigated the structural significance of short sequence homologies by searching proteins of known three-dimensional structure for subsequence identities. In 62 proteins with 10,000 residues, we found that the longest isolated homologies between unrelated proteins are five residues long. In 6 (out of 25) cases we saw surprising structural adaptability: the same five residues are part of an alpha-helix in one protein and part of a beta-strand in another protein. These examples show quantitatively that pentapeptide structure within a protein is strongly dependent on sequence context, a fact essentially ignored in most protein structure prediction methods: just considering the local sequence of five residues is not sufficient to predict correctly the local conformation (secondary structure). Cooperativity of length six or longer must be taken into account. Also, we are warned that in the growing practice of comparing a new protein sequence with a data base of known sequences, finding an identical pentapeptide sequence between two proteins is not a significant indication of structural similarity or of evolutionary kinship. PMID:6422466
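
    The kind of subsequence-identity search described here can be sketched in a few lines. This is a minimal illustration, assuming plain one-letter amino acid strings; the function names and the two example sequences are invented for the sketch, and it does not reproduce the authors' additional requirement that a match be isolated (not part of a longer homology).

```python
from collections import defaultdict


def kmer_positions(seq: str, k: int) -> dict:
    """Map each k-residue substring to the list of its 0-based start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def shared_kmers(seq_a: str, seq_b: str, k: int = 5) -> dict:
    """Return identical k-residue peptides found in both sequences.

    Each value is a pair (start positions in seq_a, start positions in seq_b).
    """
    pos_a = kmer_positions(seq_a, k)
    pos_b = kmer_positions(seq_b, k)
    return {kmer: (pos_a[kmer], pos_b[kmer]) for kmer in pos_a.keys() & pos_b.keys()}


if __name__ == "__main__":
    # Hypothetical sequences, chosen only so that one pentapeptide is shared.
    print(shared_kmers("MKVLAAGIVGLT", "TTPVLAAGQW"))
```

    A hit from such a search says only that five residues coincide. As the abstract stresses, the secondary structure at the two positions still has to be read from the known three-dimensional structures, because the same pentapeptide can be helical in one protein and a beta-strand in another.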

  3. Properties and sequence of a female-specific, juvenile hormone-induced protein from locust hemolymph.

    PubMed

    Zhang, J; McCracken, A; Wyatt, G R

    1993-02-15

    In the fat body of Locusta migratoria, an RNA transcript of about 800 nucleotides has been detected that is specific to the adult female and dependent on induction by juvenile hormone (JH) or an analog. The corresponding cDNA has been cloned (lambda 21) and a 718-base pair sequence determined. It encodes a 196-amino acid polypeptide, including a signal peptide. An NH2-terminal sequence has 24 out of 28 amino acids identical with those of a previously described 19K locust hemolymph protein, but the remainder of the sequence shows no similarity. From adult female hemolymph, a 21-kDa protein, designated 21K protein, has been purified, with an NH2-terminal sequence exactly matching that deduced from clone lambda 21. This 21K protein is found only in the adult female, is dependent on induction by JH, and is assumed to represent the product of the lambda 21 gene. It shows no immunochemical cross-reaction with locust 19K protein, apolipophorin III, nor with vitellogenin (Vg). Its isoelectric point is pH 5.4; it contains some carbohydrate. 21K protein is synthesized in adult female fat body, accumulates in hemolymph, and is taken up into the developing oocytes in parallel with Vg. In locusts deprived of JH with precocene, production of 21K protein and of lambda 21-hybridizing transcripts is induced by the JH analog, methoprene, in parallel with Vg and its mRNA. Because of its sex-, stage-, and JH-dependent regulation, coordinate with Vg, the 21K protein will be valuable for analysis of gene expression. PMID:7679110

  4. Sequence dependent interaction of hnRNP proteins with late adenoviral transcripts.

    PubMed Central

    van Eekelen, C; Ohlsson, R; Philipson, L; Mariman, E; van Beek, R; van Venrooij, W

    1982-01-01

    Irradiation with ultraviolet light was used to induce covalent linkage between hnRNA and its associated proteins in intact HeLa cells, late after infection with adenovirus type 2. Covalently linked hnRNA-protein complexes, containing polyadenylated adenoviral RNA, were isolated and their protein moiety characterized. Host 42,000 Mr hnRNP proteins proved to be the major proteins crosslinked to viral hnRNA. To investigate their possible involvement in RNA processing, the localization of these cross-linked polypeptides on adenoviral late transcripts was determined. Sequences of RNA around the attachment sites of the protein were isolated. After in vitro labeling they were hybridized to Southern blots of adeno DNA fragments. The hybridization patterns revealed that the 42,000 Mr polypeptides can be linked to adenoviral transcripts over the entire length of the RNA, corresponding to 16.2-91.5 m.u. of the viral genome. Fine mapping within the Hind III B region (16.8-31.5 m.u.) established, however, that the localization of the cross-linked polypeptides was not random in all parts of the transcript. Sequences around the third leader and the 3' part of the i-leader were overrepresented, whereas the regions encoding VA I and VA II RNA and the late region 1 mRNA bodies were underrepresented in the cross-linked RNA. Using genomic DNA fragments and a cDNA clone containing the tripartite leader it appeared that leader and intervening sequences were represented about equally in cross-linked RNA fragments. Although these results do not support the notion that introns or exons are specifically interacting with one RNP protein, they demonstrate that the 42,000 hnRNP proteins are non randomly positioned on the RNA sequence. PMID:6296766

  5. From protein sequence to dynamics and disorder with DynaMine

    NASA Astrophysics Data System (ADS)

    Cilia, Elisa; Pancsa, Rita; Tompa, Peter; Lenaerts, Tom; Vranken, Wim F.

    2013-11-01

    Protein function and dynamics are closely related; however, accurate dynamics information is difficult to obtain. Here based on a carefully assembled data set derived from experimental data for proteins in solution, we quantify backbone dynamics properties on the amino-acid level and develop DynaMine—a fast, high-quality predictor of protein backbone dynamics. DynaMine uses only protein sequence information as input and shows great potential in distinguishing regions of different structural organization, such as folded domains, disordered linkers, molten globules and pre-structured binding motifs of different sizes. It also identifies disordered regions within proteins with an accuracy comparable to the most sophisticated existing predictors, without depending on prior disorder knowledge or three-dimensional structural information. DynaMine provides molecular biologists with an important new method that grasps the dynamical characteristics of any protein of interest, as we show here for human p53 and E1A from human adenovirus 5.

  6. From protein sequence to dynamics and disorder with DynaMine.

    PubMed

    Cilia, Elisa; Pancsa, Rita; Tompa, Peter; Lenaerts, Tom; Vranken, Wim F

    2013-01-01

    Protein function and dynamics are closely related; however, accurate dynamics information is difficult to obtain. Here based on a carefully assembled data set derived from experimental data for proteins in solution, we quantify backbone dynamics properties on the amino-acid level and develop DynaMine--a fast, high-quality predictor of protein backbone dynamics. DynaMine uses only protein sequence information as input and shows great potential in distinguishing regions of different structural organization, such as folded domains, disordered linkers, molten globules and pre-structured binding motifs of different sizes. It also identifies disordered regions within proteins with an accuracy comparable to the most sophisticated existing predictors, without depending on prior disorder knowledge or three-dimensional structural information. DynaMine provides molecular biologists with an important new method that grasps the dynamical characteristics of any protein of interest, as we show here for human p53 and E1A from human adenovirus 5. PMID:24225580

  7. Structure-Templated Predictions of Novel Protein Interactions from Sequence Information

    PubMed Central

    Betel, Doron; Breitkreuz, Kevin E; Isserlin, Ruth; Dewar-Darch, Danielle; Tyers, Mike; Hogue, Christopher W. V

    2007-01-01

    The multitude of functions performed in the cell are largely controlled by a set of carefully orchestrated protein interactions often facilitated by specific binding of conserved domains in the interacting proteins. Interacting domains commonly exhibit distinct binding specificity to short and conserved recognition peptides called binding profiles. Although many conserved domains are known in nature, only a few have well-characterized binding profiles. Here, we describe a novel predictive method known as domain–motif interactions from structural topology (D-MIST) for elucidating the binding profiles of interacting domains. A set of domains and their corresponding binding profiles were derived from extant protein structures and protein interaction data and then used to predict novel protein interactions in yeast. A number of the predicted interactions were verified experimentally, including new interactions of the mitotic exit network, RNA polymerases, nucleotide metabolism enzymes, and the chaperone complex. These results demonstrate that new protein interactions can be predicted exclusively from sequence information. PMID:17892321

  8. Sequence-based prediction of protein-peptide binding sites using support vector machine.

    PubMed

    Taherzadeh, Ghazaleh; Yang, Yuedong; Zhang, Tuo; Liew, Alan Wee-Chung; Zhou, Yaoqi

    2016-05-15

    Protein-peptide interactions are essential for all cellular processes including DNA repair, replication, gene-expression, and metabolism. As most protein-peptide interactions are uncharacterized, it is cost effective to investigate them computationally as the first step. All existing approaches for predicting protein-peptide binding sites, however, are based on protein structures despite the fact that the structures for most proteins are not yet solved. This article proposes the first machine-learning method, called SPRINT, to make Sequence-based prediction of Protein-peptide Residue-level Interactions. SPRINT yields a robust and consistent performance for 10-fold cross validations and an independent test. The most important feature is evolution-generated sequence profiles. For the test set (1056 binding and non-binding residues), it yields a Matthews' Correlation Coefficient of 0.326 with a sensitivity of 64% and a specificity of 68%. This sequence-based technique is comparably or more accurate than structure-based methods for peptide-binding site prediction. SPRINT is available as an online server at: http://sparks-lab.org/. © 2016 Wiley Periodicals, Inc. PMID:26833816
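
    For readers unfamiliar with the three figures quoted above, the sketch below shows how sensitivity, specificity, and the Matthews correlation coefficient are computed from a binary confusion matrix. The function name and the example counts are assumptions made for illustration; the counts are invented and are not the confusion matrix of the SPRINT test set, which the abstract does not report.

```python
from math import sqrt


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> tuple:
    """Return (sensitivity, specificity, MCC) for a binary classifier.

    tp/fn count binding residues predicted correctly/incorrectly;
    tn/fp count non-binding residues predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal total is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


if __name__ == "__main__":
    # Hypothetical, perfectly balanced counts (100 residues per class).
    sens, spec, mcc = binary_metrics(tp=64, fp=32, tn=68, fn=36)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f} MCC={mcc:.3f}")
```

    Unlike sensitivity and specificity, the MCC depends on the ratio of binding to non-binding residues, so the same pair of rates gives a different coefficient on a differently balanced test set.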

  9. Rapid search for tertiary fragments reveals protein sequence-structure relationships.

    PubMed

    Zhou, Jianfu; Grigoryan, Gevorg

    2015-04-01

    Finding backbone substructures from the Protein Data Bank that match an arbitrary query structural motif, composed of multiple disjoint segments, is a problem of growing relevance in structure prediction and protein design. Although numerous protein structure search approaches have been proposed, methods that address this specific task without additional restrictions and on practical time scales are generally lacking. Here, we propose a solution, dubbed MASTER, that is both rapid, enabling searches over the Protein Data Bank in a matter of seconds, and provably correct, finding all matches below a user-specified root-mean-square deviation cutoff. We show that despite the potentially exponential time complexity of the problem, running times in practice are modest even for queries with many segments. The ability to explore naturally plausible structural and sequence variations around a given motif has the potential to synthesize its design principles in an automated manner; so we go on to illustrate the utility of MASTER to protein structural biology. We demonstrate its capacity to rapidly establish structure-sequence relationships, uncover the native designability landscapes of tertiary structural motifs, identify structural signatures of binding, and automatically rewire protein topologies. Given the broad utility of protein tertiary fragment searches, we hope that providing MASTER in an open-source format will enable novel advances in understanding, predicting, and designing protein structure. PMID:25420575

  10. Boosting heterologous protein production in transgenic dicotyledonous seeds using Phaseolus vulgaris regulatory sequences.

    PubMed

    De Jaeger, Geert; Scheffer, Stanley; Jacobs, Anni; Zambre, Mukund; Zobell, Oliver; Goossens, Alain; Depicker, Ann; Angenon, Geert

    2002-12-01

    Over the past decade, several high value proteins have been produced in different transgenic plant tissues such as leaves, tubers, and seeds. Despite recent advances, many heterologous proteins accumulate to low concentrations, and the optimization of expression cassettes to make in planta production and purification economically feasible remains critical. Here, the regulatory sequences of the seed storage protein gene arcelin 5-I (arc5-I) of common bean (Phaseolus vulgaris) were evaluated for producing heterologous proteins in dicotyledonous seeds. The murine single chain variable fragment (scFv) G4 (ref. 4) was chosen as model protein because of the current industrial interest in producing antibodies and derived fragments in crops. In transgenic Arabidopsis thaliana seed stocks, the scFv under control of the 35S promoter of the cauliflower mosaic virus (CaMV) accumulated to approximately 1% of total soluble protein (TSP). However, a set of seed storage promoter constructs boosted the scFv accumulation to exceptionally high concentrations, reaching no less than 36.5% of TSP in homozygous seeds. Even at these high concentrations, the scFv proteins had antigen-binding activity and affinity similar to those produced in Escherichia coli. The feasibility of heterologous protein production under control of arc5-I regulatory sequences was also demonstrated in Phaseolus acutifolius, a promising crop for large scale production. PMID:12415287

  11. Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity.

    PubMed

    Camproux, A C; Tufféry, P

    2005-08-01

    Understanding and predicting protein structures depend on the complexity and the accuracy of the models used to represent them. We have recently set up a Hidden Markov Model to optimally compress protein three-dimensional conformations into a one-dimensional series of letters of a structural alphabet. Such a model learns simultaneously the shape of representative structural letters describing the local conformation and the logic of their connections, i.e. the transition matrix between the letters. Here, we move one step further and report some evidence that such a model of protein local architecture also captures some accurate amino acid features. All the letters have specific and distinct amino acid distributions. Moreover, we show that words of amino acids can have significant propensities for some letters. Perspectives point towards the prediction of the series of letters describing the structure of a protein from its amino acid sequence. PMID:16040198

  12. DNA sequence-dependent mechanics and protein-assisted bending in repressor-mediated loop formation

    NASA Astrophysics Data System (ADS)

    Boedicker, James Q.; Garcia, Hernan G.; Johnson, Stephanie; Phillips, Rob

    2013-12-01

    As the chief informational molecule of life, DNA is subject to extensive physical manipulations. The energy required to deform double-helical DNA depends on sequence, and this mechanical code of DNA influences gene regulation, such as through nucleosome positioning. Here we examine the sequence-dependent flexibility of DNA in bacterial transcription factor-mediated looping, a context for which the role of sequence remains poorly understood. Using a suite of synthetic constructs repressed by the Lac repressor and two well-known sequences that show large flexibility differences in vitro, we make precise statistical mechanical predictions as to how DNA sequence influences loop formation and test these predictions using in vivo transcription and in vitro single-molecule assays. Surprisingly, sequence-dependent flexibility does not affect in vivo gene regulation. By theoretically and experimentally quantifying the relative contributions of sequence and the DNA-bending protein HU to DNA mechanical properties, we reveal that bending by HU dominates DNA mechanics and masks intrinsic sequence-dependent flexibility. Such a quantitative understanding of how mechanical regulatory information is encoded in the genome will be a key step towards a predictive understanding of gene regulation at single-base pair resolution.

  13. DNA sequence-dependent mechanics and protein-assisted bending in repressor-mediated loop formation

    PubMed Central

    Boedicker, James Q.; Garcia, Hernan G.; Johnson, Stephanie; Phillips, Rob

    2014-01-01

    As the chief informational molecule of life, DNA is subject to extensive physical manipulations. The energy required to deform double-helical DNA depends on sequence, and this mechanical code of DNA influences gene regulation, such as through nucleosome positioning. Here we examine the sequence-dependent flexibility of DNA in bacterial transcription factor-mediated looping, a context for which the role of sequence remains poorly understood. Using a suite of synthetic constructs repressed by the Lac repressor and two well-known sequences that show large flexibility differences in vitro, we make precise statistical mechanical predictions as to how DNA sequence influences loop formation and test these predictions using in vivo transcription and in vitro single-molecule assays. Surprisingly, sequence-dependent flexibility does not affect in vivo gene regulation. By theoretically and experimentally quantifying the relative contributions of sequence and the DNA-bending protein HU to DNA mechanical properties, we reveal that bending by HU dominates DNA mechanics and masks intrinsic sequence-dependent flexibility. Such a quantitative understanding of how mechanical regulatory information is encoded in the genome will be a key step towards a predictive understanding of gene regulation at single-base pair resolution. PMID:24231252

  14. Functional characterisation of novel enantioselective lipase TALipA from Trichosporon asahii MSR54: sequence comparison revealed new signature sequence AXSXG among yeast lipases.

    PubMed

    Kumari, Arti; Gupta, Rani

    2015-01-01

    A gene encoding lipase TALipA from Trichosporon asahii MSR54 was successfully isolated, cloned and expressed in Pichia pastoris X-33. It was purified to homogeneity by affinity chromatography with a 1.7-fold purification. SDS-PAGE revealed it as a monomeric 27-kDa protein. Sequence comparison showed that it has close affinity with bacterial and actinobacterial lipases. It has a unique oxyanion hole "GL" and a conserved pentapeptide AHSMG in which alanine is present instead of glycine, which is unique within the yeast lipase database. The temperature and pH optima for activity were 60 °C and pH 8.0, respectively. It is thermostable with a t1/2 of 68 min at 70 °C. It hydrolyzed p-np esters with better specificity on p-np palmitate, which was again confirmed during hydrolysis of a triacylglyceride mixture. The enzyme was found to be regioselective during hydrolysis of triolein. It exhibited enantiopreference during esterification of phenylethanol depending upon the solvent used. It was S-enantioselective in 1,4-dioxane and R-selective in isopropanol and hexane. It is a magnesium-activated metalloenzyme inhibited by 10-mM EDTA. It was stable towards most of the polar and non-polar solvents. PMID:25280633

  15. Antigenic and sequence diversity in gonococcal transferrin-binding protein A.

    PubMed

    Cornelissen, C N; Anderson, J E; Boulton, I C; Sparling, P F

    2000-08-01

    Neisseria gonorrhoeae is a gram-negative pathogen that is capable of satisfying its iron requirement with human iron-binding proteins such as transferrin and lactoferrin. Transferrin-iron utilization involves specific binding of human transferrin at the cell surface to what is believed to be a complex of two iron-regulated, transferrin-binding proteins, TbpA and TbpB. The genes encoding these proteins have been cloned and sequenced from a number of pathogenic, gram-negative bacteria. In the current study, we sequenced four additional tbpA genes from other N. gonorrhoeae strains to begin to assess the sequence diversity among gonococci. We compared these sequences to those from other pathogenic bacteria to identify conserved regions that might be important for the structure and function of these receptors. We generated polyclonal mouse sera against synthetic peptides deduced from the TbpA sequence from gonococcal strain FA19. Most of these synthetic peptides were predicted to correspond to surface-exposed regions of TbpA. We found that, while most reacted with denatured TbpA in Western blots, only one antipeptide serum reacted with native TbpA in the context of intact gonococci, consistent with surface exposure of the peptide to which this serum was raised. In addition, we evaluated a panel of gonococcal strains for antigenic diversity using these antipeptide sera. PMID:10899879

  16. Remote access to ACNUC nucleotide and protein sequence databases at PBIL.

    PubMed

    Gouy, Manolo; Delmotte, Stéphane

    2008-04-01

    The ACNUC biological sequence database system provides powerful and fast query and extraction capabilities to a variety of nucleotide and protein sequence databases. The collection of ACNUC databases served by the Pôle Bio-Informatique Lyonnais includes the EMBL, GenBank, RefSeq and UniProt nucleotide and protein sequence databases and a series of other sequence databases that support comparative genomics analyses: HOVERGEN and HOGENOM containing families of homologous protein-coding genes from vertebrate and prokaryotic genomes, respectively; Ensembl and Genome Reviews for analyses of prokaryotic and of selected eukaryotic genomes. This report describes the main features of the ACNUC system and the access to ACNUC databases from any internet-connected computer. Such access was made possible by the definition of a remote ACNUC access protocol and the implementation of Application Programming Interfaces between the C, Python and R languages and this communication protocol. Two retrieval programs for ACNUC databases, Query_win, with a graphical user interface and raa_query, with a command line interface, are also described. Altogether, these bioinformatics tools provide users with either ready-to-use means of querying remote sequence databases through a variety of selection criteria, or a simple way to endow application programs with an extensive access to these databases. Remote access to ACNUC databases is open to all and fully documented (http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html). PMID:17825976

  17. Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties

    PubMed Central

    Neuwald, Andrew F.; Altschul, Stephen F.

    2016-01-01

    We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a “top-down” strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins’ structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO’s superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/. PMID
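
    The sum-of-the-pairs score that this abstract contrasts with GISMO's Bayesian measure can be sketched as below. This is only an illustration, not code from GISMO or from any of the MSA programs named above. The function name `sum_of_pairs` and the flat match, mismatch and gap scores are assumptions; real programs typically use a substitution matrix and affine gap penalties.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of an MSA given as equal-length strings ('-' marks a gap).

    The flat scores are illustrative assumptions, not values from any real program.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        # Score every unordered pair of rows in this column.
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # gap-gap pairs contribute nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total
```

    Every column receives a score whatever residues it contains, so this measure has no built-in notion of whether aligning a region is statistically justified.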

  18. PROTEINS FROM EIGHT EUKARYOTIC CYTOCHROME P-450 FAMILIES SHARE A SEGMENTED REGION OF SEQUENCE SIMILARITY

    EPA Science Inventory

    Proteins from eight eukaryotic families in the cytochrome P-450 superfamily share one region of sequence similarity. This region begins 275-310 amino acids from the amino terminus of each P-450, continues for 170 residues, and ends 35-50 amino acids before the carboxyl terminus. h...

  19. Nucleotide sequence of ompV, the gene for a major Vibrio cholerae outer membrane protein.

    PubMed

    Pohlner, J; Meyer, T F; Jalajakumari, M B; Manning, P A

    1986-12-01

    The nucleotide sequence of the ompV gene of Vibrio cholerae was determined. The product of the gene is a 28,000 dalton protein which, after the removal of a 19 amino acid signal sequence, produces a mature outer membrane protein of 26,000 daltons. The cleavage site was determined by amino-terminal amino acid sequencing of the purified mature protein. The DNA upstream of the gene shows the presence of a typical promoter region as judged from the Escherichia coli consensus information; however, the Shine-Dalgarno sequence is associated with a region capable of forming a secondary structure in the mRNA. The formation of this structure would inhibit binding of the mRNA to the ribosome and reduce translation. It is proposed that this structure is recognized by a positive activator in V. cholerae and that, because of its absence in E. coli, ompV is poorly expressed. The distribution of rare codons within ompV suggests that they may serve to slow down the translation of particular domains such that the nascent polypeptide has an opportunity to take up its conformation without interference from the later formed regions. Such a mechanism could aid localization of the protein if export were by a cotranslational secretion system. PMID:3031428

  20. Draft Genome Sequences of 18 Oral Streptococcus Strains That Encode Amylase-Binding Proteins

    PubMed Central

    Sabharwal, Amarpreet; Liao, Yu-Chieh; Lin, Hsin-Hung; Haase, Elaine M.

    2015-01-01

    A number of commensal oral streptococcal species produce a heterogeneous group of proteins that mediate binding of salivary α-amylase. This interaction likely influences streptococcal colonization of the oral cavity. Here, we present draft genome sequences of several strains of oral streptococcal species that bind human salivary amylase. PMID:25999552

  1. Draft genome sequences of 18 oral streptococcus strains that encode amylase-binding proteins.

    PubMed

    Sabharwal, Amarpreet; Liao, Yu-Chieh; Lin, Hsin-Hung; Haase, Elaine M; Scannapieco, Frank A

    2015-01-01

    A number of commensal oral streptococcal species produce a heterogeneous group of proteins that mediate binding of salivary α-amylase. This interaction likely influences streptococcal colonization of the oral cavity. Here, we present draft genome sequences of several strains of oral streptococcal species that bind human salivary amylase. PMID:25999552

  2. Protein identities - Graphocephala atropunctata expressed sequenced tags: expanding leafhopper vector biology

    Technology Transfer Automated Retrieval System (TEKTRAN)

    A small heat shock protein was isolated and sequenced from the Blue-green sharpshooter, BGSS, Graphocephala atropunctata (Signoret) (Hemiptera: Cicadellidae). The BGSS has been the native vector of Pierce’s disease in vineyards in California for nearly a century. The importance of this vector spec...

  3. Comment on "Protein sequences from mastodon and Tyrannosaurus rex revealed by mass spectrometry".

    PubMed

    Pevzner, Pavel A; Kim, Sangtae; Ng, Julio

    2008-08-22

    Asara et al. (Reports, 13 April 2007, p. 280) reported sequencing of Tyrannosaurus rex proteins and used them to establish the evolutionary relationships between birds and dinosaurs. We argue that the reported T. rex peptides may represent statistical artifacts and call for complete data release to enable experimental and computational verification of their findings. PMID:18719266

  4. A Scalable Parallel Algorithm for Large-Scale Protein Sequence Homology Detection

    SciTech Connect

    Wu, Changjun; Kalyanaraman, Anantharaman; Cannon, William R.

    2010-09-13

    Protein sequence homology detection is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting homology between two protein sequences is computationally inexpensive, detecting pairwise homology at a large scale becomes prohibitive, requiring millions of CPU hours. Yet, there is currently no efficient method available to parallelize this kernel. In this paper, we present the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm that is suited for large-scale protein sequence data. Our method, called pGraph, is designed using a hierarchical multiple-master multiple-worker model, where the processor space is partitioned into subgroups and the hierarchy helps ensure that the workload is distributed in a load-balanced fashion despite the inherent irregularity that may originate in the input. Experimental evaluation demonstrates that our method scales linearly on all input sizes tested (up to 640K sequences) on a 1,024 node supercomputer. In addition to demonstrating strong scaling, we present an extensive study of the various components of the system and related parametric studies.

  5. A thermodynamic analysis of the sequence-specific binding of RNA by bacteriophage MS2 coat protein

    PubMed Central

    Johansson, Hans E.; Dertinger, Dagmar; LeCuyer, Karen A.; Behlen, Linda S.; Greef, Charles H.; Uhlenbeck, Olke C.

    1998-01-01

    Most mutations in the sequence of the RNA hairpin that specifically binds MS2 coat protein either reduce the binding affinity or have no effect. However, one RNA mutation, a uracil to cytosine change in the loop, has the unusual property of increasing the binding affinity to the protein by nearly 100-fold. Guided by the structure of the protein–RNA complex, we used a series of protein mutations and RNA modifications to evaluate the thermodynamic basis for the improved affinity: The tight binding of the cytosine mutation is due to (i) the amino group of the cytosine residue making an intra-RNA hydrogen bond that increases the propensity of the free RNA to adopt the structure seen in the complex and (ii) the increased affinity of hydrogen bonds between the protein and a phosphate two bases away from the cytosine residue. The data are in good agreement with a recent comparison of the cocrystal structures of the two complexes, where small differences in the two structures are seen at the thermodynamically important sites. PMID:9689065

  6. Assessing fluctuating evolutionary pressure in yeast and mammal evolutionary rate covariation using bioinformatics of meiotic protein genetic sequences

    NASA Astrophysics Data System (ADS)

    Dehipawala, Sunil; Nguyen, A.; Tremberger, G.; Cheung, E.; Holden, T.; Lieberman, D.; Cheung, T.

    2013-09-01

    The evolutionary rate co-variation in meiotic proteins has been reported for yeast and mammal using phylogenic branch lengths which assess retention, duplication and mutation. The bioinformatics of the corresponding DNA sequences could be classified as a diagram of fractal dimension and Shannon entropy. Results from biomedical gene research provide examples on the diagram methodology. The identification of adaptive selection using entropy marker and functional-structural diversity using fractal dimension would support a regression analysis where the coefficient of determination would serve as evolutionary pathway marker for DNA sequences and be an important component in the astrobiology community. Comparisons between biomedical genes such as EEF2 (elongation factor 2 human, mouse, etc.), WDR85 in epigenetics, HAR1 in human specificity, clinical trial targeted cancer gene CD47, SIRT6 in spermatogenesis, and HLA-C in mosquito bite immunology demonstrate the diagram classification methodology. Comparisons to the SEPT4-XIAP pair in stem cell apoptosis, testes-expressed taste genes TAS1R3-GNAT3 pair, and amyloid beta APLP1-APLP2 pair with the yeast-mammal DNA sequences for meiotic proteins RAD50-MRE11 pair and NCAPD2-ICK pair have accounted for the observed fluctuating evolutionary pressure systematically. Regression with high R-sq values or a triangular-like cluster pattern for concordant pairs in co-variation among the studied species could serve as evidence for the possible location of common ancestors in the entropy-fractal dimension diagram, consistent with an example of the human-chimp common ancestor study using the FOXP2 regulated genes reported in human fetal brain study. The Deinococcus radiodurans R1 Rad-A could be viewed as an outlier in the RAD50 diagram and also in the free energy versus fractal dimension regression Cook's distance, consistent with a non-Earth source for this radiation resistant bacterium.
Convergent and divergent fluctuating evolutionary

  7. PipTools: a computational toolkit to annotate and analyze pairwise comparisons of genomic sequences.

    PubMed

    Elnitski, Laura; Riemer, Cathy; Petrykowska, Hanna; Florea, Liliana; Schwartz, Scott; Miller, Webb; Hardison, Ross

    2002-12-01

    Sequence conservation between species is useful both for locating coding regions of genes and for identifying functional noncoding segments. Hence interspecies alignment of genomic sequences is an important computational technique. However, its utility is limited without extensive annotation. We describe a suite of software tools, PipTools, and related programs that facilitate the annotation of genes and putative regulatory elements in pairwise alignments. The alignment server PipMaker uses the output of these tools to display detailed information needed to interpret alignments. These programs are provided in a portable format for use on common desktop computers and both the toolkit and the PipMaker server can be found at our Web site (http://bio.cse.psu.edu/). We illustrate the utility of the toolkit using annotation of a pairwise comparison of the mouse MHC class II and class III regions with orthologous human sequences and subsequently identify conserved, noncoding sequences that are DNase I hypersensitive sites in chromatin of mouse cells. PMID:12504859

  8. In Silico Genome Comparison and Distribution Analysis of Simple Sequences Repeats in Cassava

    PubMed Central

    Vásquez, Andrea; López, Camilo

    2014-01-01

    We conducted an SSR density analysis in different cassava genomic regions. The information obtained was useful to establish comparisons between cassava's SSR genomic distribution and those of poplar, flax, and Jatropha. In general, cassava has a low SSR density (~50 SSRs/Mbp) and a high proportion of pentanucleotides (24.2 SSRs/Mbp). It was found that coding sequences have 15.5 SSRs/Mbp, introns have 82.3 SSRs/Mbp, 5′ UTRs have 196.1 SSRs/Mbp, and 3′ UTRs have 50.5 SSRs/Mbp. Through motif analysis of cassava's genome SSRs, the most abundant motif was AT/AT while in intron sequences and UTR regions it was AG/CT. In addition, in coding sequences the motif AAG/CTT was also found to occur most frequently; in fact, it is the third most used codon in cassava. Sequences containing SSRs were classified according to their functional annotation of Gene Ontology categories. The SSRs identified here may be a valuable addition for genetic mapping and future studies in phylogenetic analyses and genomic evolution. PMID:25374887
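
    A density figure such as "SSRs/Mbp" can be derived roughly as sketched below. The abstract does not state which detection tool or thresholds were used, so the function names `find_ssrs` and `ssr_density_per_mbp`, the restriction to perfect repeats, and the minimum-repeat threshold are all assumptions made for illustration.

```python
import re

def find_ssrs(seq, motif_len, min_repeats):
    """Return (start, motif, repeat_count) for perfect tandem repeats of one motif length.

    Thresholds are hypothetical; the abstract does not give the criteria used.
    """
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        if motif_len > 1 and len(set(motif)) == 1:
            continue  # e.g. "AAA" is a mononucleotide run, not a trinucleotide SSR
        hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return hits

def ssr_density_per_mbp(n_ssrs, seq_len_bp):
    """Number of SSRs per megabase pair of sequence."""
    return n_ssrs / (seq_len_bp / 1_000_000)
```

    For example, `find_ssrs(region, 2, 5)` would list dinucleotide repeats of at least five units in a hypothetical string `region`, and dividing the count by the region length in Mbp gives a density comparable in form to the figures quoted above.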

  9. Revealing divergent evolution, identifying circular permutations and detecting active-sites by protein structure comparison

    PubMed Central

    Chen, Luonan; Wu, Ling-Yun; Wang, Yong; Zhang, Shihua; Zhang, Xiang-Sun

    2006-01-01

    Background: Protein structure comparison is one of the most important problems in computational biology and plays a key role in protein structure prediction, fold family classification, motif finding, phylogenetic tree reconstruction and protein docking. Results: We propose a novel method to compare the protein structures in an accurate and efficient manner. Such a method can be used to not only reveal divergent evolution, but also identify circular permutations and further detect active-sites. Specifically, we define the structure alignment as a multi-objective optimization problem, i.e., maximizing the number of aligned atoms and minimizing their root mean square distance. By controlling a single distance-related parameter, theoretically we can obtain a variety of optimal alignments corresponding to different optimal matching patterns, i.e., from a large matching portion to a small matching portion. The number of variables in our algorithm increases with the number of atoms of protein pairs in almost a linear manner. In addition to solid theoretical background, numerical experiments demonstrated significant improvement of our approach over the existing methods in terms of quality and efficiency. In particular, we show that divergent evolution, circular permutations and active-sites (or structural motifs) can be identified by our method. The software SAMO is available upon request from the authors. Conclusion: A novel formulation is proposed to accurately align protein structures in the framework of multi-objective optimization, based on a sequence order-independent strategy. A fast and accurate algorithm based on the bipartite matching algorithm is developed by exploiting the special features. Convergence of computation is shown in experiments and is also theoretically proven. PMID:16948858
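
    The two competing objectives named in this abstract, the number of aligned atoms and their root mean square distance, can be evaluated for a given matching as sketched below. This is not SAMO: it assumes the two structures are already superposed and that candidate atom pairs are supplied, and the names `rmsd`, `alignment_objectives` and `cutoff` are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square distance over paired (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    sq = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
             for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

def alignment_objectives(pairs, cutoff):
    """Keep candidate pairs within `cutoff`; return (number aligned, RMSD of kept pairs).

    `cutoff` is an assumed stand-in for a distance-related parameter.
    """
    kept = [(a, b) for a, b in pairs if math.dist(a, b) <= cutoff]
    if not kept:
        return 0, 0.0
    return len(kept), rmsd([a for a, _ in kept], [b for _, b in kept])
```

    Raising the cutoff admits more pairs but tends to increase the RMSD, which illustrates the trade-off between a large and a small matching portion.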

  10. The SWISS-PROT protein sequence data bank and its new supplement TREMBL.

    PubMed Central

    Bairoch, A; Apweiler, R

    1996-01-01

    SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc), a minimal level of redundancy and a high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to seven additional databases; a variety of new documentation files; the creation of TREMBL, an unannotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except CDS already included in SWISS-PROT. PMID:8594581

  11. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998.

    PubMed Central

    Bairoch, A; Apweiler, R

    1998-01-01

    SWISS-PROT (http://www.expasy.ch/) is a curated protein sequence database which strives to provide a high level of annotations (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to two additional databases; a variety of new documentation files and improvements to TrEMBL, a computer annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT. PMID:9399796

  12. The SWISS-PROT protein sequence data bank and its supplement TrEMBL.

    PubMed Central

    Bairoch, A; Apweiler, R

    1997-01-01

    SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotations (such as the description of the function of a protein, structure of its domains, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to two additional databases; a variety of new documentation files and the creation of TrEMBL, a computer annotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT. PMID:9016499

  13. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999.

    PubMed Central

    Bairoch, A; Apweiler, R

    1999-01-01

    SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include: cross-references to additional databases; a variety of new documentation files and improvements to TrEMBL, a computer annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT. The URLs for SWISS-PROT on the WWW are: http://www.expasy.ch/sprot and http://www.ebi.ac.uk/sprot PMID:9847139

  14. Key for protein coding sequences identification: computer analysis of codon strategy.

    PubMed Central

    Rodier, F; Gabarro-Arpa, J; Ehrlich, R; Reiss, C

    1982-01-01

    The signal qualifying an AUG or GUG as an initiator in mRNAs processed by E. coli ribosomes is not found to be a systematic, literal homology sequence. In contrast, stability analysis reveals that initiators always occur within nucleic acid domains of low stability, for which a high A/U content is observed. Since no amino acid selection pressure can be detected at N-termini of the proteins, the A/U enrichment results from a biased usage of the code degeneracy. A computer analysis is presented which allows easy detection of the codon strategy. N-terminal codons carry rather systematically A or U in third position, which suggests a mechanism for translation initiation and helps to detect protein coding sequences in sequenced DNA. PMID:7038623
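
    The third-position A/U bias described here can be measured with a few lines of code. This is a sketch, not the authors' computer analysis; the function name `third_position_au_fraction` and the optional `n_codons` window are assumptions, and the reading frame is taken to start at the first base supplied.

```python
def third_position_au_fraction(seq, n_codons=None):
    """Fraction of codons whose third base is A or U, reading in frame from the first base.

    DNA input is accepted by treating T as U. `n_codons` optionally limits the
    count to the first codons (a hypothetical window for N-terminal analysis).
    """
    rna = seq.upper().replace("T", "U")
    # Split into complete codons, dropping any trailing partial codon.
    codons = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    if n_codons is not None:
        codons = codons[:n_codons]
    if not codons:
        raise ValueError("sequence contains no complete codon")
    return sum(codon[2] in "AU" for codon in codons) / len(codons)
```

    Comparing this fraction over the first few codons of an open reading frame with the value over the whole frame is one simple way to look for the enrichment the abstract reports.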

  15. A novel multitarget tracking algorithm for Myosin VI protein molecules on actin filaments in TIRFM sequences.

    PubMed

    Li, G; Sanchez, V; Nagaraj, P C S B; Khan, S; Rajpoot, N

    2015-12-01

    We propose a novel multitarget tracking framework for Myosin VI protein molecules in total internal reflection fluorescence microscopy sequences which integrates an extended Hungarian algorithm with an interacting multiple model filter. The extended Hungarian algorithm, which is a linear assignment problem based method, helps to solve measurement assignment and spot association problems commonly encountered when dealing with multiple targets, while a two-motion model interacting multiple model filter increases the tracking accuracy by modelling the nonlinear dynamics of Myosin VI protein molecules on actin filaments. The evaluation of our tracking framework is conducted on both real and synthetic total internal reflection fluorescence microscopy sequences. The results show that the framework achieves higher tracking accuracies compared to the state-of-the-art tracking methods, especially for sequences with high spot density. PMID:26259144
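
    The linear assignment problem that the Hungarian algorithm solves can be illustrated with the sketch below. It finds the minimum-cost matching by exhaustive search over permutations, which is not the Hungarian algorithm and not the authors' extended version; it only shows what is being optimized. The name `best_assignment`, the Euclidean cost, and the equal-count restriction are assumptions.

```python
from itertools import permutations
import math

def best_assignment(tracks, detections):
    """Exhaustively find the track-to-detection assignment with minimum total distance.

    Returns (perm, cost), where perm[i] is the detection index assigned to track i.
    Exhaustive search is only practical for a handful of spots.
    """
    if len(tracks) != len(detections):
        raise ValueError("this sketch assumes equal numbers of tracks and detections")
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(detections))):
        cost = sum(math.dist(tracks[i], detections[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost
```

    The Hungarian algorithm reaches the same optimum in polynomial time, which matters once spot density is high.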

  16. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts.

    PubMed

    Sun, Liang; Luo, Haitao; Bu, Dechao; Zhao, Guoguang; Yu, Kuntao; Zhang, Changhai; Liu, Yuanning; Chen, Runsheng; Zhao, Yi

    2013-09-01

    It is a challenge to classify protein-coding or non-coding transcripts, especially those re-constructed from high-throughput sequencing data of poorly annotated species. This study developed and evaluated a powerful signature tool, Coding-Non-Coding Index (CNCI), by profiling adjoining nucleotide triplets to effectively distinguish protein-coding and non-coding sequences independent of known annotations. CNCI is effective for classifying incomplete transcripts and sense-antisense pairs. The implementation of CNCI offered highly accurate classification of transcripts assembled from whole-transcriptome sequencing data in a cross-species manner, demonstrated gene evolutionary divergence between vertebrates and invertebrates, or between plants, and provided a long non-coding RNA catalog of orangutan. CNCI software is available at http://www.bioinfo.org/software/cnci. PMID:23892401
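
    One plausible reading of "profiling adjoining nucleotide triplets" is counting how often each pair of neighbouring triplets occurs in a reading frame. The sketch below does only that; it is not the CNCI implementation, and the function name `adjoining_triplet_counts` and the `frame` parameter are hypothetical.

```python
from collections import Counter

def adjoining_triplet_counts(seq, frame=0):
    """Count pairs of adjacent, non-overlapping nucleotide triplets in one reading frame."""
    s = seq.upper()
    # Complete triplets only, starting at the chosen frame offset.
    triplets = [s[i:i + 3] for i in range(frame, len(s) - 2, 3)]
    return Counter(zip(triplets, triplets[1:]))
```

    Frequency tables of this kind, built separately from known coding and non-coding sequences, could then be compared to score a new transcript.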

  17. Nonconsensus Protein Binding to Repetitive DNA Sequence Elements Significantly Affects Eukaryotic Genomes

    PubMed Central

    Barber-Zucker, Shiran; Gordân, Raluca; Lukatsky, David B.

    2015-01-01

    Recent genome-wide experiments in different eukaryotic genomes provide an unprecedented view of transcription factor (TF) binding locations and of nucleosome occupancy. These experiments revealed that a large fraction of TF binding events occur in regions where only a small number of specific TF binding sites (TFBSs) have been detected. Furthermore, in vitro protein-DNA binding measurements performed for hundreds of TFs indicate that TFs are bound with a wide range of affinities to different DNA sequences that lack known consensus motifs. These observations have thus challenged the classical picture of specific protein-DNA binding and strongly suggest the existence of additional recognition mechanisms that affect protein-DNA binding preferences. We have previously demonstrated that repetitive DNA sequence elements characterized by certain symmetries statistically affect protein-DNA binding preferences. We call this binding mechanism nonconsensus protein-DNA binding in order to emphasize the point that specific consensus TFBSs do not contribute to this effect. In this paper, using the simple statistical mechanics model developed previously, we calculate the nonconsensus protein-DNA binding free energy for the entire C. elegans and D. melanogaster genomes. Using the available chromatin immunoprecipitation followed by sequencing (ChIP-seq) results on TF-DNA binding preferences for ~100 TFs, we show that DNA sequences characterized by low predicted free energy of nonconsensus binding have statistically higher experimental TF occupancy and lower nucleosome occupancy than sequences characterized by high free energy of nonconsensus binding. This is in agreement with our previous analysis performed for the yeast genome. We suggest therefore that nonconsensus protein-DNA binding assists the formation of nucleosome-free regions, as TFs outcompete nucleosomes at genomic locations with enhanced nonconsensus binding.
In addition, here we perform a new, large-scale analysis using

  18. Sequence comparison of new prokaryotic and mitochondrial members of the polypeptide chain release factor family predicts a five-domain model for release factor structure.

    PubMed Central

    Pel, H J; Rep, M; Grivell, L A

    1992-01-01

    We have recently reported the cloning and sequencing of the gene for the mitochondrial release factor mRF-1. mRF-1 displays high sequence similarity to the bacterial release factors RF-1 and RF-2. A database search for proteins resembling these three factors revealed high similarities to two amino acid sequences deduced from unassigned genomic reading frames in Escherichia coli and Bacillus subtilis. The amino acid sequence derived from the Bacillus reading frame is 47% identical to E.coli and Salmonella typhimurium RF-2, strongly suggesting that it represents B.subtilis RF-2. Our comparison suggests that the expression of the B.subtilis gene is, like that of the E.coli and S. typhimurium RF-2 genes, autoregulated by a stop codon dependent +1 frameshift. A comparison of prokaryotic and mitochondrial release factor sequences, including the putative B.subtilis RF-2, leads us to propose a five-domain model for release factor structure. Possible functions of the various domains are discussed. PMID:1408743
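
    A figure such as "47% identical" is computed from a pairwise alignment. The abstract does not say which denominator was used, and several conventions exist; the sketch below assumes identity is taken over columns where neither sequence has a gap, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over aligned columns where neither sequence has a gap.

    The choice of denominator (ungapped columns) is an assumption; other
    conventions divide by alignment length or by the shorter sequence length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * sum(a == b for a, b in columns) / len(columns)
```

    Because the result depends on the denominator, identity values from different reports are comparable only when the same convention is used.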

  19. Sequence analyses and evolutionary relationships among the energy-coupling proteins Enzyme I and HPr of the bacterial phosphoenolpyruvate: sugar phosphotransferase system.

    PubMed Central

    Reizer, J.; Hoischen, C.; Reizer, A.; Pham, T. N.; Saier, M. H.

    1993-01-01

    We have previously reported the overexpression, purification, and biochemical properties of the Bacillus subtilis Enzyme I of the phosphoenolpyruvate: sugar phosphotransferase system (PTS) (Reizer, J., et al., 1992, J. Biol. Chem. 267, 9158-9169). We now report the sequencing of the ptsI gene of B. subtilis encoding Enzyme I (570 amino acids and 63,076 Da). Putative transcriptional regulatory signals are identified, and the pts operon is shown to be subject to carbon source-dependent regulation. Multiple alignments of the B. subtilis Enzyme I with (1) six other sequenced Enzymes I of the PTS from various bacterial species, (2) phosphoenolpyruvate synthase of Escherichia coli, and (3) bacterial and plant pyruvate: phosphate dikinases (PPDKs) revealed regions of sequence similarity as well as divergence. Statistical analyses revealed that these three types of proteins comprise a homologous family, and the phylogenetic tree of the 11 sequenced protein members of this family was constructed. This tree was compared with that of the 12 sequenced HPr proteins or protein domains. Antibodies raised against the B. subtilis and E. coli Enzymes I exhibited immunological cross-reactivity with each other as well as with PPDK of Bacteroides symbiosus, providing support for the evolutionary relationships of these proteins suggested from the sequence comparisons. Putative flexible linkers tethering the N-terminal and the C-terminal domains of protein members of the Enzyme I family were identified, and their potential significance with regard to Enzyme I function is discussed. The codon choice pattern of the B. subtilis and E. coli ptsI and ptsH genes was found to exhibit a bias toward optimal codons in these organisms.(ABSTRACT TRUNCATED AT 250 WORDS) PMID:7686067

  20. Comparison of the bacterial HelA