Science.gov

Sample records for protein sequence comparison

  1. Protein sequence comparison and protein evolution

    SciTech Connect

    Pearson, W.R.

    1995-12-31

    This tutorial was one of eight tutorials selected to be presented at the Third International Conference on Intelligent Systems for Molecular Biology which was held in the United Kingdom from July 16 to 19, 1995. This tutorial examines how the information conserved during the evolution of a protein molecule can be used to reliably infer homology, and thus a shared protein fold and possibly a shared active site or function. The authors start by reviewing a geological/evolutionary time scale. Next they look at the evolution of several protein families. During the tutorial, these families will be used to demonstrate that homologous protein ancestry can be inferred with confidence. They also examine different modes of protein evolution and consider some hypotheses that have been presented to explain the very earliest events in protein evolution. The next part of the tutorial will examine the technical aspects of protein sequence comparison. Both optimal and heuristic algorithms and their associated parameters that are used to characterize protein sequence similarities are discussed. Perhaps more importantly, they survey the statistics of local similarity scores, and how these statistics can be used both to improve the selectivity of a search and to evaluate the significance of a match. They then examine distantly related members of three protein families, the serine proteases, the glutathione transferases, and the G-protein-coupled receptors (GCRs). Finally, they discuss how sequence similarity can be used to examine internal repeated or mosaic structures in proteins.
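
    The local similarity scores mentioned in this abstract can be illustrated with a minimal sketch of the standard Smith-Waterman recurrence, one example of an optimal local alignment algorithm. The function name and the scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration and are not taken from the tutorial; practical searches typically use substitution matrices and affine gap penalties.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b.

    Scoring values are illustrative assumptions, not the tutorial's parameters.
    Only two rows of the dynamic-programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

    Under these assumed scores, two sequences that share only a short common core receive the score of that core alone, because flanking mismatches are excluded by the zero floor.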

  2. Computational methods for protein sequence comparison and search.

    PubMed

    Xu, Dong

    2009-04-01

    Protein sequence comparison and search has become commonplace not only for bioinformatics researchers but also for experimentalists in many cases. Because of the exponential growth in sequence data, sequence comparison in particular has become an increasingly important tool. Relating a new gene sequence to other known sequences often reveals its function, structure, and evolution. Many sequence comparison and search tools are available through public Web servers, and biologists can use them easily with little knowledge of computers or bioinformatics. This unit provides some theoretical background and describes popular tools for dot plot, sequence search against a database, multiple sequence alignments, protein tree construction, and protein family and motif search. Step-by-step examples are provided to illustrate how to use some of the most well-known tools. Finally, some general advice is given on combining different sequence analysis tools for biological inference.

  3. Protein sequence comparison based on K-string dictionary.

    PubMed

    Yu, Chenglong; He, Rong L; Yau, Stephen S-T

    2013-10-25

    The current K-string-based protein sequence comparisons require large amounts of computer memory because the dimension of the protein vector representation grows exponentially with K. In this paper, we propose a novel concept, the "K-string dictionary", to solve this high-dimensional problem. It allows us to use a much lower dimensional K-string-based frequency or probability vector to represent a protein, and thus significantly reduce the computer memory requirements for their implementation. Furthermore, based on this new concept, we use Singular Value Decomposition to analyze real protein datasets, and the improved protein vector representation allows us to obtain accurate gene trees.
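
    As a rough illustration of a K-string frequency vector, the sketch below counts overlapping K-strings and stores only those that actually occur in the sequence. The function name is hypothetical, and this sparse mapping is a generic device, not the paper's "K-string dictionary" construction, whose details are not given in the abstract.

```python
from collections import Counter


def kstring_frequencies(seq: str, k: int) -> dict:
    """Map each observed K-string in seq to its relative frequency.

    Only K-strings that occur are stored, so memory grows with the sequence
    length rather than with 20**k. Illustrative sketch only.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    total = len(seq) - k + 1  # number of overlapping windows
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kstring: n / total for kstring, n in counts.items()}
```

    For example, kstring_frequencies("AAAB", 2) yields two entries, "AA" and "AB", with frequencies 2/3 and 1/3.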

  4. Molecular evolution of herpesviruses: genomic and protein sequence comparisons.

    PubMed Central

    Karlin, S; Mocarski, E S; Schachtel, G A

    1994-01-01

    Phylogenetic reconstruction of herpesvirus evolution is generally founded on amino acid sequence comparisons of specific proteins. These are relevant to the evolution of the specific gene (or set of genes), but the resulting phylogeny may vary depending on the particular sequence chosen for analysis (or comparison). In the first part of this report, we compare 13 herpesvirus genomes by using a new multidimensional methodology based on distance measures and partial orderings of dinucleotide relative abundances. The sequences were analyzed with respect to (i) genomic compositional extremes; (ii) total distances within and between genomes; (iii) partial orderings among genomes relative to a set of sequence standards; (iv) concordance correlations of genome distances; and (v) consistency with the alpha-, beta-, gammaherpesvirus classification. Distance assessments within individual herpesvirus genomes show each to be quite homogeneous relative to the comparisons between genomes. The gammaherpesviruses, Epstein-Barr virus (EBV), herpesvirus saimiri, and bovine herpesvirus 4 are both diverse and separate from other herpesvirus classes, whereas alpha- and betaherpesviruses overlap. The analysis revealed that the most central genome (closest to a consensus herpesvirus genome and most individual herpesvirus sequences of different classes) is that of human herpesvirus 6, suggesting that this genome is closest to a progenitor herpesvirus. The shorter DNA distances among alphaherpesviruses support the hypothesis that the alpha class is of relatively recent ancestry. In our collection, equine herpesvirus 1 (EHV1) stands out as the most central alphaherpesvirus, suggesting it may approximate an ancestral alphaherpesvirus. Among all herpesviruses, the EBV genome is closest to human sequences. In the DNA partial orderings, the chicken sequence collection is invariably as close as or closer to all herpesvirus sequences than the human sequence collection is, which may imply that

  5. Two Dimensional Yau-Hausdorff Distance with Applications on Comparison of DNA and Protein Sequences

    PubMed Central

    Tian, Kun; Yang, Xiaoqian; Kong, Qin; Yin, Changchuan; He, Rong L.; Yau, Stephen S.-T.

    2015-01-01

    Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Despite many methods available for sequence comparison, few methods retain the information content of sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match of graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method can be used for measuring the similarity of DNA sequences based on two important tools: the Yau-Hausdorff distance and graphical representation of DNA sequences. The graphical representations of DNA sequences conserve all sequence information and the Yau-Hausdorff distance is mathematically proved as a true metric. Therefore, the proposed distance can precisely measure the similarity of DNA sequences. The phylogenetic analyses of DNA sequences by the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that Yau-Hausdorff distance is a natural metric for DNA and protein sequences with high level of stability. The approach can be also applied to similarity analysis of protein sequences by graphic representations, as well as general two dimensional shape matching. PMID:26384293
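
    For orientation, the sketch below computes the ordinary Hausdorff distance between two fixed sets of 2D points. It is a simplified stand-in: the Yau-Hausdorff distance described above additionally takes the minimum over all translations and rotations, which this code does not attempt, and the mapping from a sequence to a point set (the graphical representation) is assumed to have been done elsewhere. Function names are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets.

    Brute force, O(len(a) * len(b)); no minimisation over rigid motions.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

    Identical point sets give a distance of 0, and the value grows with the worst-case mismatch between the two curves.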

  6. 3D representations of amino acids—applications to protein sequence comparison and classification

    PubMed Central

    Li, Jie; Koehl, Patrice

    2014-01-01

    The amino acid sequence of a protein is the key to understanding its structure and ultimately its function in the cell. This paper addresses the fundamental issue of encoding amino acids in ways that the representation of such a protein sequence facilitates the decoding of its information content. We show that a feature-based representation in a three-dimensional (3D) space derived from amino acid substitution matrices provides an adequate representation that can be used for direct comparison of protein sequences based on geometry. We measure the performance of such a representation in the context of the protein structural fold prediction problem. We compare the results of classifying different sets of proteins belonging to distinct structural folds against classifications of the same proteins obtained from sequence alone or directly from structural information. We find that sequence alone performs poorly as a structure classifier. We show in contrast that the use of the three dimensional representation of the sequences significantly improves the classification accuracy. We conclude with a discussion of the current limitations of such a representation and with a description of potential improvements. PMID:25379143

  7. Thermal adaptation analyzed by comparison of protein sequences from mesophilic and extremely thermophilic Methanococcus species

    NASA Technical Reports Server (NTRS)

    Haney, P. J.; Badger, J. H.; Buldak, G. L.; Reich, C. I.; Woese, C. R.; Olsen, G. J.

    1999-01-01

    The genome sequence of the extremely thermophilic archaeon Methanococcus jannaschii provides a wealth of data on proteins from a thermophile. In this paper, sequences of 115 proteins from M. jannaschii are compared with their homologs from mesophilic Methanococcus species. Although the growth temperatures of the mesophiles are about 50 degrees C below that of M. jannaschii, their genomic G+C contents are nearly identical. The properties most correlated with the proteins of the thermophile include higher residue volume, higher residue hydrophobicity, more charged amino acids (especially Glu, Arg, and Lys), and fewer uncharged polar residues (Ser, Thr, Asn, and Gln). These are recurring themes, with all trends applying to 83-92% of the proteins for which complete sequences were available. Nearly all of the amino acid replacements most significantly correlated with the temperature change are the same relatively conservative changes observed in all proteins, but in the case of the mesophile/thermophile comparison there is a directional bias. We identify 26 specific pairs of amino acids with a statistically significant (P < 0.01) preferred direction of replacement.
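
    A directional bias in replacements between two amino acids can be checked, for example, with an exact two-sided binomial (sign) test on the counts in each direction. The abstract does not state which test the authors used, so the test choice, the function name, and any counts passed to it are assumptions for illustration only.

```python
from math import comb


def direction_bias_pvalue(forward: int, reverse: int) -> float:
    """Two-sided exact binomial p-value for forward vs. reverse counts.

    Null hypothesis: both replacement directions are equally likely (p = 0.5).
    Illustrative sketch; not necessarily the test used in the paper.
    """
    if forward < 0 or reverse < 0:
        raise ValueError("counts must be non-negative")
    n = forward + reverse
    k = min(forward, reverse)
    # Probability mass of the smaller tail, doubled by symmetry at p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** n)
```

    With hypothetical counts of 10 replacements in one direction and 0 in the other, the p-value is 2/1024, which would fall below the P < 0.01 threshold quoted above.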

  8. A statistical physics perspective on alignment-independent protein sequence comparison

    PubMed Central

    Chattopadhyay, Amit K.; Nasiev, Diar; Flower, Darren R.

    2015-01-01

    Motivation: Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-based similarity has performed poorly. Results: Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from ‘first passage probability distribution’ to summarize statistics of ensemble averaged amino acid propensity values. In this article, we introduce and elaborate this approach. Contact: d.r.flower@aston.ac.uk PMID:25810434

  9. Relationships amongst bluetongue viruses revealed by comparisons of capsid and outer coat protein nucleotide sequences.

    PubMed

    Gould, A R; Pritchard, L I

    1990-08-01

    Sequence data from the gene segments coding for the capsid protein, VP3, of all eight Australian bluetongue virus serotypes were compared. The high degree of nucleotide sequence homology for VP3 genes amongst BTV isolates from the same geographic region supported previous studies (Gould, 1987; 1988b, c; Gould et al., 1988b) and was proposed as a basis for "topotyping" a bluetongue virus isolate (Gould et al., 1989). The complete nucleotide sequences which coded for the VP2 outer coat proteins of South African BTV serotypes 1 and 3 (vaccine strains) were determined and compared to cognate gene sequences from North American and Australian BTVs. These VP2 comparisons demonstrated that BTVs of the same serotype, but from different geographical regions, were closely related at the nucleotide and amino acid levels. However, close inter-relationships were also demonstrated amongst other BTVs irrespective of serotype or geographic origin. These data enabled phylogenic relationships of the BTV serotypes to be analysed using VP2 nucleotide sequences as a determinant.

  10. Shotgun protein sequencing.

    SciTech Connect

    Faulon, Jean-Loup Michel; Heffelfinger, Grant S.

    2009-06-01

    A novel experimental and computational technique based on multiple enzymatic digestion of a protein or protein mixture that reconstructs protein sequences from sequences of overlapping peptides is described in this SAND report. This approach, analogous to shotgun sequencing of DNA, is to be used to sequence alternative spliced proteins, to identify post-translational modifications, and to sequence genetically engineered proteins.
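
    The idea of rebuilding a sequence from overlapping peptides can be sketched with a generic greedy overlap merge, the simplest strategy borrowed from DNA shotgun assembly. This is not the algorithm of the SAND report, whose details are not given in the abstract; the function names, the minimum-overlap default, and the peptides in the usage note are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(peptides, min_overlap: int = 3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the remaining fragments; a single element means the peptides
    were joined into one sequence. Greedy merging can be misled by repeats.
    """
    frags = list(dict.fromkeys(peptides))  # drop exact duplicates
    # Drop peptides fully contained in another peptide.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining pair overlaps enough
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags
```

    For the made-up peptides "MKTAYIAK", "YIAKQRQI" and "QRQISFVK", the function returns the single sequence "MKTAYIAKQRQISFVK".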

  11. Nucleotide sequence of dengue 2 RNA and comparison of the encoded proteins with those of other flaviviruses.

    PubMed

    Hahn, Y S; Galler, R; Hunkapiller, T; Dalrymple, J M; Strauss, J H; Strauss, E G

    1988-01-01

    We have determined the complete sequence of the RNA of dengue 2 virus (S1 candidate vaccine strain derived from the PR-159 isolate) with the exception of about 15 nucleotides at the 5' end. The genome organization is the same as that deduced earlier for other flaviviruses and the amino acid sequences of the encoded dengue 2 proteins show striking homology to those of other flaviviruses. The overall amino acid sequence similarity between dengue 2 and yellow fever virus is 44.7%, whereas that between dengue 2 and West Nile virus is 50.7%. These viruses represent three different serological subgroups of mosquito-borne flaviviruses. Comparison of the amino acid sequences shows that amino acid sequence homology is not uniformly distributed among the proteins; highest homology is found in some domains of nonstructural protein NS5 and lowest homology in the hydrophobic polypeptides ns2a and 2b. In general the structural proteins are less well conserved than the nonstructural proteins. Hydrophobicity profiles, however, are remarkably similar throughout the translated region. Comparison of the dengue 2 PR-159 sequence to partial sequence data from dengue 4 and another strain of dengue 2 virus reveals amino acid sequence homologies of about 64 and 96%, respectively, in the structural protein region. Thus as a general rule for flaviviruses examined to date, members of different serological subgroups demonstrate 50% or less amino acid sequence homology, members of the same subgroup average 65-75% homology, and strains of the same virus demonstrate greater than 95% amino acid sequence similarity.

  12. Zucchini yellow mosaic virus: biological properties, detection procedures and comparison of coat protein gene sequences.

    PubMed

    Coutts, B A; Kehoe, M A; Webster, C G; Wylie, S J; Jones, R A C

    2011-12-01

    Between 2006 and 2010, 5324 samples from at least 34 weed, two cultivated legume and 11 native species were collected from three cucurbit-growing areas in tropical or subtropical Western Australia. Two new alternative hosts of zucchini yellow mosaic virus (ZYMV) were identified, the Australian native cucurbit Cucumis maderaspatanus, and the naturalised legume species Rhyncosia minima. Low-level (0.7%) seed transmission of ZYMV was found in seedlings grown from seed collected from zucchini (Cucurbita pepo) fruit infected with isolate Cvn-1. Seed transmission was absent in >9500 pumpkin (C. maxima and C. moschata) seedlings from fruit infected with isolate Knx-1. Leaf samples from symptomatic cucurbit plants collected from fields in five cucurbit-growing areas in four Australian states were tested for the presence of ZYMV. When 42 complete coat protein (CP) nucleotide (nt) sequences from the new ZYMV isolates obtained were compared to those of 101 complete CP nt sequences from five other continents, phylogenetic analysis of the 143 ZYMV sequences revealed three distinct groups (A, B and C), with four subgroups in A (I-IV) and two in B (I-II). The new Australian sequences grouped according to collection location, fitting within A-I, A-II and B-II. The 16 new sequences from one isolated location in tropical northern Western Australia all grouped into subgroup B-II, which contained no other isolates. In contrast, the three sequences from the Northern Territory fitted into A-II with 94.6-99.0% nt identities with isolates from the United States, Iran, China and Japan. The 23 new sequences from the central west coast and two east coast locations all fitted into A-I, with 95.9-98.9% nt identities to sequences from Europe and Japan. These findings suggest that (i) there have been at least three separate ZYMV introductions into Australia and (ii) there are few changes to local isolate CP sequences following their establishment in remote growing areas. 
Isolates from A-I and B

  13. Establishing homologies in protein sequences

    NASA Technical Reports Server (NTRS)

    Dayhoff, M. O.; Barker, W. C.; Hunt, L. T.

    1983-01-01

    Computer-based statistical techniques used to determine homologies between proteins occurring in different species are reviewed. The technique is based on comparison of two protein sequences, either by relating all segments of a given length in one sequence to all segments of the second or by finding the best alignment of the two sequences. Approaches discussed include selection using printed tabulations, identification of very similar sequences, and computer searches of a database. The use of the SEARCH, RELATE, and ALIGN programs (Dayhoff, 1979) is explained; sample data are presented in graphs, diagrams, and tables and the construction of scoring matrices is considered.
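
    The first comparison strategy described here, relating every segment of a given length in one sequence to every segment of the other, can be sketched as follows. The sketch scores segment pairs by counting identical residues, which is a simplifying assumption; it is not the RELATE program and does not use a scoring matrix. Function names are hypothetical.

```python
def segment_scores(a: str, b: str, length: int) -> dict:
    """Score every pair of length-`length` segments by identical positions.

    Returns {(i, j): score}, where i and j are 0-based segment start offsets
    in a and b. Identity counting stands in for a real scoring matrix.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    scores = {}
    for i in range(len(a) - length + 1):
        for j in range(len(b) - length + 1):
            seg_a, seg_b = a[i:i + length], b[j:j + length]
            scores[(i, j)] = sum(x == y for x, y in zip(seg_a, seg_b))
    return scores


def best_segment_pair(a: str, b: str, length: int):
    """Return ((i, j), score) for the highest-scoring segment pair, or None."""
    scores = segment_scores(a, b, length)
    if not scores:
        return None
    return max(scores.items(), key=lambda item: item[1])
```

    The full table of segment scores, rather than only the best pair, is what a statistical assessment of relatedness would be based on.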

  14. Sequence comparison and phylogenetic analysis by the Maximum Likelihood method of ribosome-inactivating proteins from angiosperms.

    PubMed

    Di Maro, Antimo; Citores, Lucía; Russo, Rosita; Iglesias, Rosario; Ferreras, José Miguel

    2014-08-01

    Ribosome-inactivating proteins (RIPs) from angiosperms are rRNA N-glycosidases that have been proposed as defence proteins against viruses and fungi. They have been classified as type 1 RIPs, consisting of single-chain proteins, and type 2 RIPs, consisting of an A chain with RIP properties covalently linked to a B chain with lectin properties. In this work we have carried out a broad search of RIP sequence data banks from angiosperms in order to study their main structural characteristics and phylogenetic evolution. The comparison of the sequences revealed the presence, outside of the active site, of a novel structure that might be involved in the internal protein dynamics linked to enzyme catalysis. Also the B-chains presented another conserved structure that might function either supporting the beta-trefoil structure or in the communication between both sugar-binding sites. A systematic phylogenetic analysis of RIP sequences revealed that the most primitive type 1 RIPs were similar to that of the actual monocots (Poaceae and Asparagaceae). The primitive RIPs evolved to the dicot type 1 related RIPs (like those from Caryophyllales, Lamiales and Euphorbiales). The gene of a type 1 RIP related with the actual Euphorbiaceae type 1 RIPs fused with a double beta trefoil lectin gene similar to the actual Cucurbitaceae lectins to generate the type 2 RIPs and finally this gene underwent deletions rendering either type 1 RIPs (like those from Cucurbitaceae, Rosaceae and Iridaceae) or lectins without A chain (like those from Adoxaceae).

  15. Indigenous and introduced potyviruses of legumes and Passiflora spp. from Australia: biological properties and comparison of coat protein sequences

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Coat protein sequences of 33 Potyvirus isolates from legume and Passiflora spp. were sequenced to determine the identity of infecting viruses. Phylogenetic analysis of the sequences revealed the presence of seven distinct virus species....

  16. Sequence analysis of frog alpha B-crystallin cDNA: sequence homology and evolutionary comparison of alpha A, alpha B and heat shock proteins.

    PubMed

    Lu, S F; Pan, F M; Chiou, S H

    1995-11-22

    alpha-Crystallin is a major lens protein present in the lenses of all vertebrate species. Recent studies have revealed that bovine alpha-crystallins possess genuine chaperone activity similar to small heat-shock proteins. In order to facilitate the determination of the primary sequence of amphibian alpha B-crystallin, cDNA encoding alpha B subunit chain was amplified using a new "Rapid Amplification of cDNA Ends" (RACE) protocol of Polymerase Chain Reaction (PCR). PCR-amplified product corresponding to alpha B subunit was then subcloned into pUC18 vector and transformed into E. coli strain JM109. Plasmids purified from the positive clones were prepared for nucleotide sequencing by the automatic fluorescence-based dideoxynucleotide chain-termination method. Sequencing more than five clones containing DNA inserts coding for alpha B-crystallin subunit constructed only one complete full-length reading frame of 522 base pairs similar to that of alpha A subunit, covering a deduced protein sequence of 173 amino acids including the universal translation-initiating methionine. The frog alpha B crystallin shows 69, 66 and 56% whereas alpha A crystallin shows 83, 81 and 69% sequence similarity to the homologous chains of bovine, chicken and dogfish, respectively, revealing a more divergent structural relationship among these alpha B subunits as compared to alpha A subunits. Structural analysis and comparison of alpha A- and alpha B-crystallin subunits from eye lenses of different classes of vertebrates also shed some light on the evolutionary relatedness between alpha B/alpha A crystallins and the small heat-shock proteins.

  17. Balbiani ring DNA: sequence comparisons and evolutionary history of a family of hierarchically repetitive protein-coding genes.

    PubMed

    Pustell, J; Kafatos, F C; Wobus, U; Bäumlein, H

    1984-01-01

    All known types of Balbiani ring (BR) genes consist of multiple, tandemly arranged, ca. 180 to 300-bp repeat units that can be divided into a constant region and a subrepeat region. The latter region includes short tandem subrepeats (SRs). Comparison of all available BR sequences using computer methods has enabled us (a) to define more precisely the constant and subrepeat regions, (b) to infer the evolutionary relationships among the various types of BR repeats, (c) to derive a consensus approximation of an ancestral sequence from a small segment of which the highly diverse present-day SRs may have originated, and (d) to detect an underlying substructure in the constant region, evident in the consensus but not in the present-day sequences and possibly corresponding to an original 39-bp DNA segment from which the extant, giant BR sequences may have evolved. We discuss the processes of reduplication, diversification, and homogenization within the hierarchically repetitive BR sequences as examples of how a simple DNA element may evolve into a diverse family of large, protein-coding genes.

  19. Sequence Comparison and Phylogeny of Nucleotide Sequence of Coat Protein and Nucleic Acid Binding Protein of a Distinct Isolate of Shallot virus X from India.

    PubMed

    Majumder, S; Baranwal, V K

    2011-06-01

    Shallot virus X (ShVX), a type species in the genus Allexivirus of the family Alfaflexiviridae has been associated with shallot plants in India and other shallot growing countries like Russia, Germany, the Netherlands, and New Zealand. Coat protein (CP) and nucleic acid binding protein (NB) region of the virus was obtained by reverse transcriptase polymerase chain reaction from scale leaves of shallot bulbs. The partial cDNA contained two open reading frames encoding proteins of molecular weights of 28.66 and 14.18 kDa belonging to Flexi_CP super-family and viral NB super-family, respectively. The percent identity and phylogenetic analysis of amino acid sequences of CP and NB region of the virus associated with shallot indicated that it was a distinct isolate of ShVX. PMID:23637504

  20. Comparison of amino acid sequence of bovine coagulation Factor IX (Christmas Factor) with that of other vitamin K-dependent plasma proteins.

    PubMed

    Katayama, K; Ericsson, L H; Enfield, D L; Walsh, K A; Neurath, H; Davie, E W; Titani, K

    1979-10-01

    The amino acid sequence of bovine blood coagulation Factor IX (Christmas Factor) is presented and compared with the sequences of other vitamin K-dependent plasma proteins and pancreatic trypsinogen. The 416-residue sequence of Factor IX was determined largely by automated Edman degradation of two large segments, containing 181 and 235 residues, isolated after activating Factor IX with a protease from Russell's viper venom. Subfragments of the two segments were produced by enzymatic digestion and by chemical cleavage of methionyl, tryptophyl, and asparaginyl-glycyl bonds. Comparison of the amino acid sequences of Factor IX, Factor X, and Protein C demonstrates that they are homologous throughout. Their homology with prothrombin, however, is restricted to the amino-terminal region, which is rich in gamma-carboxyglutamic acid, and the carboxyl-terminal region, which represents the catalytic domain of these proteins and corresponds to that of pancreatic serine proteases.

  1. Protein Structure Comparison and Classification

    NASA Astrophysics Data System (ADS)

    Çamoǧlu, Orhan; Singh, Ambuj K.

    The success of genome projects has generated an enormous amount of sequence data. In order to realize the full value of the data, we need to understand its functional role and its evolutionary origin. Sequence comparison methods are incredibly valuable for this task. However, for sequences falling in the twilight zone (usually between 20 and 35% sequence similarity), we need to resort to structural alignment and comparison for a meaningful analysis. Such a structural approach can be used for classification of proteins, isolation of structural motifs, and discovery of drug targets.

  2. Sequence comparisons via algorithmic mutual information.

    PubMed

    Milosavljević, A

    1994-01-01

    One of the main problems in DNA and protein sequence comparisons is to decide whether observed similarity of two sequences should be explained by their relatedness or by mere presence of some shared internal structure, e.g., shared internal tandem repeats. The standard methods that are based on statistics or classical information theory can be used to discover either internal structure or mutual sequence similarity, but cannot take into account both. Consequently, currently used methods for sequence comparison employ "masking" techniques that simply eliminate sequences that exhibit internal repetitive structure prior to sequence comparisons. The "masking" approach precludes discovery of homologous sequences of moderate or low complexity, which abound at both DNA and protein levels. As a solution to this problem, we propose a general method that is based on algorithmic information theory and minimal length encoding. We show that algorithmic mutual information factors out the sequence similarity that is due to shared internal structure and thus enables discovery of truly related sequences. We extend the recently developed algorithmic significance method (Milosavljević & Jurka 1993) to show that significance depends exponentially on algorithmic mutual information.
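
    Algorithmic mutual information is not computable exactly, but a common rough proxy replaces the minimal encoding length with the output size of a general-purpose compressor: I(x; y) is approximated by C(x) + C(y) - C(xy). The sketch below uses zlib for C; this is a stand-in chosen for illustration and is not the encoding scheme of the paper. Function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def compression_mutual_information(x: str, y: str) -> int:
    """Estimate shared information as C(x) + C(y) - C(x + y), in bytes.

    zlib is an assumed stand-in for a minimal length encoding. Values are
    approximate and may be slightly off zero for unrelated sequences because
    of compressor overhead.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    return compressed_size(bx) + compressed_size(by) - compressed_size(bx + by)
```

    A sequence that is repetitive in itself already compresses well on its own, so its internal structure contributes little to this estimate; only what one sequence tells the compressor about the other is counted.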

  3. Indigenous and introduced potyviruses of legumes and Passiflora spp. from Australia: biological properties and comparison of coat protein nucleotide sequences.

    PubMed

    Coutts, Brenda A; Kehoe, Monica A; Webster, Craig G; Wylie, Stephen J; Jones, Roger A C

    2011-10-01

    Five Australian potyviruses, passion fruit woodiness virus (PWV), passiflora mosaic virus (PaMV), passiflora virus Y (PaVY), clitoria chlorosis virus (ClCV) and hardenbergia mosaic virus (HarMV), and two introduced potyviruses, bean common mosaic virus (BCMV) and cowpea aphid-borne mosaic virus (CAbMV), were detected in nine wild or cultivated Passiflora and legume species growing in tropical, subtropical or Mediterranean climatic regions of Western Australia. When ClCV (1), PaMV (1), PaVY (8) and PWV (5) isolates were inoculated to 15 plant species, PWV and two PaVY P. foetida isolates infected P. edulis and P. caerulea readily but legumes only occasionally. Another PaVY P. foetida isolate resembled five PaVY legume isolates in infecting legumes readily but not infecting P. edulis. PaMV resembled PaVY legume isolates in legumes but also infected P. edulis. ClCV did not infect P. edulis or P. caerulea and behaved differently from PaVY legume isolates and PaMV when inoculated to two legume species. When complete coat protein (CP) nucleotide (nt) sequences of 33 new isolates were compared with 41 others, PWV (8), HarMV (4), PaMV (1) and ClCV (1) were within a large group of Australian isolates, while PaVY (14), CAbMV (1) and BCMV (3) isolates were in three other groups. Variation among PWV and PaVY isolates was sufficient for division into four clades each (I-IV). A variable block of 56 amino acid residues at the N-terminal region of the CPs of PaMV and ClCV distinguished them from PWV. Comparison of PWV, PaMV and ClCV CP sequences showed that nt identities were both above and below the 76-77% potyvirus species threshold level. This research gives insights into invasion of new hosts by potyviruses at the natural vegetation and cultivated area interface, and illustrates the potential of indigenous viruses to emerge to infect introduced plants. PMID:21744001

  4. Protein sequence comparisons show that the 'pseudoproteases' encoded by poxviruses and certain retroviruses belong to the deoxyuridine triphosphatase family.

    PubMed Central

    McGeoch, D J

    1990-01-01

    Amino acid sequence comparisons show extensive similarities among the deoxyuridine triphosphatases (dUTPases) of Escherichia coli and of herpesviruses, and the 'protease-like' or 'pseudoprotease' sequences encoded by certain retroviruses in the oncovirus and lentivirus families and by poxviruses. These relationships suggest strongly that the 'pseudoproteases' actually are dUTPases, and have not arisen by duplication of an oncovirus protease gene as had been suggested. The herpesvirus dUTPase sequences differ from the others in that they are longer (about 370 residues, against around 140) and one conserved element ('Motif 3') is displaced relative to its position in the other sequences; a model involving internal duplication of the herpesvirus gene can account effectively for these observations. Sequences closely similar to Motif 3 are also found in phosphofructokinases, where they form part of the active site and fructose phosphate binding structure; thus these sequences may represent a class of structural element generally involved in phosphate transfer to and from glycosides. PMID:2165588

  5. Herpes simplex virus type 1 (HSV-1) strain HSZP host shutoff gene: nucleotide sequence and comparison with HSV-1 strains differing in early shutoff of host protein synthesis.

    PubMed

    Vojvodová, A; Matis, J; Kúdelová, M; Rajcáni, J

    1997-01-01

    The UL41 gene of the HSZP strain of herpes simplex virus type 1 (HSV-1) defective with respect to the early shutoff of host protein synthesis was sequenced and compared with the corresponding gene sequences of HSV-1 strains KOS and 17. In comparison with strain 17, nine mutations (base changes) were HSZP specific, five KOS specific and four were common to both strains. Nine mutations caused codon changes. Three of these mapped to the nonconserved regions and the others to the conserved regions of the functional map of UL41 gene. One KOS specific mutation mapped to the region responsible for the binding of the virion host shutoff (vhs) protein to the alpha-transinducing factor (VP16). The possible relationship between mutations and host shutoff function is discussed. The nucleotide sequence data of the UL41 gene of HSZP and KOS have been submitted to the GenBank nucleotide database and have been assigned the accession numbers Z72337 and Z72338.

  6. Comparison of the sequences and functions of Streptococcus equi M-like proteins SeM and SzPSe.

    PubMed Central

    Timoney, J F; Artiushin, S C; Boschwitz, J S

    1997-01-01

    Streptococcus equi (Streptococcus equi subsp. equi), a Lancefield group C streptococcus, causes strangles, a highly contagious purulent lymphadenitis and pharyngitis of members of the family Equidae. The antiphagocytic 58-kDa M-like protein SeM is a major virulence factor and protective antigen. The amino acid sequence and structure of SeM have been determined and compared to those of a second, 40-kDa M-like protein (SzPSe) of S. equi and to those of other streptococcal proteins. Both SeM and SzPSe are mainly alpha-helical fibrillar molecules with no homology other than that between their signal and membrane anchor sequences and are only distantly related to other streptococcal M and M-like proteins. The sequence of SzPSe indicates that it is an allele of SzP that encodes the variable protective M-like and typing antigens of S. zooepidemicus (S. equi subsp. zooepidemicus). SeM is opsonogenic for S. equi but not for the closely related S. zooepidemicus, whereas SzPSe is strongly opsonogenic for S. zooepidemicus but not for S. equi. Both proteins bind equine fibrinogen. SeM and SzPSe proteins from temporally and geographically separated isolates of S. equi are identical in size. The results taken together support previous evidence that S. equi is a clonal pathogen originating from an ancestral strain of S. zooepidemicus. We postulate that acquisition of SeM synthesis was a key element in the success of the clone because of its effect in enhancing resistance to phagocytosis and because protective immunity entails a requirement for SeM-specific antibody. PMID:9284125

  7. Comparisons of Ribosomal Protein Gene Promoters Indicate Superiority of Heterologous Regulatory Sequences for Expressing Transgenes in Phytophthora infestans

    PubMed Central

    Khachatoorian, Careen; Judelson, Howard S.

    2015-01-01

    Molecular genetics approaches in Phytophthora research can be hampered by the limited number of known constitutive promoters for expressing transgenes and the instability of transgene activity. We have therefore characterized genes encoding the cytoplasmic ribosomal proteins of Phytophthora and studied their suitability for expressing transgenes in P. infestans. Phytophthora spp. encode a standard complement of 79 cytoplasmic ribosomal proteins. Several genes are duplicated, and two appear to be pseudogenes. Half of the genes are expressed at similar levels during all stages of asexual development, and we discovered that the majority share a novel promoter motif named the PhRiboBox. This sequence is enriched in genes associated with transcription, translation, and DNA replication, including tRNA and rRNA biogenesis. Promoters from the three P. infestans genes encoding ribosomal proteins S9, L10, and L23 and their orthologs from P. capsici were tested for their ability to drive transgenes in stable transformants of P. infestans. Five of the six promoters yielded strong expression of a GUS reporter, but the stability of expression was higher using the P. capsici promoters. With the RPS9 and RPL10 promoters of P. infestans, about half of transformants stopped making GUS over two years of culture, while their P. capsici orthologs conferred stable expression. Since cross-talk between native and transgene loci may trigger gene silencing, we encourage the use of heterologous promoters in transformation studies. PMID:26716454

  8. Comparisons of Ribosomal Protein Gene Promoters Indicate Superiority of Heterologous Regulatory Sequences for Expressing Transgenes in Phytophthora infestans.

    PubMed

    Poidevin, Laetitia; Andreeva, Kalina; Khachatoorian, Careen; Judelson, Howard S

    2015-01-01

    Molecular genetics approaches in Phytophthora research can be hampered by the limited number of known constitutive promoters for expressing transgenes and the instability of transgene activity. We have therefore characterized genes encoding the cytoplasmic ribosomal proteins of Phytophthora and studied their suitability for expressing transgenes in P. infestans. Phytophthora spp. encode a standard complement of 79 cytoplasmic ribosomal proteins. Several genes are duplicated, and two appear to be pseudogenes. Half of the genes are expressed at similar levels during all stages of asexual development, and we discovered that the majority share a novel promoter motif named the PhRiboBox. This sequence is enriched in genes associated with transcription, translation, and DNA replication, including tRNA and rRNA biogenesis. Promoters from the three P. infestans genes encoding ribosomal proteins S9, L10, and L23 and their orthologs from P. capsici were tested for their ability to drive transgenes in stable transformants of P. infestans. Five of the six promoters yielded strong expression of a GUS reporter, but the stability of expression was higher using the P. capsici promoters. With the RPS9 and RPL10 promoters of P. infestans, about half of transformants stopped making GUS over two years of culture, while their P. capsici orthologs conferred stable expression. Since cross-talk between native and transgene loci may trigger gene silencing, we encourage the use of heterologous promoters in transformation studies. PMID:26716454

  9. Supercomputers and biological sequence comparison algorithms.

    PubMed

    Core, N G; Edmiston, E W; Saltz, J H; Smith, R M

    1989-12-01

    Comparison of biological (DNA or protein) sequences provides insight into molecular structure, function, and homology and is increasingly important as the available databases become larger and more numerous. One method of increasing the speed of the calculations is to perform them in parallel. We present the results of initial investigations using two dynamic programming algorithms on the Intel iPSC hypercube and the Connection Machine as well as an inexpensive, heuristically-based algorithm on the Encore Multimax.
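
    The abstract does not name the two dynamic programming algorithms. As an illustration only, the sketch below computes a Smith-Waterman local alignment score with a linear gap penalty; the scoring values are arbitrary assumptions, not parameters from the paper.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    Only two rows of the dynamic programming table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

    Each cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal can be computed independently. That is one common route to parallelising this recurrence; whether the authors used it is not stated in the abstract.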

  10. Mercury BLASTP: Accelerating Protein Sequence Alignment

    PubMed Central

    Jacob, Arpith; Lancaster, Joseph; Buhler, Jeremy; Harris, Brandon; Chamberlain, Roger D.

    2008-01-01

    Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more running time or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we describe the architecture of the portions of the application that are accelerated in the FPGA, and we also describe the integration of these FPGA-accelerated portions with the existing BLASTP software. We have implemented Mercury BLASTP on a commodity workstation with two Xilinx Virtex-II 6000 FPGAs. We show that the new design runs 11-15 times faster than software BLASTP on a modern CPU while delivering close to 99% identical results. PMID:19492068

  11. Rosetta stone method for detecting protein function and protein-protein interactions from genome sequences

    DOEpatents

    Eisenberg, David; Marcotte, Edward M.; Pellegrini, Matteo; Thompson, Michael J.; Yeates, Todd O.

    2002-10-15

    A computational method, system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B'. Another method compares the genomic sequences of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method a combination of the above two methods is used to predict functional links.
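
    The phylogenetic profile idea can be sketched as follows. Each protein is reduced to a presence/absence vector over a set of genomes, and proteins with matching vectors are reported as candidate functional links. The matching rule (Hamming distance at or below a cutoff), the function names and the toy data are assumptions for illustration and are not taken from the patent.

```python
def phylogenetic_profiles(presence):
    """Map each protein to a tuple of 0/1 flags, one per genome (sorted order)."""
    genomes = sorted({g for found_in in presence.values() for g in found_in})
    return {
        protein: tuple(int(g in found_in) for g in genomes)
        for protein, found_in in presence.items()
    }


def hamming(u, v):
    """Number of genomes in which two profiles disagree."""
    return sum(x != y for x, y in zip(u, v))


def linked_pairs(profiles, max_distance=0):
    """Protein pairs whose profiles differ in at most max_distance genomes."""
    names = sorted(profiles)
    return [
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if hamming(profiles[p], profiles[q]) <= max_distance
    ]


# Hypothetical presence data: protein -> genomes containing a homolog.
presence = {
    "protA": {"genome1", "genome3"},
    "protB": {"genome1", "genome3"},
    "protC": {"genome2"},
}
profiles = phylogenetic_profiles(presence)
pairs = linked_pairs(profiles)
```

    In this toy input protA and protB occur in exactly the same genomes and are paired, while protC is left unlinked.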

  12. Distinguishing proteins from arbitrary amino acid sequences.

    PubMed

    Yau, Stephen S-T; Mao, Wei-Guang; Benson, Max; He, Rong Lucy

    2015-01-01

    What kinds of amino acid sequences could possibly be protein sequences? From all existing databases that we can find, known proteins are only a small fraction of all possible combinations of amino acids. Beginning with Sanger's first detailed determination of a protein sequence in 1952, previous studies have focused on describing the structure of existing protein sequences in order to construct the protein universe. No one, however, has developed a criterion for determining whether an arbitrary amino acid sequence can be a protein. Here we show that when the collection of arbitrary amino acid sequences is viewed in an appropriate geometric context, the protein sequences cluster together. This leads to a new computational test, described here, that has proved to be remarkably accurate at determining whether an arbitrary amino acid sequence can be a protein. Even more, if the results of this test indicate that the sequence can be a protein, and it is indeed a protein sequence, then its identity as a protein sequence is uniquely defined. We anticipate our computational test will be useful for those who are attempting to complete the job of discovering all proteins, or constructing the protein universe. PMID:25609314

  13. Methods for comparing a DNA sequence with a protein sequence.

    PubMed

    Huang, X; Zhang, J

    1996-12-01

    We describe two methods for constructing an optimal global alignment of, and an optimal local alignment between, a DNA sequence and a protein sequence. The alignment model of the methods addresses the problems of frameshifts and introns in the DNA sequence. The methods require computer memory proportional to the sequence lengths, so they can rigorously process very long sequences. Simplified versions of the methods were implemented as computer programs named NAP and LAP. The experimental results demonstrate that the programs are sensitive and powerful tools for finding genes by DNA-protein sequence homology.

  14. Comparisons of coat protein gene sequences show that East African isolates of Sweet potato feathery mottle virus form a genetically distinct group.

    PubMed

    Kreuze, J F; Karyeija, R F; Gibson, R W; Valkonen, J P

    2000-01-01

    Sweet potato feathery mottle virus (SPFMV, genus Potyvirus) infects sweet potatoes (Ipomoea batatas) worldwide, but no sequence data on isolates from Africa are available. Coat protein (CP) gene sequences from eight East African isolates from Madagascar and different districts of Uganda (the second biggest sweet potato producer in the world) and two West African isolates from Nigeria and Niger were determined. They were compared by phylogenetic analysis with the previously reported sequences of ten SPFMV isolates from other continents. The East African SPFMV isolates formed a distinct cluster, whereas the other isolates were not clustered according to geographic origin. These data indicate that East African isolates of SPFMV form a genetically unique group.

  15. Method and apparatus for biological sequence comparison

    DOEpatents

    Marr, Thomas G.; Chang, William I-Wei

    1997-01-01

    A method and apparatus for comparing biological sequences from a known source of sequences, with a subject (query) sequence. The apparatus takes as input a set of target similarity levels (such as evolutionary distances in units of PAM), and finds all fragments of known sequences that are similar to the subject sequence at each target similarity level, and are long enough to be statistically significant. The invention device filters out fragments from the known sequences that are too short, or have a lower average similarity to the subject sequence than is required by each target similarity level. The subject sequence is then compared only to the remaining known sequences to find the best matches. The filtering member divides the subject sequence into overlapping blocks, each block being sufficiently large to contain a minimum-length alignment from a known sequence. For each block, the filter member compares the block with every possible short fragment in the known sequences and determines a best match for each comparison. The determined set of short fragment best matches for the block provide an upper threshold on alignment values. Regions of a certain length from the known sequences that have a mean alignment value upper threshold greater than a target unit score are concatenated to form a union. The current block is compared to the union and provides an indication of best local alignment with the subject sequence.
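
    The block-splitting step can be illustrated with a small sketch. It assumes blocks of twice the minimum alignment length that advance by one minimum length, which guarantees that every window of that length lies wholly inside some block. The patent abstract does not give the actual block size or overlap, so these choices and the function name are hypothetical.

```python
def overlapping_blocks(seq, min_len):
    """Split seq into blocks of length 2*min_len that start every min_len positions.

    Returns a list of (start, block) pairs; the last block may be shorter.
    """
    if min_len <= 0:
        raise ValueError("min_len must be positive")
    size = 2 * min_len
    last_start = max(len(seq) - min_len, 0)
    return [
        (start, seq[start:start + size])
        for start in range(0, last_start + 1, min_len)
    ]


# Hypothetical query sequence and minimum alignment length.
blocks = overlapping_blocks("MKTAYIAKQR", 3)
```

    With this layout a filter can examine each block independently without missing an alignment of the minimum length that straddles a block boundary.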

  16. Method and apparatus for biological sequence comparison

    DOEpatents

    Marr, T.G.; Chang, W.I.

    1997-12-23

    A method and apparatus are disclosed for comparing biological sequences from a known source of sequences, with a subject (query) sequence. The apparatus takes as input a set of target similarity levels (such as evolutionary distances in units of PAM), and finds all fragments of known sequences that are similar to the subject sequence at each target similarity level, and are long enough to be statistically significant. The invention device filters out fragments from the known sequences that are too short, or have a lower average similarity to the subject sequence than is required by each target similarity level. The subject sequence is then compared only to the remaining known sequences to find the best matches. The filtering member divides the subject sequence into overlapping blocks, each block being sufficiently large to contain a minimum-length alignment from a known sequence. For each block, the filter member compares the block with every possible short fragment in the known sequences and determines a best match for each comparison. The determined set of short fragment best matches for the block provide an upper threshold on alignment values. Regions of a certain length from the known sequences that have a mean alignment value upper threshold greater than a target unit score are concatenated to form a union. The current block is compared to the union and provides an indication of best local alignment with the subject sequence. 5 figs.

  17. A new graphical representation of protein sequences and its applications

    NASA Astrophysics Data System (ADS)

    Hou, Wenbing; Pan, Qiuhui; He, Mingfeng

    2016-02-01

    Sequence analysis is one of the foundations of bioinformatics because of the abundant information hidden in sequences, and it helps scientists study the function of DNA, proteins and cells. In this paper, we outline a novel method for protein sequence similarity analysis based on the physical-chemical properties of amino acids. We consider the protein sequence as a rigid body with mass. We then introduce the moment of inertia into the calculation of sequence similarity, and the sequences are transformed into vectors by the moment of inertia tensor. The Euclidean distance is employed as a measure of similarity. Finally, comparison with results from other references shows that our approach is reasonable and effective.

  18. Protein folds and families: sequence and structure alignments.

    PubMed

    Holm, L; Sander, C

    1999-01-01

    Dali and HSSP are derived databases organizing protein space in the structurally known regions. We use an automatic structure alignment program (Dali) for the classification of all known 3D structures based on all-against-all comparison of 3D structures in the Protein Data Bank. The HSSP database associates 1D sequences with known 3D structures using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). As a result, the HSSP database not only provides aligned sequence families, but also implies secondary and tertiary structures covering 36% of all sequences in Swiss-Prot. The structure classification by Dali and the sequence families in HSSP can be browsed jointly from a web interface providing a rich network of links between neighbours in fold space, between domains and proteins, and between structures and sequences. In particular, this results in a database of explicit multiple alignments of protein families in the twilight zone of sequence similarity. The organization of protein structures and families provides a map of the currently known regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The databases are available from http://www.embl-ebi.ac.uk/dali/

  19. Comparison of the complete sequences of three different isolates of Pepino mosaic virus: size variability of the TGBp3 protein between tomato and L. peruvianum isolates.

    PubMed

    López, C; Soler, S; Nuez, F

    2005-03-01

    The complete nucleotide sequences of the genomes of two Spanish isolates (LE-2000 and LE-2002) from tomato and one Peruvian isolate (LP-2001) from Lycopersicon peruvianum of the Pepino mosaic virus (PepMV) were determined. The tomato isolates share identities higher than 99%, while the genome of LP-2001 had mean nucleotide identities of 95.6% to 96.0% with tomato isolates. The predicted amino acid sequences showed similarities ranging between 95.2% and 100% with TGBp3 and TGBp2 and CP proteins, respectively. In LP-2001 two main differences were found with respect to the tomato isolates: (i) the 5' untranslated region (UTR) was 2 nt shorter by deletion at position 12-13 and it had some polymorphisms at the putative promoter sequence reported for PepMV tomato isolates and other potexviruses, which could be functionally significant for RNA replication, and (ii) the TGBp3 protein had two extra amino acids in the C-terminal region.

  20. Comparison of metagenomic samples using sequence signatures

    PubMed Central

    2012-01-01

    Background: Sequence signatures, as defined by the frequencies of k-tuples (or k-mers, k-grams), have been used extensively to compare genomic sequences of individual organisms, to identify cis-regulatory modules, and to study the evolution of regulatory sequences. Recently many next-generation sequencing (NGS) read data sets of metagenomic samples from a variety of different environments have been generated. The assembly of these reads can be difficult and analysis methods based on mapping reads to genes or pathways are also restricted by the availability and completeness of existing databases. Sequence-signature-based methods, however, do not need the complete genomes or existing databases and thus, can potentially be very useful for the comparison of metagenomic samples using NGS read data. Still, the applications of sequence signature methods for the comparison of metagenomic samples have not been well studied. Results: We studied several dissimilarity measures, including d2, d2* and d2S recently developed from our group, a measure (hereinafter noted as Hao) used in CVTree developed from Hao’s group (Qi et al., 2004), measures based on relative di-, tri-, and tetra-nucleotide frequencies as in Willner et al. (2009), as well as standard lp measures between the frequency vectors, for the comparison of metagenomic samples using sequence signatures. We compared their performance using a series of extensive simulations and three real next-generation sequencing (NGS) metagenomic datasets: 39 fecal samples from 33 mammalian host species, 56 marine samples across the world, and 13 fecal samples from human individuals. Results showed that the dissimilarity measure d2S can achieve superior performance when comparing metagenomic samples by clustering them into different groups as well as recovering environmental gradients affecting microbial samples. New insights into the environmental factors affecting microbial compositions in metagenomic samples are obtained through

  1. The Shannon information entropy of protein sequences.

    PubMed Central

    Strait, B J; Dewey, T G

    1996-01-01

    A comprehensive data base is analyzed to determine the Shannon information content of a protein sequence. This information entropy is estimated by three methods: a k-tuplet analysis, a generalized Zipf analysis, and a "Chou-Fasman gambler." The k-tuplet analysis is a "letter" analysis, based on conditional sequence probabilities. The generalized Zipf analysis demonstrates the statistical linguistic qualities of protein sequences and uses the "word" frequency to determine the Shannon entropy. The Zipf analysis and k-tuplet analysis give Shannon entropies of approximately 2.5 bits/amino acid. This entropy is much smaller than the value of 4.18 bits/amino acid obtained from the nonuniform composition of amino acids in proteins. The "Chou-Fasman" gambler is an algorithm based on the Chou-Fasman rules for protein structure. It uses both sequence and secondary structure information to guess at the number of possible amino acids that could appropriately substitute into a sequence. As in the case for the English language, the gambler algorithm gives significantly lower entropies than the k-tuplet analysis. Using these entropies, the number of most probable protein sequences can be calculated. The number of most probable protein sequences is much less than the number of possible sequences but is still much larger than the number of sequences thought to have existed throughout evolution. Implications of these results for mutagenesis experiments are discussed. PMID:8804598
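
    The composition-based figure can be made concrete with a minimal sketch that computes the zeroth-order Shannon entropy of a sequence from its residue frequencies. This covers only the composition estimate; the k-tuplet, Zipf and Chou-Fasman gambler analyses in the paper are not reproduced here, and the function name is hypothetical.

```python
import math
from collections import Counter


def composition_entropy(seq):
    """Shannon entropy in bits per residue, from single-residue frequencies."""
    counts = Counter(seq)
    n = len(seq)
    # H = -sum(p * log2(p)) over the residues that actually occur.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# One copy of each of the 20 standard amino acids gives the uniform upper bound.
uniform_bound = composition_entropy("ACDEFGHIKLMNPQRSTVWY")
```

    For a uniform 20-letter alphabet the entropy is log2(20), about 4.32 bits per residue, so the 4.18 bits quoted for the observed composition sits just below that bound, and the roughly 2.5 bits from the k-tuplet and Zipf analyses reflects additional dependence between neighbouring residues.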

  2. The PIR-International Protein Sequence Database.

    PubMed

    Barker, W C; Garavelli, J S; McGarvey, P B; Marzec, C R; Orcutt, B C; Srinivasarao, G Y; Yeh, L S; Ledley, R S; Mewes, H W; Pfeiffer, F; Tsugita, A; Wu, C

    1999-01-01

    The Protein Information Resource (PIR; http://www-nbrf.georgetown.edu/pir/) supports research on molecular evolution, functional genomics, and computational biology by maintaining a comprehensive, non-redundant, well-organized and freely available protein sequence database. Since 1988 the database has been maintained collaboratively by PIR-International, an international association of data collection centers cooperating to develop this resource during a period of explosive growth in new sequence data and new computer technologies. The PIR Protein Sequence Database entries are classified into superfamilies, families and homology domains, for which sequence alignments are available. Full-scale family classification supports comparative genomics research, aids sequence annotation, assists database organization and improves database integrity. The PIR WWW server supports direct on-line sequence similarity searches, information retrieval, and knowledge discovery by providing the Protein Sequence Database and other supplementary databases. Sequence entries are extensively cross-referenced and hypertext-linked to major nucleic acid, literature, genome, structure, sequence alignment and family databases. The weekly release of the Protein Sequence Database can be accessed through the PIR Web site. The quarterly release of the database is freely available from our anonymous FTP server and is also available on CD-ROM with the accompanying ATLAS database search program.

  3. Protein Sequencing with Tandem Mass Spectrometry

    NASA Astrophysics Data System (ADS)

    Ziady, Assem G.; Kinter, Michael

    The recent introduction of electrospray ionization techniques that are suitable for peptides and whole proteins has allowed for the design of mass spectrometric protocols that provide accurate sequence information for proteins. The advantages gained by these approaches over traditional Edman Degradation sequencing include faster analysis and femtomole, sometimes attomole, sensitivity. The ability to efficiently identify proteins has allowed investigators to conduct studies on their differential expression or modification in response to various treatments or disease states. In this chapter, we discuss the use of electrospray tandem mass spectrometry, a technique whereby protein-derived peptides are subjected to fragmentation in the gas phase, revealing sequence information for the protein. This powerful technique has been instrumental for the study of proteins and markers associated with various disorders, including heart disease, cancer, and cystic fibrosis. We use the study of protein expression in cystic fibrosis as an example.

  4. Protein sequence analysis using Hewlett-Packard biphasic sequencing cartridges in an applied biosystems 473A protein sequencer.

    PubMed

    Tang, S; Mozdzanowski, J; Anumula, K R

    1999-01-01

    Protein sequence analysis using an adsorptive biphasic sequencing cartridge, a set of two coupled columns introduced by Hewlett-Packard for protein sequencing by Edman degradation, in an Applied Biosystems 473A protein sequencer has been demonstrated. Samples containing salts, detergents, excipients, etc. (e.g., formulated protein drugs) can be easily analyzed using the ABI sequencer. Simple modifications to the ABI sequencer to accommodate the cartridge extend its utility in the analysis of difficult samples. The ABI sequencer solvents and reagents were compatible with the HP cartridge for sequencing. Sequence information up to ten residues can be easily generated by this nonoptimized procedure, and it is sufficient for identifying proteins by database search and for preparing a DNA probe for cloning novel proteins.

  5. Variable region sequences of murine IgM anti-IgG monoclonal autoantibodies (rheumatoid factors). II. Comparison of hybridomas derived by lipopolysaccharide stimulation and secondary protein immunization

    PubMed Central

    1987-01-01

    We have obtained the complete variable region mRNA sequences of 11 LPS-derived and 14 secondary immunization-derived monoclonal IgM anti-IgG antibodies (rheumatoid factors, RFs). A comparative analysis of these sequences showed that monoclonal RFs derived after polyclonal activation are structurally very similar to RFs derived after secondary protein immunization. This study was undertaken to evaluate the potential relationship between two previously described phenomena: (a) during a secondary response to a protein antigen, RF is produced in quantities that equal or exceed the immunogen-specific antibody; and (b) the frequency of B cells that make RF after polyclonal activation is quite high; 3-10%. It has been unclear whether LPS-stimulated cells that produce IgM anti-IgG that is detected by an in vitro assay are related to the cells that produce RF after in vivo stimulation. The similarity of the antigen receptors found in the two types of RF, however, suggests that most or all of the RF-producing B cells detected after LPS stimulation would also be stimulated during the secondary immune response. Thus, the presence of a relatively large number of B cells that can make RF after nonspecific stimulation provides an explanation for the magnitude of RF production accompanying the secondary immune response. PMID:3494096

  6. Vibrio cholerae O395 tcpA pilin gene sequence and comparison of predicted protein structural features to those of type 4 pilins.

    PubMed Central

    Shaw, C E; Taylor, R K

    1990-01-01

    Vibrio cholerae O1 expresses a pilus that is coordinately regulated with cholera toxin production and hence termed TCP, for toxin-coregulated pilus. Insertion of Tn5 IS50L::phoA (TnphoA) into the major pilin subunit gene, tcpA, has previously been shown to render the strain avirulent as a result of its inability to colonize. One such insertion was isolated and used as a probe to screen for clones containing the intact tcpA gene. The DNA sequence of tcpA was determined by using the intact gene and several tcpA-phoA gene fusions. The deduced protein sequence agreed completely with that previously determined for the TcpA N terminus and with the size of the mature pilin protein. The reported homology with N-methylphenylalanine (type 4) pilins near the N terminus was extended and shown to include components of the atypical leader peptide as well as overall predicted structural similarities in other regions of the pilins. In contrast to the modified N-terminal phenylalanine residue found in all characterized type 4 pilins, the corresponding position in tcpA contains a Met codon, thus implying that the previously uncharacterized amino acid corresponding to the N-terminal position of the mature TcpA pilin is a modified form of methionine. Except for this difference, mature TcpA has the overall predicted structural motifs shared among type 4 pilins. PMID:1974887

  7. Identifying and quantifying orphan protein sequences in fungi.

    PubMed

    Ekman, Diana; Elofsson, Arne

    2010-02-19

    For large regions of many proteins, and even entire proteins, no homology to known domains or proteins can be detected. These sequences are often referred to as orphans. Surprisingly, it has been reported that the large number of orphans is sustained in spite of a rapid increase of available genomic sequences. However, it is believed that de novo creation of coding sequences is rare in comparison to mechanisms such as domain shuffling and gene duplication; hence, most sequences should have homologs in other genomes. To investigate this, the sequences of 19 complete fungi genomes were compared. By using the phylogenetic relationship between these genomes, we could identify potentially de novo created orphans in Saccharomyces cerevisiae. We found that only a small fraction, <2%, of the S. cerevisiae proteome is orphan, which confirms that de novo creation of coding sequences is indeed rare. Furthermore, we found it necessary to compare the most closely related species to distinguish between de novo created sequences and rapidly evolving sequences where homologs are present but cannot be detected. Next, the orphan proteins (OPs) and orphan domains (ODs) were characterized. First, it was observed that both OPs and ODs are short. In addition, at least some of the OPs have been shown to be functional in experimental assays, showing that they are not pseudogenes. Furthermore, in contrast to what has been reported before and what is seen for older orphans, S. cerevisiae specific ODs and proteins are not more disordered than other proteins. This might indicate that many of the older, and earlier classified, orphans indeed are fast-evolving sequences. Finally, >90% of the detected ODs are located at the protein termini, which suggests that these orphans could have been created by mutations that have affected the start or stop codons.

  8. PROCAIN: protein profile comparison with assisting information

    PubMed Central

    Wang, Yong; Sadreyev, Ruslan I.; Grishin, Nick V.

    2009-01-01

    Detection of remote sequence homology is essential for the accurate inference of protein structure, function and evolution. The most sensitive detection methods involve the comparison of evolutionary patterns reflected in multiple sequence alignments (MSAs) of protein families. We present PROCAIN, a new method for MSA comparison based on the combination of ‘vertical’ MSA context (substitution constraints at individual sequence positions) and ‘horizontal’ context (patterns of residue content at multiple positions). Based on a simple and tractable profile methodology and primitive measures for the similarity of horizontal MSA patterns, the method achieves a quality of homology detection comparable to that of a more complex, advanced method employing hidden Markov models (HMMs) and secondary structure (SS) prediction. Adding SS information further improves PROCAIN performance beyond the capabilities of current state-of-the-art tools. The potential value of the method for structure/function predictions is illustrated by the detection of subtle homology between evolutionarily distant yet structurally similar protein domains. PROCAIN, relevant databases and tools can be downloaded from: http://prodata.swmed.edu/procain/download. The web server can be accessed at http://prodata.swmed.edu/procain/procain.php. PMID:19357092

  11. Measuring the functional sequence complexity of proteins

    PubMed Central

    Durston, Kirk K; Chiu, David KY; Abel, David L; Trevors, Jack T

    2007-01-01

    Background: Abel and Trevors have delineated three aspects of sequence complexity, Random Sequence Complexity (RSC), Ordered Sequence Complexity (OSC) and Functional Sequence Complexity (FSC), observed in biosequences such as proteins. In this paper, we provide a method to measure functional sequence complexity. Methods and Results: We have extended Shannon uncertainty by incorporating the data variable with a functionality variable. The resulting measured unit, which we call Functional bit (Fit), is calculated from the sequence data jointly with the defined functionality variable. To demonstrate the relevance to functional bioinformatics, a method to measure functional sequence complexity was developed and applied to 35 protein families. Considerations were made in determining how the measure can be used to correlate functionality when relating to the whole molecule and sub-molecule. In the experiment, we show that when the proposed measure is applied to the aligned protein sequences of ubiquitin, 6 of the 7 highest value sites correlate with the binding domain. Conclusion: For future extensions, measures of functional bioinformatics may provide a means to evaluate potential evolving pathways from effects such as mutations, as well as analyzing the internal structural and functional relationships within the 3-D structure of proteins. PMID:18062814
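
    As a rough illustration of a per-site measure of this kind, the sketch below computes, for each column of an alignment, the drop in Shannon uncertainty relative to a null state. It assumes a null state in which all 20 amino acids are equally likely (log2 20, about 4.32 bits); the function names and the toy alignment are hypothetical and are not taken from the authors' software.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def site_uncertainty(column):
    """Shannon uncertainty (bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        # No residues observed: treat the site as fully uncertain (assumption).
        return math.log2(len(AMINO_ACIDS))
    counts = Counter(residues)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def site_fits(alignment):
    """Per-site reduction in uncertainty versus a uniform null state (assumed)."""
    null_uncertainty = math.log2(len(AMINO_ACIDS))
    return [null_uncertainty - site_uncertainty(col) for col in zip(*alignment)]

# Toy alignment of four equal-length sequences (hypothetical data).
toy_alignment = ["MKV", "MRV", "MKL", "MRI"]
fits = site_fits(toy_alignment)
```

    A fully conserved column scores the maximum, and more variable columns score lower; summing the per-site values would give one possible whole-molecule figure under the same assumptions.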

  12. Sequencing proteins with transverse ionic transport

    NASA Astrophysics Data System (ADS)

    Boynton, Paul; di Ventra, Massimiliano

    2015-03-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms. By obtaining the order of the amino acids that compose a given protein one can determine both its secondary and tertiary structures through protein structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer's disease. Mass spectrometry is the current technique of choice for de novo sequencing, but because some amino acids have the same mass the sequence cannot be completely determined in many cases. In this paper we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel, similar to that proposed for DNA sequencing. Indeed, we find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique's potential for de novo protein sequencing.

  13. Simultaneous Alignment and Folding of Protein Sequences

    PubMed Central

    Waldispühl, Jérôme; O'Donnell, Charles W.; Will, Sebastian; Devadas, Srinivas; Backofen, Rolf

    2014-01-01

    Accurate comparative analysis tools for low-homology proteins remain a difficult challenge in computational biology, especially for sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane β-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise and multiple sequence alignment tools in the most difficult low-sequence homology case. It also improves secondary structure prediction where current approaches fail. Importantly, partiFold-Align requires no prior training. These general techniques are widely applicable to many more protein families (partiFold-Align is available at http://partifold.csail.mit.edu/). PMID:24766258

  14. Sequence information signal processor for local and global string comparisons

    DOEpatents

    Peterson, John C.; Chow, Edward T.; Waterman, Michael S.; Hunkapillar, Timothy J.

    1997-01-01

    A sequence information signal processing integrated circuit chip is designed to perform high speed calculation of a dynamic programming algorithm based upon the algorithm defined by Waterman and Smith. The signal processing chip of the present invention is designed to be a building block of a linear systolic array, the performance of which can be increased by connecting additional sequence information signal processing chips to the array. The chip provides a high speed, low cost linear array processor that can locate highly similar global sequences or segments thereof such as contiguous subsequences from two different DNA or protein sequences. The chip is implemented in a preferred embodiment using CMOS VLSI technology to provide the equivalent of about 400,000 transistors or 100,000 gates. Each chip provides 16 processing elements, and is designed to provide 16 bit, two's complement operation for maximum score precision of between -32,768 and +32,767. It is designed to provide a comparison between sequences as long as 4,194,304 elements without external software and between sequences of unlimited numbers of elements with the aid of external software. Each sequence can be assigned different deletion and insertion weight functions. Each processor is provided with a similarity measure device which is independently variable. Thus, each processor can contribute to maximum value score calculation using a different similarity measure.
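
    The dynamic programming recurrence behind such a processor can be sketched in software. The example below is a minimal Smith-Waterman local alignment score, not a model of the patented hardware: the match, mismatch and linear gap values are assumptions chosen for illustration, and each cell is clamped to the signed 16-bit range quoted above to mimic the stated score precision.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def clamp16(value):
    """Saturate a score to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    The scoring parameters are illustrative assumptions.
    """
    previous = [0] * (len(b) + 1)  # scores for row i - 1
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                       # start a new local alignment
                previous[j - 1] + pair,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            )
            current[j] = clamp16(score)
            best = max(best, current[j])
        previous = current
    return best
```

    Only two rows are kept in memory at a time, which is sufficient when just the best score (and not the alignment itself) is needed.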

  15. The DynaMine webserver: predicting protein dynamics from sequence.

    PubMed

    Cilia, Elisa; Pancsa, Rita; Tompa, Peter; Lenaerts, Tom; Vranken, Wim F

    2014-07-01

    Protein dynamics are important for understanding protein function. Unfortunately, accurate protein dynamics information is difficult to obtain: here we present the DynaMine webserver, which provides predictions for the fast backbone movements of proteins directly from their amino-acid sequence. DynaMine rapidly produces a profile describing the statistical potential for such movements at residue-level resolution. The predicted values have meaning on an absolute scale and go beyond the traditional binary classification of residues as ordered or disordered, thus allowing for direct dynamics comparisons between protein regions. Through this webserver, we provide molecular biologists with an efficient and easy to use tool for predicting the dynamical characteristics of any protein of interest, even in the absence of experimental observations. The prediction results are visualized and can be directly downloaded. The DynaMine webserver, including instructive examples describing the meaning of the profiles, is available at http://dynamine.ibsquare.be.

  16. Globally, unrelated protein sequences appear random

    PubMed Central

    Lavelle, Daniel T.; Pearson, William R.

    2010-01-01

    Motivation: To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino acid words in proteins, we compared the frequencies of four- and five-amino acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models. Results: While the human proteome has many overrepresented word clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, in a non-redundant sample of Pfam-AB, only 1% of four-amino acid word clumps (4.7% of 5mer words) are 2-fold overrepresented compared with our simplest random model [MC(0)], and 0.1% (4mers) to 0.5% (5mers) are 2-fold overrepresented compared with a window-shuffled random model. Using a false discovery rate q-value analysis, the number of exceptional four- or five-letter words in real proteins is similar to the number found when comparing words from one random model to another. Consensus overrepresented words are not enriched in conserved regions of proteins, but four-letter words are enriched 1.18- to 1.56-fold in α-helical secondary structures (but not β-strands). Five-residue consensus exceptional words are enriched for α-helix 1.43- to 1.61-fold. Protein word preferences in regular secondary structure do not appear to significantly restrict the use of sequence words in unrelated proteins, although the consensus exceptional words have a secondary structure bias for α-helix. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random. Contact: wrp@virginia.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19948773
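
    The comparison of observed word counts with the simplest random model can be illustrated with a short sketch. It assumes that MC(0) denotes a zero-order Markov chain in which residues are drawn independently from the overall composition, and it counts raw overlapping words rather than the word clumps used in the study; the function name and toy input are hypothetical.

```python
from collections import Counter

def word_enrichment(sequences, k=4):
    """Observed/expected ratio for each k-letter word under a composition-only model."""
    residue_counts = Counter("".join(sequences))
    total_residues = sum(residue_counts.values())
    freq = {aa: n / total_residues for aa, n in residue_counts.items()}

    # Count every overlapping word of length k in every sequence.
    observed = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    positions = sum(observed.values())

    enrichment = {}
    for word, count in observed.items():
        # Expected count = number of word positions x product of residue frequencies.
        expected = positions
        for aa in word:
            expected *= freq[aa]
        enrichment[word] = count / expected
    return enrichment

# Toy input: two homopolymer "proteins" (hypothetical data).
ratios = word_enrichment(["AAAA", "CCCC"], k=2)
```

    In this toy case the words AA and CC each occur twice as often as the composition alone predicts, which corresponds to the 2-fold threshold mentioned in the abstract.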

  17. Protein sequence classification using feature hashing.

    PubMed

    Caragea, Cornelia; Silvescu, Adrian; Mitra, Prasenjit

    2012-06-21

    Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
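
    The idea of hashing k-gram features into a fixed number of buckets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the choice of SHA-256 as the hash function, the bucket count and the function names are all hypothetical.

```python
import hashlib

def kgrams(sequence, k):
    """All overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def hashed_kgram_counts(sequence, k=3, n_buckets=16):
    """Map k-gram counts into a fixed-length vector by hashing each k-gram."""
    vector = [0] * n_buckets
    for gram in kgrams(sequence, k):
        digest = hashlib.sha256(gram.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_buckets
        # Different k-grams may collide in one bucket; their counts are aggregated.
        vector[bucket] += 1
    return vector

# Hypothetical peptide used only to exercise the function.
features = hashed_kgram_counts("MKVLAAGIVGLLLAQ", k=3, n_buckets=16)
```

    The vector length depends only on n_buckets, whereas a plain bag of k-grams over 20 amino acids would need up to 20^k dimensions.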

  18. Dimeric 3-phosphoglycerate kinases from hyperthermophilic Archaea. Cloning, sequencing and expression of the 3-phosphoglycerate kinase gene of Pyrococcus woesei in Escherichia coli and characterization of the protein. Structural and functional comparison with the 3-phosphoglycerate kinase of Methanothermus fervidus.

    PubMed

    Hess, D; Krüger, K; Knappik, A; Palm, P; Hensel, R

    1995-10-01

    The gene coding for the 3-phosphoglycerate kinase (EC 2.7.2.3) of Pyrococcus woesei was cloned and sequenced. The gene sequence comprises 1230 bp coding for a polypeptide with the theoretical M(r) of 46,195. The deduced protein sequence exhibits a high similarity (46.1% and 46.6% identity) to the other known archaeal 3-phosphoglycerate kinases of Methanobacterium bryantii and Methanothermus fervidus [Fabry, S., Heppner, P., Dietmaier, W. & Hensel, R. (1990) Gene 91, 19-25]. By comparing the 3-phosphoglycerate kinase sequences of the mesophilic and the two thermophilic Archaea, trends in thermoadaptation were confirmed that could be deduced from comparisons of glyceraldehyde-3-phosphate dehydrogenase sequences from the same organisms [Zwickl, P., Fabry, S., Bogedain, C., Haas, A. & Hensel, R. (1990) J. Bacteriol. 172, 4329-4338]. With increasing temperature the average hydrophobicity and the portion of aromatic residues increases, whereas the chain flexibility as well as the content in chemically labile residues (Asn, Cys) decreases. To study the phenotypic properties of the 3-phosphoglycerate kinases from thermophilic Archaea in more detail, the 3-phosphoglycerate kinase genes from P. woesei and M. fervidus were expressed in Escherichia coli. Comparisons of kinetic and molecular properties of the enzymes from the original organisms and from E. coli indicate that the proteins expressed in the mesophilic host are folded correctly. Besides their higher thermostability according to their origin from hyperthermophilic organisms, both enzymes differ from their bacterial and eucaryotic homologues mainly in two respects. (a) The 3-phosphoglycerate kinases from P. woesei and M. fervidus are homomeric dimers in their native state contrary to all other known 3-phosphoglycerate kinases, which are monomers including the enzyme from the mesophilic Archaeum M. bryantii. (b) Monovalent cations are essential for the activity of both archaeal enzymes with K+ being significantly more

  20. The PIR-International Protein Sequence Database.

    PubMed

    Barker, W C; Garavelli, J S; Haft, D H; Hunt, L T; Marzec, C R; Orcutt, B C; Srinivasarao, G Y; Yeh, L S; Ledley, R S; Mewes, H W; Pfeiffer, F; Tsugita, A

    1998-01-01

    From its origin the Protein Information Resource (http://www-nbrf.georgetown.edu/pir/) has supported research on evolution and computational biology by designing and compiling a comprehensive, quality controlled, and well-organized protein sequence database. The database has been produced and updated on a regular schedule since 1984. Since 1988 it has been maintained collaboratively by the PIR-International, an association of data collection centers engaged in international cooperation for the development of this research resource during a period of explosive acquisition of new data. As of June 1997, essentially all sequence entries have been classified into families, allowing the efficient application of methods to propagate and standardize annotation among related sequences. The databases are available through the Internet by the World-Wide Web and FTP, or on CD-ROM and magnetic media.

  1. HPMV: human protein mutation viewer - relating sequence mutations to protein sequence architecture and function changes.

    PubMed

    Sherman, Westley Arthur; Kuchibhatla, Durga Bhavani; Limviphuvadh, Vachiranee; Maurer-Stroh, Sebastian; Eisenhaber, Birgit; Eisenhaber, Frank

    2015-10-01

    Next-generation sequencing advances are rapidly expanding the number of human mutations to be analyzed for causative roles in genetic disorders. Our Human Protein Mutation Viewer (HPMV) is intended to explore the biomolecular mechanistic significance of non-synonymous human mutations in protein-coding genomic regions. The tool helps to assess whether protein mutations affect the occurrence of sequence-architectural features (globular domains, targeting signals, post-translational modification sites, etc.). As input, HPMV accepts protein mutations - as UniProt accessions with mutations (e.g. HGVS nomenclature), genome coordinates, or FASTA sequences. As output, HPMV provides an interactive cartoon showing the mutations in relation to elements of the sequence architecture. A large variety of protein sequence architectural features were selected for their particular relevance to mutation interpretation. Clicking a sequence feature in the cartoon expands a tree view of additional information including multiple sequence alignments of conserved domains and a simple 3D viewer mapping the mutation to known PDB structures, if available. The cartoon is also correlated with a multiple sequence alignment of similar sequences from other organisms. In cases where a mutation is likely to have a straightforward interpretation (e.g. a point mutation disrupting a well-understood targeting signal), this interpretation is suggested. The interactive cartoon can be downloaded as a standalone viewer in Java jar format to be saved and viewed later with only a standard Java runtime environment. The HPMV website is: http://hpmv.bii.a-star.edu.sg/.

  2. Sequencing of proteins extracted from stones.

    PubMed

    Binette, J P; Binette, M B

    1994-01-01

    Proteins from urinary tract and gallbladder stones were extracted and characterized to determine the composition of the matrix and possibly unravel the role of the organic phase in stone formation. Proteins from crushed stones were extracted by electrodialysis and concentrated in the Amicon centricon cartridge or by lyophilization after dialysis against distilled water. Aliquots were first analyzed by isoelectric focusing in gel and if suitable subjected to two-dimensional (2D) electrophoresis. The most promising spots were harvested and the N-terminal amino acids sequenced, thus providing maximum information with minimum expenditure of material. The 2D separations and amino acid sequences of several protein extracts demonstrated similarities and differences in composition and achieved the identification or demonstration of previously and recently detected polypeptides.

  3. Integrative visual analysis of protein sequence mutations

    PubMed Central

    2014-01-01

    Background: An important aspect of studying the relationship between protein sequence, structure and function is the molecular characterization of the effect of protein mutations. To understand the functional impact of amino acid changes, the multiple biological properties of protein residues have to be considered together. Results: Here, we present a novel visual approach for analyzing residue mutations. It combines different biological visualizations and integrates them with molecular data derived from external resources. To show various aspects of the biological information on different scales, our approach includes one-dimensional sequence views, three-dimensional protein structure views and two-dimensional views of residue interaction networks as well as aggregated views. The views are linked tightly and synchronized to reduce the cognitive load of the user when switching between them. In particular, the protein mutations are mapped onto the views together with further functional and structural information. We also assess the impact of individual amino acid changes by the detailed analysis and visualization of the involved residue interactions. We demonstrate the effectiveness of our approach and the developed software on the data provided for the BioVis 2013 data contest. Conclusions: Our visual approach and software greatly facilitate the integrative and interactive analysis of protein mutations based on complementary visualizations. The different data views offered to the user are enriched with information about molecular properties of amino acid residues and further biological knowledge. PMID:25237389

  4. Sequence determinants of protein aggregation: tools to increase protein solubility

    PubMed Central

    Ventura, Salvador

    2005-01-01

    Escherichia coli is one of the most widely used hosts for the production of recombinant proteins. However, very often the target protein accumulates into insoluble aggregates in a misfolded and biologically inactive form. Bacterial inclusion bodies are major bottlenecks in protein production and are hampering the development of top priority research areas such as structural genomics. Inclusion body formation was formerly considered to occur via non-specific association of hydrophobic surfaces in folding intermediates. Increasing evidence, however, indicates that protein aggregation in bacteria resembles the well-studied process of amyloid fibril formation. Both processes appear to rely on the formation of specific, sequence-dependent, intermolecular interactions driving the formation of structured protein aggregates. This similarity in the mechanisms of aggregation will probably allow anti-aggregational strategies already tested in the amyloid context to be applied to the less explored area of protein aggregation inside bacteria. Specifically, new sequence-based approaches appear as promising tools to tune protein aggregation in biotechnological processes. PMID:15847694

  5. Sequence analysis of the AAA protein family.

    PubMed Central

    Beyer, A.

    1997-01-01

    The AAA protein family, a recently recognized group of Walker-type ATPases, has been subjected to an extensive sequence analysis. Multiple sequence alignments revealed the existence of a region of sequence similarity, the so-called AAA cassette. The borders of this cassette were localized and within it, three boxes of a high degree of conservation were identified. Two of these boxes could be assigned to substantial parts of the ATP binding site (namely, to Walker motifs A and B); the third may be a portion of the catalytic center. Phylogenetic trees were calculated to obtain insights into the evolutionary history of the family. Subfamilies with varying degrees of intra-relatedness could be discriminated; these relationships are also supported by analysis of sequences outside the canonical AAA boxes: within the cassette are regions that are strongly conserved within each subfamily, whereas little or even no similarity between different subfamilies can be observed. These regions are well suited to define fingerprints for subfamilies. A secondary structure prediction utilizing all available sequence information was performed and the result was fitted to the general 3D structure of a Walker A/GTPase. The agreement was unexpectedly high and strongly supports the conclusion that the AAA family belongs to the Walker superfamily of A/GTPases. PMID:9336829

  6. Protein sequences encode safeguards against aggregation.

    PubMed

    Reumers, Joke; Maurer-Stroh, Sebastian; Schymkowitz, Joost; Rousseau, Fréderic

    2009-03-01

    Functional requirements shaped proteins into globular structures. Under these structural constraints, which require both regular secondary structure and a hydrophobic core, protein aggregation is an unavoidable corollary to protein structure. However, as aggregation results in reduced fitness, natural selection will tend to eliminate strongly aggregating sequences. The analysis of distribution and variation of aggregation patterns in the human proteome using the TANGO algorithm confirms the findings of a previous study on several proteomes: the flanks of aggregation-prone regions are enriched with charged residues and proline, the so-called gatekeeper-residues. Moreover, in this study, we observed a widespread redundancy in gatekeeper usage. Interestingly, aggregating regions from key proteins such as p53 or huntingtin are among the most extensive "gatekept" sequences. As a consequence, mutations that remove gatekeepers could therefore result in a strong increase in disease-susceptibility. In a set of disease-associated mutations from the UniProt database, we find a strong enrichment of mutations that disrupt gatekeeper motifs. Closer inspection of a number of case studies indicates clearly that removing gatekeepers may play a determining role in widely varying disorders, such as van der Woude syndrome (VWS), X-linked Fabry disease (FD), and limb-girdle muscular dystrophy. PMID:19156839

  7. Benchmarking NMR experiments: A relational database of protein pulse sequences

    NASA Astrophysics Data System (ADS)

    Senthamarai, Russell R. P.; Kuprov, Ilya; Pervushin, Konstantin

    2010-03-01

    Systematic benchmarking of multi-dimensional protein NMR experiments is a critical prerequisite for optimal allocation of NMR resources for structural analysis of challenging proteins, e.g. large proteins with limited solubility or proteins prone to aggregation. We propose a set of benchmarking parameters for essential protein NMR experiments organized into a lightweight (single XML file) relational database (RDB), which includes all the necessary auxiliaries (waveforms, decoupling sequences, calibration tables, setup algorithms and an RDB management system). The database is interfaced to the Spinach library (http://spindynamics.org), which enables accurate simulation and benchmarking of NMR experiments on large spin systems. A key feature is the ability to use a single user-specified spin system to simulate the majority of deposited solution state NMR experiments, thus providing the (hitherto unavailable) unified framework for pulse sequence evaluation. This development enables predicting relative sensitivity of deposited implementations of NMR experiments, thus providing a basis for comparison, optimization and, eventually, automation of NMR analysis. The benchmarking is demonstrated with two proteins: the 170-amino-acid I domain of αXβ2 integrin and the 440-amino-acid NS3 helicase.

  8. Performance comparison of Next Generation sequencing platforms.

    PubMed

    Erguner, Bekir; Ustek, Duran; Sagiroglu, Mahmut S

    2015-01-01

    Next Generation DNA Sequencing technologies offer ultra high sequencing throughput for very low prices. The increase in throughput and diminished costs open up new research areas. Moreover, the number of clinicians utilizing DNA sequencing keeps growing. One of the main concerns for researchers and clinicians who are adopting these platforms is their sequencing accuracy. We compared three of the most commonly used Next Generation Sequencing platforms: Ion Torrent from Life Technologies, GS FLX+ from Roche and HiSeq 2000 from Illumina.

  9. Algorithm, applications and evaluation for protein comparison by Ramanujan Fourier transform.

    PubMed

    Zhao, Jian; Wang, Jiasong; Hua, Wei; Ouyang, Pingkai

    2015-12-01

    The amino acid sequence of a protein determines its chemical properties, chain conformation and biological functions. Protein sequence comparison is of great importance to identify similarities of protein structures and infer their functions. Many properties of a protein correspond to the low-frequency signals within the sequence. Low frequency modes in protein sequences are linked to the secondary structures, membrane protein types, and sub-cellular localizations of the proteins. In this paper, we present Ramanujan Fourier transform (RFT) with a fast algorithm to analyze the low-frequency signals of protein sequences. The RFT method is applied to similarity analysis of protein sequences with the Resonant Recognition Model (RRM). The results show that the proposed fast RFT method on protein comparison is more efficient than the commonly used discrete Fourier transform (DFT). RFT can detect common frequencies as significant features for specific protein families, and the RFT spectrum heat-map of protein sequences demonstrates the information conservation in the sequence comparison. The proposed method offers a new tool for pattern recognition, feature extraction and structural analysis on protein sequences.
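
    The transform named in this abstract is built on Ramanujan sums, c_q(n), the sum of cos(2*pi*a*n/q) over all 1 <= a <= q that are coprime to q. The sketch below computes these sums directly and uses them in one common finite-length estimate of an RFT coefficient. It is only an illustration of the definition, not the fast algorithm the paper proposes; the function names are hypothetical, and the numeric encoding of amino acids that the RRM requires is assumed to have been applied already.

```python
from math import cos, gcd, pi


def ramanujan_sum(q: int, n: int) -> int:
    """Return c_q(n): sum of cos(2*pi*a*n/q) over 1 <= a <= q with gcd(a, q) == 1."""
    total = sum(cos(2 * pi * a * n / q) for a in range(1, q + 1) if gcd(a, q) == 1)
    return round(total)  # Ramanujan sums are integers; rounding removes float noise


def totient(q: int) -> int:
    """Euler's totient: how many 1 <= a <= q are coprime to q."""
    return sum(1 for a in range(1, q + 1) if gcd(a, q) == 1)


def rft_coefficient(signal, q: int) -> float:
    """Finite-length estimate: (1 / (phi(q) * N)) * sum_{n=1..N} x(n) * c_q(n)."""
    n_samples = len(signal)
    acc = sum(x * ramanujan_sum(q, n) for n, x in enumerate(signal, start=1))
    return acc / (totient(q) * n_samples)


if __name__ == "__main__":
    # Hypothetical numeric signal with period 2 (x(1) = -1, x(2) = 1, ...).
    signal = [-1, 1, -1, 1]
    for q in (1, 2, 3):
        print(q, rft_coefficient(signal, q))
```

    Because every c_q(n) is an integer, the projection needs only integer multiplications per sample, which is one reason Ramanujan-sum methods are attractive for low-frequency analysis.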

  10. Discovery of Recurrent Sequence Motifs in Saccharomyces cerevisiae Cell Wall Proteins

    PubMed Central

    Coronado, Juan E.; Epstein, Susan L.; Qiu, Wei-Gang; Lipke, Peter N.

    2008-01-01

    This paper describes a procedure for the discovery of recurrent substrings in amino acid sequences of proteins, and its application to fungal cell walls. The evolutionary origins of fungal cell walls are an open biological question. This question can be approached by studies of similarity among the sequences and sub-sequences of fungal wall proteins and by comparison to proteins in animals. We describe here how we have discovered building blocks, represented as recurrent sequence motifs (sub-sequences), within fungal cell wall proteins. These motifs have not been systematically identified before, because the low Shannon entropy of the cell wall sequences has hindered searches for local sequence similarities by sequence alignments. Nonetheless, our new, composition-based scoring matrices for local alignment searches now support statistically valid alignments for such low entropy sequences (Coronado et al. 2006. Euk. Cell 5: 628–637). We have now searched for similarities in a set of 171 known and putative cell wall proteins from baker’s yeast, Saccharomyces cerevisiae. The aligned segments were repeatedly subdivided and catalogued to identify 217 recurrent sequence motifs of length 8 amino acids or greater. 95% of these motifs occur in more than one cell wall protein. The median length of the motifs is 22 amino acid residues, considerably shorter than protein domains. For many cell wall proteins, these motifs collectively account for more than half of their amino acids. The prevalence of these motifs supports the idea of fungal cell wall proteins as assemblies of recurrent building blocks. PMID:19430580
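
    The "low Shannon entropy" mentioned in this abstract can be made concrete with a small calculation over residue frequencies. The following is a minimal sketch, assuming entropy is measured over single-residue composition in bits; the abstract does not say how the authors computed it, and the function name and example sequences are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of a sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    # H = -sum(p * log2(p)) over the observed residue types
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical sequences: a repetitive Ser/Thr-rich stretch and a mixed one.
    print(shannon_entropy("STSTSTSSTTSTST"))
    print(shannon_entropy("MKVLAGDEHRWQYF"))
```

    A sequence that uses all 20 amino acids equally often reaches the maximum of log2(20), about 4.32 bits, so repetitive cell wall sequences sit well below that ceiling.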

  11. Alignment-free protein interaction network comparison

    PubMed Central

    Ali, Waqar; Rito, Tiago; Reinert, Gesine; Sun, Fengzhu; Deane, Charlotte M.

    2014-01-01

    Motivation: Biological network comparison software largely relies on the concept of alignment where close matches between the nodes of two or more networks are sought. These node matches are based on sequence similarity and/or interaction patterns. However, because of the incomplete and error-prone datasets currently available, such methods have had limited success. Moreover, the results of network alignment are in general not amenable for distance-based evolutionary analysis of sets of networks. In this article, we describe Netdis, a topology-based distance measure between networks, which offers the possibility of network phylogeny reconstruction. Results: We first demonstrate that Netdis is able to correctly separate different random graph model types independent of network size and density. The biological applicability of the method is then shown by its ability to build the correct phylogenetic tree of species based solely on the topology of current protein interaction networks. Our results provide new evidence that the topology of protein interaction networks contains information about evolutionary processes, despite the lack of conservation of individual interactions. As Netdis is applicable to all networks because of its speed and simplicity, we apply it to a large collection of biological and non-biological networks where it clusters diverse networks by type. Availability and implementation: The source code of the program is freely available at http://www.stats.ox.ac.uk/research/proteins/resources. Contact: w.ali@stats.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25161230

  12. Diverse nucleotide compositions and sequence fluctuation in Rubisco protein genes

    NASA Astrophysics Data System (ADS)

    Holden, Todd; Dehipawala, S.; Cheung, E.; Bienaime, R.; Ye, J.; Tremberger, G., Jr.; Schneider, P.; Lieberman, D.; Cheung, T.

    2011-10-01

    The Rubisco protein-enzyme is arguably the most abundant protein on Earth. The biology dogma of transcription and translation necessitates the study of the Rubisco genes and Rubisco-like genes in various species. Stronger correlation of fractal dimension of the atomic number fluctuation along a DNA sequence with Shannon entropy has been observed in the studied Rubisco-like gene sequences, suggesting a more diverse evolutionary pressure and constraints in the Rubisco sequences. The strategy of using metal for structural stabilization appears to be an ancient mechanism, with data from the porphobilinogen deaminase gene in Capsaspora owczarzaki and Monosiga brevicollis. Using the chi-square distance probability, our analysis supports the conjecture that the more ancient Rubisco-like sequence in Microcystis aeruginosa would have experienced very different evolutionary pressure and bio-chemical constraint as compared to Bordetella bronchiseptica, the two microbes occupying either end of the correlation graph. Our exploratory study would indicate that high fractal dimension Rubisco sequence would support high carbon dioxide rate via the Michaelis-Menten coefficient; with implication for the control of the whooping cough pathogen Bordetella bronchiseptica, a microbe containing a high fractal dimension Rubisco-like sequence (2.07). Using the internal comparison of chi-square distance probability for 16S rRNA (~ E-22) versus radiation repair Rec-A gene (~ E-05) in high GC content Deinococcus radiodurans, our analysis supports the conjecture that high GC content microbes containing Rubisco-like sequence are likely to include an extra-terrestrial origin, relative to Deinococcus radiodurans. Similar photosynthesis process that could utilize host star radiation would not compete with radiation resistant process from the biology dogma perspective in environments such as Mars and exoplanets.

  13. PROCAIN server for remote protein sequence similarity search

    PubMed Central

    Wang, Yong; Sadreyev, Ruslan I.; Grishin, Nick V.

    2009-01-01

    Sensitive and accurate detection of distant protein homology is essential for the studies of protein structure, function and evolution. We recently developed PROCAIN, a method that is based on sequence profile comparison and involves the analysis of four signals—similarities of residue content at the profile positions combined with three types of assisting information: sequence motifs, residue conservation and predicted secondary structure. Here we present the PROCAIN web server that allows the user to submit a query sequence or multiple sequence alignment and perform the search in a profile database of choice. The output is structured similar to that of BLAST, with the list of detected homologs sorted by E-value and followed by profile–profile alignments. The front page allows the user to adjust multiple options of input processing and output formatting, as well as search settings, including the relative weights assigned to the three types of assisting information. Availability: http://prodata.swmed.edu/procain/ Contact: grishin@chop.swmed.edu PMID:19497935

  14. Complete VAX/VMS DNA/protein sequence analysis system

    SciTech Connect

    Smith, D.W.

    1987-05-01

    A complete yet flexible system of programs and database libraries for analysis of DNA, RNA and protein sequences is implemented for VAX/VMS computers. Types of analysis include 1) construction and analysis of chimeric sequences (cloning in the VAX), 2) multiple analysis of one or more single sequences, 3) search and comparison studies using sequence libraries, and 4) direct input and analysis of experimental data. Published groups of programs, including the Staden, Los Alamos, Zuker, Pearson, and PHYLIP programs, are used. GenBank and EMBL DNA libraries and PIR and Doolittle NEWAT protein libraries are available, with associated programs. The system is tutorial, with online documentation for relevant VAX software, the programs, and the databases. The complete documentation is flexibly maintained on reserve via computer printout placed in 3-ring binders. Command files are used extensively; porting of the entire system to another VAX/VMS system requires modification of a single command. Users of the system are members of a VAX group, with automatic implementation of the system upon login. The present system occupies about 140,000 blocks, and is easily expanded, or contracted, as desired. The UCSD system is used extensively for both teaching and research purposes. Use of microcomputers emulating Tektronix 4014 graphics terminals permits saving of graphics output to disk for subsequent modification to generate high quality publishable figures.

  15. Integrated visual analysis of protein structures, sequences, and feature data

    PubMed Central

    2015-01-01

    Background: To understand the molecular mechanisms that give rise to a protein's function, biologists often need to (i) find and access all related atomic-resolution 3D structures, and (ii) map sequence-based features (e.g., domains, single-nucleotide polymorphisms, post-translational modifications) onto these structures. Results: To streamline these processes we recently developed Aquaria, a resource offering unprecedented access to protein structure information based on an all-against-all comparison of SwissProt and PDB sequences. In this work, we provide a requirements analysis for several frequently occurring tasks in molecular biology and describe how design choices in Aquaria meet these requirements. Finally, we show how the interface can be used to explore features of a protein and gain biologically meaningful insights in two case studies conducted by domain experts. Conclusions: The user interface design of Aquaria enables biologists to gain unprecedented access to molecular structures and simplifies the generation of insight. The tasks involved in mapping sequence features onto structures can be conducted more easily and quickly using Aquaria. PMID:26329268

  16. Comparison of next-generation sequencing systems.

    PubMed

    Liu, Lin; Li, Yinhu; Li, Siliang; Hu, Ni; He, Yimin; Pong, Ray; Lin, Danni; Lu, Lihua; Law, Maggie

    2012-01-01

    With the fast development and wide application of next-generation sequencing (NGS) technologies, genomic sequence information is within reach to aid the achievement of goals to decode life mysteries, make better crops, detect pathogens, and improve life qualities. NGS systems are typically represented by SOLiD/Ion Torrent PGM from Life Sciences, Genome Analyzer/HiSeq 2000/MiSeq from Illumina, and GS FLX Titanium/GS Junior from Roche. Beijing Genomics Institute (BGI), which possesses the world's biggest sequencing capacity, has multiple NGS systems including 137 HiSeq 2000, 27 SOLiD, one Ion Torrent PGM, one MiSeq, and one 454 sequencer. We have accumulated extensive experience in sample handling, sequencing, and bioinformatics analysis. In this paper, technologies of these systems are reviewed, and first-hand data from extensive experience is summarized and analyzed to discuss the advantages and specifics associated with each sequencing system. Finally, applications of NGS are summarized.

  17. DNA Sequencing Using an Engineered Protein Nanopore

    NASA Astrophysics Data System (ADS)

    Gundlach, Jens H.

    2010-03-01

    Inexpensive and fast sequencing of DNA is of paramount importance to medicine, the life sciences and to many other applications. Because of the nanometer diameter of DNA a nanometer-scale reader directly interfaced to macroscopic observables seems particularly attractive. We are working on a new single molecule technique based on a biological pore embedded in a lipid bilayer. When a voltage is applied across the bilayer an ion current is measured that flows through the nanometer opening of the pore. Poly-negatively charged single stranded DNA passes through the pore and reduces the ion current with the remaining ion current being indicative of the nucleotide type in the constriction of the pore. The protein pore that we introduced to the field, MspA, has a shape ideally suited to nanopore sequencing, has robustness comparable to solid state devices, is easily reproduced with sub-nanometer level precision and is engineerable using genetic mutations. I will present proof-of-principle data showing that this technique can lead to a direct very inexpensive and fast sequencing technology. The experimental electronic signatures of the DNA translocation process provide an ideal test bed for molecular dynamics simulations, which in turn allows developing intuition and prediction of nanoscale dynamics.

  18. Sequence-Based Prediction of Type III Secreted Proteins

    PubMed Central

    Arnold, Roland; Brandmaier, Stefan; Kleine, Frederick; Tischler, Patrick; Heinz, Eva; Behrens, Sebastian; Niinikoski, Antti; Mewes, Hans-Werner; Horn, Matthias; Rattei, Thomas

    2009-01-01

    The type III secretion system (TTSS) is a key mechanism for host cell interaction used by a variety of bacterial pathogens and symbionts of plants and animals including humans. The TTSS represents a molecular syringe with which the bacteria deliver effector proteins directly into the host cell cytosol. Despite the importance of the TTSS for bacterial pathogenesis, recognition and targeting of type III secreted proteins has up until now been poorly understood. Several hypotheses are discussed, including an mRNA-based signal, a chaperon-mediated process, or an N-terminal signal peptide. In this study, we systematically analyzed the amino acid composition and secondary structure of N-termini of 100 experimentally verified effector proteins. Based on this, we developed a machine-learning approach for the prediction of TTSS effector proteins, taking into account N-terminal sequence features such as frequencies of amino acids, short peptides, or residues with certain physico-chemical properties. The resulting computational model revealed a strong type III secretion signal in the N-terminus that can be used to detect effectors with sensitivity of ∼71% and selectivity of ∼85%. This signal seems to be taxonomically universal and conserved among animal pathogens and plant symbionts, since we could successfully detect effector proteins if the respective group was excluded from training. The application of our prediction approach to 739 complete bacterial and archaeal genome sequences resulted in the identification of between 0% and 12% putative TTSS effector proteins. Comparison of effector proteins with orthologs that are not secreted by the TTSS showed no clear pattern of signal acquisition by fusion, suggesting convergent evolutionary processes shaping the type III secretion signal. The newly developed program EffectiveT3 (http://www.chlamydiaedb.org) is the first universal in silico prediction program for the identification of novel TTSS effectors. Our findings will

  19. Capsid protein sequence diversity of avian nephritis virus.

    PubMed

    Todd, D; Trudgett, J; Smyth, V J; Donnelly, B; McBride, N; Welsh, M D

    2011-06-01

    The capsid gene sequences of 25 avian nephritis viruses (ANVs), collected in the UK, Germany and Belgium from the 1980s to 2008, were determined and compared with those of serotype 1 (ANV-1) and serotype 2 (ANV-2) ANV isolates. Amino acid identities as low as 51% were determined. Pairwise comparisons supported by phylogenetic analysis identified six ANVs, including ANV-1 and ANV-2, which shared <80% amino acid identities with one another, and which were selected to be representative of six groups. The ANVs were not distributed according to geographical location or year of sampling, and the detection of ANVs from five different groups in 11 samples sourced from six flocks belonging to the same UK organization within a 4-month period indicated that sequence-diverse ANVs were co-circulating. Amino acid alignments demonstrated the existence of variable regions throughout the capsid protein, nine of which were selected for detailed comparisons. With most ANVs, the variable region sequences were similar to those of one of the six representative ANVs, but some ANV capsids displayed novel variable region profiles, in which variable regions that were characteristic of more than one representative ANV were present. Phylogenetic analysis based on C-terminal sequences of approximately 260 amino acids and SimPlot analysis provided evidence that RNA recombination events located in the 1250 to 1350 nucleotide region resulted in new combinations of the N-terminal and C-terminal capsid regions. The high level of capsid sequence diversity observed in the present study has important implications for both the control and diagnosis of ANV infections.
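
    The pairwise amino acid identities quoted in this abstract (for example, the 80% grouping threshold) are the kind of figure produced by comparing two already-aligned sequences column by column. The sketch below shows one way to do that, assuming gap characters are written as "-" and that columns containing a gap are excluded from the denominator. The abstract does not state which denominator the authors used, and the function name and sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical aligned fragments, not real ANV capsid sequences.
    print(percent_identity("MKT-LLVA", "MRTALLIA"))
```

    Other conventions divide by the full alignment length or by the length of the shorter sequence, so identity values from different studies are only comparable when the convention is the same.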

  20. Miraculous catch of iron-sulfur protein sequences in the Sargasso Sea.

    PubMed

    Meyer, Jacques

    2004-07-16

    Recent shotgun sequencing of filtered Sargasso Sea water samples has yielded data in astounding amount and diversity. Iron-sulfur proteins, which are ancient, diverse and ubiquitous, have been implemented here to further probe the sequence diversity of the Sargasso Sea database (SSDB). Sequence searches and comparisons confirm that the SSDB by and large equals in diversity the combined currently available databases. The data thus suggest that microbial diversity has so far been underestimated by orders of magnitude.

  1. PDNAsite: Identification of DNA-binding Site from Protein Sequence by Incorporating Spatial and Sequence Context.

    PubMed

    Zhou, Jiyun; Xu, Ruifeng; He, Yulan; Lu, Qin; Wang, Hongpeng; Kong, Bing

    2016-01-01

    Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most of the existing computational approaches employed only the sequence context of the target residue for its prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove the redundancies in the feature space. Finally, a predictor (PDNAsite) was developed through the integration of the support vector machines (SVM) classifier and ensemble learning. Results on the PDNA-62 and the PDNA-224 datasets demonstrate that features extracted from spatial context provide more information than those from sequence context and the combination of them gives more performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that the interactions between binding sites next to each other are important for protein-DNA recognition and their binding ability. The comparison between our proposed PDNAsite method and the existing methods indicates that PDNAsite outperforms most of the existing methods and is a useful tool for DNA-binding site identification. A web server for our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is freely accessible to the biological research community. PMID:27282833

  2. PDNAsite: Identification of DNA-binding Site from Protein Sequence by Incorporating Spatial and Sequence Context

    PubMed Central

    Zhou, Jiyun; Xu, Ruifeng; He, Yulan; Lu, Qin; Wang, Hongpeng; Kong, Bing

    2016-01-01

    Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most of the existing computational approaches employed only the sequence context of the target residue for its prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove the redundancies in the feature space. Finally, a predictor (PDNAsite) was developed through the integration of the support vector machines (SVM) classifier and ensemble learning. Results on the PDNA-62 and the PDNA-224 datasets demonstrate that features extracted from spatial context provide more information than those from sequence context and the combination of them gives more performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that the interactions between binding sites next to each other are important for protein-DNA recognition and their binding ability. The comparison between our proposed PDNAsite method and the existing methods indicates that PDNAsite outperforms most of the existing methods and is a useful tool for DNA-binding site identification. A web server for our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is freely accessible to the biological research community. PMID:27282833

  3. Geometric Aspects of Biological Sequence Comparison

    PubMed Central

    Stojmirović, Aleksandar

    2009-01-01

    We introduce a geometric framework suitable for studying the relationships among biological sequences. In contrast to previous works, our formulation allows asymmetric distances (quasi-metrics), originating from uneven weighting of strings, which may induce non-trivial partial orders on sets of biosequences. The distances considered are more general than traditional generalized string edit distances. In particular, our framework enables non-trivial conversion between sequence similarities, both local and global, and distances. Our constructions apply to a wide class of scoring schemes and require much less restrictive gap penalties than the ones regularly used. Numerous examples are provided to illustrate the concepts introduced and their potential applications. PMID:19361329

  4. The bioinformatics of nucleotide sequence coding for proteins requiring metal coenzymes and proteins embedded with metals

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Cheung, E.; Holden, T.; Sullivan, R.; Nguyen, A.; Lieberman, D.; Cheung, T.

    2015-09-01

    All metallo-proteins need post-translation metal incorporation. In fact, the isotope ratio of Fe, Cu, and Zn in physiology and oncology have emerged as an important tool. The nickel containing F430 is the prosthetic group of the enzyme methyl coenzyme M reductase which catalyzes the release of methane in the final step of methano-genesis, a prime energy metabolism candidate for life exploration space mission in the solar system. The 3.5 Gyr early life sulfite reductase as a life switch energy metabolism had Fe-Mo clusters. The nitrogenase for nitrogen fixation 3 billion years ago had Mo. The early life arsenite oxidase needed for anoxygenic photosynthesis energy metabolism 2.8 billion years ago had Mo and Fe. The selection pressure in metal incorporation inside a protein would be quantifiable in terms of the related nucleotide sequence complexity with fractal dimension and entropy values. Simulation model showed that the studied metal-required energy metabolism sequences had at least ten times more selection pressure relatively in comparison to the horizontal transferred sequences in Mealybug, guided by the outcome histogram of the correlation R-sq values. The metal energy metabolism sequence group was compared to the circadian clock KaiC sequence group using magnesium atomic level bond shifting mechanism in the protein, and the simulation model would suggest a much higher selection pressure for the energy life switch sequence group. The possibility of using Kepler 444 as an example of ancient life in Galaxy with the associated exoplanets has been proposed and is further discussed in this report. Examples of arsenic metal bonding shift probed by Synchrotron-based X-ray spectroscopy data and Zn controlled FOXP2 regulated pathways in human and chimp brain studied tissue samples are studied in relationship to the sequence bioinformatics. The analysis results suggest that relatively large metal bonding shift amount is associated with low probability correlation R

  5. Protein Sequence Annotation Tool (PSAT): A centralized web-based meta-server for high-throughput sequence annotations

    DOE PAGES

    Leung, Elo; Huang, Amy; Cadag, Eithon; Montana, Aldrin; Soliman, Jan Lorenz; Zhou, Carol L. Ecale

    2016-01-20

    In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.

  6. UFO: a web server for ultra-fast functional profiling of whole genome protein sequences

    PubMed Central

    Meinicke, Peter

    2009-01-01

    Background Functional profiling is a key technique to characterize and compare the functional potential of entire genomes. The estimation of profiles according to an assignment of sequences to functional categories is a computationally expensive task because it requires the comparison of all protein sequences from a genome with a usually large database of annotated sequences or sequence families. Description Based on machine learning techniques for Pfam domain detection, the UFO web server for ultra-fast functional profiling allows researchers to process large protein sequence collections instantaneously. Besides the frequencies of Pfam and GO categories, the user also obtains the sequence specific assignments to Pfam domain families. In addition, a comparison with existing genomes provides dissimilarity scores with respect to 821 reference proteomes. Considering the underlying UFO domain detection, the results on 206 test genomes indicate a high sensitivity of the approach. In comparison with current state-of-the-art HMMs, the runtime measurements show a considerable speed up in the range of four orders of magnitude. For an average size prokaryotic genome, the computation of a functional profile together with its comparison typically requires about 10 seconds of processing time. Conclusion For the first time the UFO web server makes it possible to get a quick overview on the functional inventory of newly sequenced organisms. The genome scale comparison with a large number of precomputed profiles allows a first guess about functionally related organisms. The service is freely available and does not require user registration or specification of a valid email address. PMID:19725959

  7. Genomic Sequence Comparisons, 1987-2003 Final Report

    SciTech Connect

    George M. Church

    2004-07-29

    This project was to develop new DNA sequencing and RNA and protein quantitation methods and related genome annotation tools. The project began in 1987 with the development of multiplex sequencing (published in Science in 1988), and one of the first automated sequencing methods. This led to the first commercial genome sequence in 1994 and to the establishment of the main commercial participants (GTC then Agencourt) in the public DOE/NIH genome project. In collaboration with GTC we contributed to one of the first complete DOE genome sequences, in 1997, that of Methanobacterium thermoautotrophicum, a species of great relevance to energy-rich gas production.

  8. SPIDER: software for protein identification from sequence tags with de novo sequencing error.

    PubMed

    Han, Yonghua; Ma, Bin; Zhang, Kaizhong

    2005-06-01

    For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then used to match, accounting for amino acid mutations, the sequences in a protein database. If the de novo sequencing gives correct tags, the homologs of the proteins can be identified by this approach and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same mass. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on the Internet for free public use. This paper describes the algorithms and features of the SPIDER software. PMID:16108090
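
    The error type this abstract describes, one segment being replaced by another of approximately the same mass, can be illustrated with standard monoisotopic residue masses. The sketch below is not SPIDER's matching algorithm; it only checks whether two short segments are indistinguishable by mass within a tolerance. The function names and the default tolerance of 0.02 Da are assumptions made for illustration.

```python
# Standard monoisotopic residue masses in daltons (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def segment_mass(segment: str) -> float:
    """Sum of residue masses for a peptide segment."""
    return sum(RESIDUE_MASS[residue] for residue in segment)


def same_mass(segment_a: str, segment_b: str, tolerance: float = 0.02) -> bool:
    """True if the two segments differ in mass by at most `tolerance` Da.

    The tolerance is a hypothetical instrument accuracy, not a value from the paper.
    """
    return abs(segment_mass(segment_a) - segment_mass(segment_b)) <= tolerance


if __name__ == "__main__":
    # Classic ambiguities: GG vs N, and AG vs Q, differ by far less than 0.02 Da.
    for pair in (("GG", "N"), ("AG", "Q"), ("GA", "K"), ("GG", "A")):
        print(pair, same_mass(*pair))
```

    Leucine and isoleucine have identical masses, so a mass-only check can never separate them. That is one reason a de novo tag can be wrong at the sequence level while still explaining the spectrum.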

  9. SPIDER: software for protein identification from sequence tags with de novo sequencing error.

    PubMed

    Han, Yonghua; Ma, Bin; Zhang, Kaizhong

    2004-01-01

    For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then used to match, accounting for amino acid mutations, the sequences in a protein database. If the de novo sequencing gives correct tags, the homologs of the proteins can be identified by this approach and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same mass. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on the Internet for free public use. This paper describes the algorithms and features of the SPIDER software. PMID:16448014

  10. Folding and Stabilization of Native-Sequence-Reversed Proteins.

    PubMed

    Zhang, Yuanzhao; Weber, Jeffrey K; Zhou, Ruhong

    2016-04-26

    Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols.

  11. Folding and Stabilization of Native-Sequence-Reversed Proteins

    PubMed Central

    Zhang, Yuanzhao; Weber, Jeffrey K; Zhou, Ruhong

    2016-01-01

    Though the problem of sequence-reversed protein folding is largely unexplored, one might speculate that reversed native protein sequences should be significantly more foldable than purely random heteropolymer sequences. In this article, we investigate how the reverse-sequences of native proteins might fold by examining a series of small proteins of increasing structural complexity (α-helix, β-hairpin, α-helix bundle, and α/β-protein). Employing a tandem protein structure prediction algorithmic and molecular dynamics simulation approach, we find that the ability of reverse sequences to adopt native-like folds is strongly influenced by protein size and the flexibility of the native hydrophobic core. For β-hairpins with reverse-sequences that fail to fold, we employ a simple mutational strategy for guiding stable hairpin formation that involves the insertion of amino acids into the β-turn region. This systematic look at reverse sequence duality sheds new light on the problem of protein sequence-structure mapping and may serve to inspire new protein design and protein structure prediction protocols. PMID:27113844

  12. Orpinomyces cellulase celf protein and coding sequences

    SciTech Connect

    Li, Xin-Liang; Chen, Huizhong; Ljungdahl, Lars G.

    2000-09-05

    A cDNA (1,520 bp), designated celF, consisting of an open reading frame (ORF) encoding a polypeptide (CelF) of 432 amino acids was isolated from a cDNA library of the anaerobic rumen fungus Orpinomyces PC-2 constructed in Escherichia coli. Analysis of the deduced amino acid sequence showed that, starting from the N-terminus, CelF consists of a signal peptide and a cellulose binding domain (CBD), followed by an extremely Asn-rich linker region which separates the CBD and the catalytic domain. The latter is located at the C-terminus. The catalytic domain of CelF is highly homologous to CelA and CelC of Orpinomyces PC-2, to CelA of Neocallimastix patriciarum and also to cellobiohydrolase IIs (CBHIIs) from aerobic fungi. However, like CelA of Neocallimastix patriciarum, CelF does not have the noncatalytic repeated peptide domain (NCRPD) found in CelA and CelC from the same organism. The recombinant protein CelF hydrolyzes cellooligosaccharides in the pattern of CBHII, yielding only cellobiose as product with cellotetraose as the substrate. The genomic celF is interrupted by a 111 bp intron, located within the region coding for the CBD. The intron of celF has features in common with genes from aerobic filamentous fungi.

  13. Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison.

    PubMed

    Birney, E; Durbin, R

    1997-01-01

    We have developed a code generating language, called Dynamite, specialised for the production and subsequent manipulation of complex dynamic programming methods for biological sequence comparison. From a relatively simple text definition file Dynamite will produce a variety of implementations of a dynamic programming method, including database searches and linear space alignments. The speed of the generated code is comparable to hand written code, and the additional flexibility has proved invaluable in designing and testing new algorithms. An innovation is a flexible labelling system, which can be used to annotate the original sequences with biological information. We illustrate the Dynamite syntax and flexibility by showing definitions for dynamic programming routines (i) to align two protein sequences under the assumption that they are both poly-topic transmembrane proteins, with the simultaneous assignment of transmembrane helices and (ii) to align protein information to genomic DNA, allowing for introns and sequencing error.
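
    For readers unfamiliar with the kind of routine Dynamite generates, the following is a minimal hand-written sketch of a dynamic programming method for sequence comparison: a global alignment score with a linear gap penalty. It is not Dynamite syntax or output, and the scoring values (match, mismatch, gap) are arbitrary assumptions for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two sequences by dynamic programming.

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    The scoring values are illustrative defaults, not taken from any tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACDE", "ACE"))
```

    The full table costs memory proportional to the product of the sequence lengths; the linear space alignments mentioned in the abstract avoid storing the whole table.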

  14. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins

    SciTech Connect

    Shen, Yufeng; Tolic, Nikola; Hixson, Kim K.; Purvine, Samuel O.; Anderson, Gordon A.; Smith, Richard D.

    2008-10-15

    De novo sequencing holds promise for discovering protein post-translational modifications; however, such approaches are still in their infancy and are not widely applied in proteomics practice because of their limited reliability. In this work, we describe a de novo sequencing approach for discovery of protein modifications through identification of the UStags (Anal. Chem. 2008, 80, 1871-1882). The de novo information was obtained from Fourier-transform tandem mass spectrometry for peptides and polypeptides in a yeast lysate, and the de novo sequences obtained were filtered to define a more limited set of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags’ prefix and suffix sequences and the UStags themselves) were used to infer the possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances of yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. Random matching of the de novo sequences to the predicted sequences was examined with the use of two random (false) databases, and ~3% false discovery rates were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity are described. The de novo-UStag approach complements the UStag method previously reported by enabling discovery of new protein modifications.

  15. MESSA: MEta-Server for protein Sequence Analysis

    PubMed Central

    2012-01-01

    Background: Computational sequence analysis, that is, prediction of local sequence properties, homologs, spatial structure and function from the sequence of a protein, offers an efficient way to obtain needed information about proteins under study. Since reliable prediction is usually based on the consensus of many computer programs, meta-servers have been developed to fit such needs. Most meta-servers focus on one aspect of sequence analysis, while others incorporate more information, such as PredictProtein for local sequence feature predictions, SMART for domain architecture and sequence motif annotation, and GeneSilico for secondary and spatial structure prediction. However, as predictions of local sequence properties, three-dimensional structure and function are usually intertwined, it is beneficial to address them together. Results: We developed a MEta-Server for protein Sequence Analysis (MESSA) to facilitate comprehensive protein sequence analysis and gather structural and functional predictions for a protein of interest. For an input sequence, the server exploits a number of select tools to predict local sequence properties, such as secondary structure, structurally disordered regions, coiled coils, signal peptides and transmembrane helices; detect homologous proteins and assign the query to a protein family; identify three-dimensional structure templates and generate structure models; and provide predictive statements about the protein's function, including functional annotations, Gene Ontology terms, enzyme classification and possible functionally associated proteins. We tested MESSA on the proteome of Candidatus Liberibacter asiaticus. Manual curation shows that three-dimensional structure models generated by MESSA covered around 75% of all the residues in this proteome and the function of 80% of all proteins could be predicted. Availability: MESSA is free for non-commercial use at http://prodata.swmed.edu/MESSA/ PMID:23031578

  16. Intra-species sequence comparisons for annotating genomes

    SciTech Connect

    Boffelli, Dario; Weer, Claire V.; Weng, Li; Lewis, Keith D.; Shoukry, Malak I.; Pachter, Lior; Keys, David N.; Rubin, Edward M.

    2004-07-15

    Analysis of sequence variation among members of a single species offers a potential approach to identify functional DNA elements responsible for biological features unique to that species. Due to its high rate of allelic polymorphism and ease of genetic manipulability, we chose the sea squirt, Ciona intestinalis, to explore intra-species sequence comparisons for genome annotation. A large number of C. intestinalis specimens were collected from four continents and a set of genomic intervals amplified, resequenced and analyzed to determine the mutation rates at each nucleotide in the sequence. We found that regions with low mutation rates efficiently demarcated functionally constrained sequences: these include a set of noncoding elements, which we showed in C. intestinalis transgenic assays to act as tissue-specific enhancers, as well as the location of coding sequences. This illustrates that comparisons of multiple members of a species can be used for genome annotation, suggesting a path for the annotation of the sequenced genomes of organisms occupying uncharacterized phylogenetic branches of the animal kingdom, and raises the possibility that the resequencing of a large number of Homo sapiens individuals might be used to annotate the human genome and identify sequences defining traits unique to our species. The sequence data from this study have been submitted to GenBank under accession nos. AY667278-AY667407.

  17. Fold Recognition Using Sequence Fingerprints of Protein Local Substructures

    SciTech Connect

    Kryshtafovych, A A; Hvidsten, T; Komorowski, J; Fidelis, K

    2003-06-04

    A protein local substructure (descriptor) is a set of several short non-overlapping fragments of the polypeptide chain. Each descriptor describes the local environment of a particular residue and includes only those segments that are located in the proximity of this residue. Similar descriptors from the representative set of proteins were analyzed to reveal links between the substructures and sequences of their segments. Using the detected sequence-based fingerprints, specific geometrical conformations are assigned to new sequences. The ability of the approach to recognize correct SCOP folds was tested on 273 sequences from the 49 most popular folds. Good predictions were obtained in 85% of cases. No performance drop was observed with decreasing sequence similarity between target sequences and sequences from the training set of proteins.

  18. On combining protein sequences and nucleic acid sequences in phylogenetic analysis: the homeobox protein case.

    PubMed

    Agosti, D; Jacobs, D; DeSalle, R

    1996-01-01

    Amino acid encoding genes contain character state information that may be useful for phylogenetic analysis on at least two levels. The nucleotide sequence and the translated amino acid sequences have both been employed separately as character states for cladistic studies of various taxa, including studies of the genealogy of genes in multigene families. In essence, amino acid sequences and nucleic acid sequences are two different ways of character coding the information in a gene. Silent positions in the nucleotide sequence (first or third positions in codons that can accrue change without changing the identity of the amino acid that the triplet codes for) may accrue change relatively rapidly and become saturated, losing the pattern of historical divergence. On the other hand, non-silent nucleotide alterations and their accompanying amino acid changes may evolve too slowly to reveal relationships among closely related taxa. In general, the dynamics of sequence change in silent and non-silent positions in protein coding genes result in homoplasy and lack of resolution, respectively. We suggest that the combination of nucleic acid and the translated amino acid coded character states into the same data matrix for phylogenetic analysis addresses some of the problems caused by the rapid change of silent nucleotide positions and overall slow rate of change of non-silent nucleotide positions and slowly changing amino acid positions. One major theoretical problem with this approach is the apparent non-independence of the two sources of characters. However, there are at least three possible outcomes when comparing protein coding nucleic acid sequences with their translated amino acids in a phylogenetic context on a codon by codon basis. First, the two character sets for a codon may be entirely congruent with respect to the information they convey about the relationships of a certain set of taxa. Second, one character set may display no information concerning a phylogenetic

  19. Sequencing proteins with transverse ionic transport in nanochannels

    NASA Astrophysics Data System (ADS)

    Boynton, Paul; di Ventra, Massimiliano

    2016-05-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer’s Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique’s potential for de novo protein sequencing.

  20. Sequencing proteins with transverse ionic transport in nanochannels

    PubMed Central

    Boynton, Paul; Di Ventra, Massimiliano

    2016-01-01

    De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer’s Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique’s potential for de novo protein sequencing. PMID:27140520

  1. Nonrandom tripeptide sequence distributions at protein carboxyl termini.

    PubMed

    Gatto, Gregory J; Berg, Jeremy M

    2003-04-01

    The availability of complete genome sequences enables the statistical analysis of sequence features without significant database-imposed bias. The carboxyl termini of proteins often contain regions associated with protein targeting and enhanced translational termination. We analyzed the frequency of occurrence of C-terminal tripeptides in representative archaeal, bacterial, and eukaryotic genomes. The sequence distribution in prokaryotic genomes nearly matches that generated by the randomization of the observed tripeptide set. In contrast, eukaryotic genomes contain large numbers of overrepresented sequences. Some of these correspond to highly repeated sequences from either duplicated endogenous genes or transposon open reading frames. Gratifyingly, others represent previously known targeting signals or sequences associated with an increase in translational termination efficiency. However, a number of overrepresented tripeptides have not been previously noted and may represent novel functional sequences. For example, the sequence XSS may enhance translational termination efficiency in plants, whereas FWC may be a targeting or processing signal for certain amino acid permeases in yeast.
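
    The counting step behind this kind of analysis can be sketched in a few lines. The code below tallies C-terminal tripeptides and builds a randomized control by shuffling residues independently at each of the three positions. The shuffling scheme and all function names are assumptions for illustration; the abstract does not specify how the authors randomized the observed tripeptide set.

```python
import random
from collections import Counter


def c_terminal_tripeptides(proteins):
    """Count the last three residues of every protein of length >= 3."""
    return Counter(seq[-3:] for seq in proteins if len(seq) >= 3)


def shuffled_tripeptides(tripeptides, seed=0):
    """Randomized control: shuffle residues within each position column.

    This keeps the per-position amino acid composition but breaks any
    association between positions. It is one plausible randomization,
    assumed here for the example.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*tripeptides)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(t) for t in zip(*columns)]


if __name__ == "__main__":
    # Toy sequences, not real proteome data.
    proteins = ["MKTSKL", "MAASKL", "MGGHDEL", "MK"]
    observed = c_terminal_tripeptides(proteins)
    print(observed.most_common())
    print(shuffled_tripeptides(list(observed.elements())))
```

    A tripeptide is a candidate for overrepresentation when its observed count is well above the counts seen across many such randomizations.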

  2. The SBASE protein domain library, release 3.0: a collection of annotated protein sequence segments.

    PubMed

    Pongor, S; Hátsági, Z; Degtyarenko, K; Fábián, P; Skerl, V; Hegyi, H; Murvai, J; Bevilacqua, V

    1994-09-01

    SBASE 3.0 is the third release of SBASE, a collection of annotated protein domain sequences. SBASE entries represent various structural, functional, ligand-binding and topogenic segments of proteins as defined by their publishing authors. SBASE can be used for establishing domain homologies using different database-search tools such as FASTA [Lipman and Pearson (1985) Science, 227, 1436-1441], and BLAST3 [Altschul and Lipman (1990) Proc. Natl. Acad. Sci. USA, 87, 5509-5513], which is especially useful in the case of loosely defined domain types for which efficient consensus patterns cannot be established. The present release contains 41,749 entries provided with standardized names and cross-referenced to the major protein and nucleic acid databanks as well as to the PROSITE catalogue of protein sequence patterns. The entries are clustered into 2285 groups using the BLAST algorithm for computing similarity measures. SBASE 3.0 is freely available on request to the authors or by anonymous 'ftp' file transfer from < ftp.icgeb.trieste.it >. Individual records can be retrieved with the gopher server at < icgeb.trieste.it > and with a www-server at < http://www.icgeb.trieste.it >. Automated searching of SBASE by BLAST can be carried out with the electronic mail server < sbase@icgeb.trieste.it >. Another mail server < domain@hubi.abc.hu > assigns SBASE domain homologies on the basis of SWISS-PROT searches. A comparison of pertinent search strategies is presented.

  3. Protein Sequence Classification with Improved Extreme Learning Machine Algorithms

    PubMed Central

    2014-01-01

    Precisely classifying a protein sequence from a large biological protein sequence database plays an important role in developing competitive pharmacological products. Because conventional methods compare the unseen sequence with all the identified protein sequences and return the category index of the protein with the highest similarity score, they are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single-hidden-layer feedforward neural networks (SLFNs). The recent efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimal pruned ELM (OP-ELM) is first employed for protein sequence classification in this paper. To further enhance the performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms. PMID:24795876
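
    The majority voting step mentioned in this abstract can be sketched as follows. The code only combines labels that individual classifiers have already produced; it does not implement ELM or OP-ELM training. The function names and the family labels in the usage example are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the most ensemble members.

    On a tie, Counter.most_common keeps first-seen order, so the label
    that appears earliest among the tied ones is returned.
    """
    return Counter(labels).most_common(1)[0][0]


def ensemble_predict(per_model_predictions):
    """Combine predictions from several models, one vote per model.

    per_model_predictions is a list with one entry per model; each entry
    is that model's list of predicted labels, in the same sequence order.
    """
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


if __name__ == "__main__":
    # Three hypothetical classifiers voting on two query sequences.
    predictions = [
        ["globin", "kinase"],
        ["globin", "protease"],
        ["kinase", "protease"],
    ]
    print(ensemble_predict(predictions))
```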

  4. Meta sequence analysis of human blood peptides and their parent proteins.

    PubMed

    Bowden, Peter; Pendrak, Voitek; Zhu, Peihong; Marshall, John G

    2010-04-18

    Sequence analysis of the blood peptides and their qualities will be key to understanding the mechanisms that contribute to error in LC-ESI-MS/MS. Analysis of peptides and their proteins at the level of sequences is much more direct and informative than the comparison of disparate accession numbers. A portable database of all blood peptide and protein sequences with descriptor fields and gene ontology terms might be useful for designing immunological or MRM assays from human blood. The results of twelve studies of human blood peptides and/or proteins identified by LC-MS/MS and correlated against a disparate array of genetic libraries were parsed and matched to proteins from the human ENSEMBL, SwissProt and RefSeq databases by SQL. The reported peptide and protein sequences were organized into an SQL database with full protein sequences and up to five unique peptides in order of prevalence along with the peptide count for each protein. Structured query language or BLAST was used to acquire descriptive information in current databases. Sampling error at the level of peptides is the largest source of disparity between groups. Chi Square analysis of peptide to protein distributions confirmed the significant agreement between groups on identified proteins.

  5. Methods of protein structure comparison

    PubMed Central

    Kufareva, Irina; Abagyan, Ruben

    2015-01-01

    Despite its apparent simplicity, the problem of quantifying the differences between two structures of the same protein or complex is non-trivial and continues evolving. In this chapter, we described several methods routinely used to compare computational models to experimental answers in several modeling assessments. The two major classes of measures, positional distance-based and contact-based, were presented, compared and analyzed. The most popular measure of the first class, the global RMSD, is shown to be the least representative of the degree of structural similarity because it is dominated by the largest error. Several distance-dependent algorithms designed to attenuate the drawbacks of RMSD are described. Measures of the second class, contact-based, are shown to be more robust and relevant. We also illustrate the importance of using combined measures, utility-based measures, and the role of the distributions derived from the pairs of experimental structures in interpreting the results. PMID:22323224
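
    The claim that global RMSD is dominated by the largest error can be checked with a small numerical sketch. The code assumes the two coordinate sets are already superimposed and atom-matched (no fitting is performed), and the second measure is a simple distance-cutoff fraction chosen for contrast, not one of the contact-based measures described in the chapter. All names and the 1.0 Å cutoff are assumptions.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Assumes the structures are already superimposed; each item is (x, y, z).
    """
    squared = [
        sum((p - q) ** 2 for p, q in zip(a, b))
        for a, b in zip(coords_a, coords_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


def fraction_within(coords_a, coords_b, cutoff=1.0):
    """Fraction of atoms whose displacement is at most cutoff (same units)."""
    close = sum(
        1 for a, b in zip(coords_a, coords_b) if math.dist(a, b) <= cutoff
    )
    return close / len(coords_a)


if __name__ == "__main__":
    # Ten atoms on a line; the model is perfect except for one atom
    # displaced by 10 units.
    reference = [(float(i), 0.0, 0.0) for i in range(10)]
    model = list(reference)
    model[9] = (9.0, 10.0, 0.0)
    print(rmsd(reference, model))             # about 3.16
    print(fraction_within(reference, model))  # 0.9
```

    Nine of the ten atoms are placed exactly, yet the RMSD exceeds 3 units because the single outlier contributes the entire squared error.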

  6. A dotplot program for the Atari ST, for the analysis of DNA and protein sequences.

    PubMed

    Karreman, C

    1992-02-01

    A program was written in GFA-BASIC for the Atari ST microcomputer aimed at drawing two-dimensional homology 'dotplot' patterns for two protein or DNA sequences. The program, built around a machine-code subroutine, communicates interactively with the user by means of a multi-button dialogue panel and mouse-directed input. A 1000 X 1000 sequence comparison with a 14:21 stringency window takes 12 s.
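
    The windowed dotplot computation can be sketched in Python. This is not the GFA-BASIC program or its machine-code routine. It assumes that "14:21" means a dot is drawn when at least 14 of the 21 positions in a window match, which is a common reading of a stringency window but is not stated in the abstract.

```python
def dotplot(seq_a, seq_b, window=21, stringency=14):
    """Return (i, j) window start positions that pass the match threshold.

    A dot is recorded when the windows seq_a[i:i+window] and
    seq_b[j:j+window] agree at `stringency` or more positions.
    The 14-of-21 default is an assumed interpretation of "14:21".
    """
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= stringency:
                dots.append((i, j))
    return dots


if __name__ == "__main__":
    # A short repeat compared with itself, using a small exact-match window.
    for dot in dotplot("ACGTACGT", "ACGTACGT", window=4, stringency=4):
        print(dot)
```

    The main diagonal reflects the self-comparison, and the off-diagonal dots at (0, 4) and (4, 0) reveal the internal repeat. This direct version does work proportional to the product of the two sequence lengths and the window size, which helps explain why the original relied on a machine-code subroutine.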

  7. Amino acid sequences of proteins from Leptospira serovar pomona.

    PubMed

    Alves, S F; Lefebvre, R B; Probert, W

    2000-01-01

    This report describes partial amino acid sequences from three putative outer envelope proteins from Leptospira serovar pomona. In order to obtain internal fragments for protein sequencing, enzymatic and chemical digestion was performed. The enzyme clostripain was used to digest the 32 and 45 kDa proteins. In situ digestion of the 40 kDa protein was accomplished using cyanogen bromide. The 32 kDa protein generated two fragments, one of 21 kDa and another of 10 kDa that yielded five residues. A fragment of 24 kDa that yielded nineteen amino acid residues was obtained from the 45 kDa protein. A fragment with a molecular weight of 20 kDa, yielding a sequence of twenty amino acids, was obtained from the 40 kDa protein.

  8. A 3D sequence-independent representation of the protein data bank.

    PubMed

    Fischer, D; Tsai, C J; Nussinov, R; Wolfson, H

    1995-10-01

    Here we address the following questions. How many structurally different entries are there in the Protein Data Bank (PDB)? How do the proteins populate the structural universe? To investigate these questions a structurally non-redundant set of representative entries was selected from the PDB. Construction of such a dataset is not trivial: (i) the considerable size of the PDB requires a large number of comparisons (there were more than 3250 structures of protein chains available in May 1994); (ii) the PDB is highly redundant, containing many structurally similar entries, not necessarily with significant sequence homology, and (iii) there is no clear-cut definition of structural similarity. The latter depends on the criteria and methods used. Here, we analyze structural similarity ignoring protein topology. To date, representative sets have been selected either by hand, by sequence comparison techniques which ignore the three-dimensional (3D) structures of the proteins or by using sequence comparisons followed by linear structural comparison (i.e. the topology, or the sequential order of the chains, is enforced in the structural comparison). Here we describe a 3D sequence-independent automated and efficient method to obtain a representative set of protein molecules from the PDB which contains all unique structures and which is structurally non-redundant. The method has two novel features. The first is the use of strictly structural criteria in the selection process without taking into account the sequence information. To this end we employ a fast structural comparison algorithm which requires on average approximately 2 s per pairwise comparison on a workstation. The second novel feature is the iterative application of a heuristic clustering algorithm that greatly reduces the number of comparisons required. We obtain a representative set of 220 chains with resolution better than 3.0 Å, or 268 chains including lower resolution entries, NMR entries and models. The

  10. Comparison of mitochondrial genome sequences of pangolins (Mammalia, Pholidota).

    PubMed

    Hassanin, Alexandre; Hugot, Jean-Pierre; van Vuuren, Bettine Jansen

    2015-04-01

    The complete mitochondrial genome was sequenced for three species of pangolins, Manis javanica, Phataginus tricuspis, and Smutsia temminckii, and comparisons were made with two other species, Manis pentadactyla and Phataginus tetradactyla. The genome of Manidae contains the 37 genes found in a typical mammalian genome, and the structure of the control region is highly conserved among species. In Manis, the overall base composition differs from that found in African genera. Phylogenetic analyses support the monophyly of the genera Manis, Phataginus, and Smutsia, as well as the basal division between Maninae and Smutsiinae. Comparisons with GenBank sequences reveal that the reference genomes of M. pentadactyla and P. tetradactyla (accession numbers NC_016008 and NC_004027) were sequenced from misidentified taxa, and that a new species of tree pangolin should be described in Gabon. PMID:25746396

  12. Immobilized residue-specific endoproteinases for protein sequencing.

    PubMed

    Ronnenberg, J; Preitz, B; Wöstemeier, G; Diekmann, S

    1994-06-01

    Before proteins can be sequenced, the peptide chain has to be cut into small fragments of less than about 50 amino acids using residue-specific endoproteinases. These enzymes can be immobilized in a highly active form. Using immobilized endoproteinases for protein sequencing offers a series of advantages: (1) the high enzyme activity in the column results in short reaction times; (2) the protein fragments are easily eluted from the column whilst the endoproteinase is completely retained on the column; the protein fragments are clean, yielding a low sequencing background; (3) the protein sample to be sequenced is free of exogenous enzymes; (4) endoproteinase self-digestion is prevented by immobilization; therefore, the sample solution does not contain any endoproteinase fragments; (5) enzymes are especially stable when immobilized. Columns with immobilized endoproteinases can be applied repeatedly and stored for many months.

  13. Dissecting the relationship between protein structure and sequence variation

    NASA Astrophysics Data System (ADS)

    Shahmoradi, Amir; Wilke, Claus; Wilke Lab Team

    2015-03-01

    Over the past decade several independent works have shown that some structural properties of proteins are capable of predicting protein evolution. The strength and significance of these structure-sequence relations, however, appear to vary widely among different proteins, with absolute correlation strengths ranging from 0.1 to 0.8. Here we present the results from a comprehensive search for the potential biophysical and structural determinants of protein evolution by studying more than 200 structural and evolutionary properties in a dataset of 209 monomeric enzymes. We discuss the main protein characteristics responsible for the general patterns of protein evolution, and identify sequence divergence as the main determinant of the strengths of virtually all structure-evolution relationships, explaining ~10-30% of observed variation in sequence-structure relations. In addition to sequence divergence, we identify several protein structural properties that are moderately but significantly coupled with the strength of sequence-structure relations. In particular, proteins with more homogeneous backbone hydrogen bond energies, large fractions of helical secondary structures and low fraction of beta sheets tend to have the strongest sequence-structure relation. BEACON-NSF center for the study of evolution in action.

  14. PROMALS web server for accurate multiple protein sequence alignments.

    PubMed

    Pei, Jimin; Kim, Bong-Hyun; Tang, Ming; Grishin, Nick V

    2007-07-01

    Multiple sequence alignments are essential in homology inference, structure modeling, functional prediction and phylogenetic analysis. We developed a web server that constructs multiple protein sequence alignments using PROMALS, a progressive method that improves alignment quality by using additional homologs from PSI-BLAST searches and secondary structure predictions from PSIPRED. PROMALS shows higher alignment accuracy than other advanced methods, such as MUMMALS, ProbCons, MAFFT and SPEM. The PROMALS web server takes FASTA format protein sequences as input. The output includes a colored alignment augmented with information about sequence grouping, predicted secondary structures and positional conservation. The PROMALS web server is available at: http://prodata.swmed.edu/promals/ PMID:17452345

  15. Cytoplasmic intermediate filament proteins of invertebrates are closer to nuclear lamins than are vertebrate intermediate filament proteins; sequence characterization of two muscle proteins of a nematode.

    PubMed Central

    Weber, K; Plessmann, U; Ulrich, W

    1989-01-01

    The giant body muscle cells of the nematode Ascaris lumbricoides show a complex three dimensional array of intermediate filaments (IFs). They contain two proteins, A (71 kd) and B (63 kd), which we now show are able to form homopolymeric filaments in vitro. The complete amino acid sequence of B and 80% of A have been determined. A and B are two homologous proteins with a 55% sequence identity over the rod and tail domains. Sequence comparisons with the only other invertebrate IF protein currently known (Helix pomatia) and with vertebrate IF proteins show that along the coiled-coil rod domain, sequence principles rather than actual sequences are conserved in evolution. Noticeable exceptions are the consensus sequences at the ends of the rod, which probably play a direct role in IF assembly. Like the Helix IF protein the nematode proteins have six extra heptads in the coil 1b segment. These are characteristic of nuclear lamins from vertebrates and invertebrates and are not found in vertebrate IF proteins. Unexpectedly the enhanced homology between lamins and invertebrate IF proteins continues in the tail domains, which in vertebrate IF proteins totally diverge. The sequence alignment necessitates the introduction of a 15 residue deletion in the tail domain of all three invertebrate IF proteins. Its location coincides with the position of the karyophilic signal sequence, which dictates nuclear entry of the lamins. The results provide the first molecular support for the speculation that nuclear lamins and cytoplasmic IF proteins arose in eukaryotic evolution from a common lamin-like predecessor. PMID:2583097

  16. A convenient and adaptable microcomputer environment for DNA and protein sequence manipulation and analysis.

    PubMed Central

    Pustell, J; Kafatos, F C

    1986-01-01

    We describe the further development of a widely used package of DNA and protein sequence analysis programs for microcomputers (1,2,3). The package now provides a screen oriented user interface, and an enhanced working environment with powerful formatting, disk access, and memory management tools. The new GenBank floppy disk database is supported transparently to the user and a similar version of the NBRF protein database is provided. The programs can use sequence file annotation to automatically annotate printouts and translate or extract specified regions from sequences by name. The sequence comparison programs can now perform a 5000 X 5000 bp analysis in 12 minutes on an IBM PC. A program to locate potential protein coding regions in nucleic acids, a digitizer interface, and other additions are also described. PMID:3753784

  17. Detection of protein similarities using nucleotide sequence databases.

    PubMed

    Henikoff, S; Wallace, J C

    1988-07-11

    A simple procedure is described for finding similarities between proteins using nucleotide sequence databases. The approach is illustrated by several examples of previously unknown correspondences with important biological implications: Drosophila elongation factor Tu is shown to be encoded by two genes that are differently expressed during development; a cluster of three Drosophila genes likely encode maltases; a flesh-fly fat body protein resembles the hypothesized Drosophila alcohol dehydrogenase ancestral protein; an unknown protein encoded at the multifunctional E. coli hisT locus resembles aspartate beta-semialdehyde dehydrogenase; and the E. coli tyrR protein is related to nitrogen regulatory proteins. These and other matches were discovered using a personal computer of the type available in most laboratories collecting DNA sequence data. As relatively few sequences were sampled to find these matches, it is likely that much of the existing data has not been adequately examined.

  18. Nucleotide sequence of a cloned duck hepatitis B virus genome: comparison with woodchuck and human hepatitis B virus sequences.

    PubMed Central

    Mandart, E; Kay, A; Galibert, F

    1984-01-01

    The nucleotide sequence of an EcoRI duck hepatitis B virus (DHBV) clone was elucidated by using the Maxam and Gilbert method. This sequence, which is 3,021 nucleotides long, was compared with the two previously analyzed hepatitis B-like viruses (human and woodchuck). From this comparison, it was shown that DHBV is derived from an ancestor common to the two others but has a slightly different genomic organization. There was no intergenic region between genes 5 and 8, which were fused into a single open reading frame in DHBV. Genes for the surface and core proteins were assigned to open reading frames 7 and 5/8. Amino acid comparisons showed some structural relationship between gene 6 product and avian reverse transcriptase, suggesting either evolution from a common ancestor or convergence to some particular structure to fulfill a specific function. This should be correlated with the synthesis of an RNA intermediate during DNA replication. This is also taken as an argument in favor of the hypothesis that gene 6 codes for the DNA polymerase that is found within the virion. DNA sequence comparison also showed that the two mammalian hepatitis B viruses are more homologous to each other than they are to DHBV, indicating that DHBV started to evolve on its own earlier than the two other viruses, as did birds compared with mammals. From this it is proposed that the viruses evolved in a fashion parallel to the species they infect. PMID:6699938

  19. What Makes a Protein Sequence a Prion?

    PubMed Central

    Sabate, Raimon; Rousseau, Frederic; Schymkowitz, Joost; Ventura, Salvador

    2015-01-01

    Typical amyloid diseases such as Alzheimer's and Parkinson's were thought to exclusively result from de novo aggregation, but recently it was shown that amyloids formed in one cell can cross-seed aggregation in other cells, following a prion-like mechanism. Despite the large experimental effort devoted to understanding the phenomenon of prion transmissibility, it is still poorly understood how this property is encoded in the primary sequence. In many cases, prion structural conversion is driven by the presence of relatively large glutamine/asparagine (Q/N) enriched segments. Several studies suggest that it is the amino acid composition of these regions rather than their specific sequence that accounts for their priogenicity. However, our analysis indicates that it is instead the presence and potency of specific short amyloid-prone sequences that occur within intrinsically disordered Q/N-rich regions that determine their prion behaviour, modulated by the structural and compositional context. This provides a basis for the accurate identification and evaluation of prion candidate sequences in proteomes in the context of a unified framework for amyloid formation and prion propagation. PMID:25569335

  20. Assessing the Drosophila melanogaster and Anopheles gambiae Genome Annotations Using Genome-Wide Sequence Comparisons

    PubMed Central

    Jaillon, Olivier; Dossat, Carole; Eckenberg, Ralph; Eiglmeier, Karin; Segurens, Béatrice; Aury, Jean-Marc; Roth, Charles W.; Scarpelli, Claude; Brey, Paul T.; Weissenbach, Jean; Wincker, Patrick

    2003-01-01

    We performed genome-wide sequence comparisons at the protein coding level between the genome sequences of Drosophila melanogaster and Anopheles gambiae. Such comparisons detect evolutionarily conserved regions (ecores) that can be used for a qualitative and quantitative evaluation of the available annotations of both genomes. They also provide novel candidate features for annotation. The percentage of ecores mapping outside annotations in the A. gambiae genome is about fourfold higher than in D. melanogaster. The A. gambiae genome assembly also contains a high proportion of duplicated ecores, possibly resulting from artefactual sequence duplications in the genome assembly. The occurrence of 4063 ecores in the D. melanogaster genome outside annotations suggests that some genes are not yet or only partially annotated. The present work illustrates the power of comparative genomics approaches towards an exhaustive and accurate establishment of gene models and gene catalogues in insect genomes. PMID:12840038

  1. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins.

    PubMed

    Shen, Yufeng; Tolić, Nikola; Hixson, Kim K; Purvine, Samuel O; Anderson, Gordon A; Smith, Richard D

    2008-10-15

    De novo sequencing is a spectrum analysis approach for mass spectrometry data to discover post-translational modifications in proteins; however, such an approach is still in its infancy and is still not widely applied to proteomic practices due to its limited reliability. In this work, we describe a de novo sequencing approach for the discovery of protein modifications based on identification of the proteome UStags (Shen, Y.; Tolić, N.; Hixson, K. K.; Purvine, S. O.; Pasa-Tolić, L.; Qian, W. J.; Adkins, J. N.; Moore, R. J.; Smith, R. D. Anal. Chem. 2008, 80, 1871-1882). The de novo information was obtained from Fourier-transform tandem mass spectrometry data for peptides and polypeptides from a yeast lysate, and the de novo sequences obtained were selected based on filter levels designed to provide a limited yet high quality subset of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags' prefix and suffix sequences and the UStags themselves) were used to infer possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances within several yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. To determine false discovery rates, two random (false) databases were independently used for sequence matching, and ~3% false discovery rates were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity of the approach were investigated and described. The combined de novo-UStag approach complements the UStag method previously reported by enabling the discovery of new protein modifications. PMID:18783246
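
    The false discovery rate estimate mentioned above can be illustrated with simple arithmetic. This is a sketch under stated assumptions rather than the authors' exact calculation: it assumes the rate is taken as the average number of matches against the random (false) databases divided by the number of matches against the real database. The function name and all counts are invented for illustration.

```python
def estimated_fdr(target_matches, decoy_matches_per_db):
    """Estimate a false discovery rate from random-database matches.

    Assumption: FDR ~ (mean matches against the random databases)
                      / (matches against the real database).
    """
    if target_matches == 0:
        raise ValueError("no target matches to normalise against")
    mean_decoy = sum(decoy_matches_per_db) / len(decoy_matches_per_db)
    return mean_decoy / target_matches


# Made-up counts: 1000 matches in the real database and 28 and 32
# matches in two independently generated random databases.
fdr = estimated_fdr(1000, [28, 32])
print(f"{fdr:.1%}")  # 3.0%
```

    Using more than one random database, as the abstract describes, gives a rough sense of how much the estimate fluctuates from one random database to the next.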

  3. Using homology relations within a database markedly boosts protein sequence similarity search.

    PubMed

    Tong, Jing; Sadreyev, Ruslan I; Pei, Jimin; Kinch, Lisa N; Grishin, Nick V

    2015-06-01

    Inference of homology from protein sequences provides an essential tool for analyzing protein structure, function, and evolution. Current sequence-based homology search methods are still unable to detect many similarities evident from protein spatial structures. In computer science a search engine can be improved by considering networks of known relationships within the search database. Here, we apply this idea to protein-sequence-based homology search and show that it dramatically enhances the search accuracy. Our new method, COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query and hit's known homologs. This approach increases detection quality, boosting the precision rate from 18% to 83% at half-coverage of all database homologs. The increased precision rate allows detection of a large fraction of protein structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities. The web server is available at prodata.swmed.edu/compadre. PMID:26038555
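
    A figure such as "precision at half-coverage" can be computed from a ranked hit list. The sketch below assumes one common definition (walk down the ranked hits until half of all true homologs in the database have been recovered, then report the fraction of hits so far that are true homologs). The abstract does not spell out its exact procedure, and the function name and example labels are hypothetical.

```python
def precision_at_coverage(ranked_labels, total_homologs, coverage=0.5):
    """Precision at the rank where `coverage` of all homologs is reached.

    ranked_labels  : booleans, best-scoring hit first; True = real homolog
    total_homologs : number of true homologs present in the database
    Returns None if the ranked list never reaches the requested coverage.
    """
    needed = coverage * total_homologs
    true_positives = 0
    for rank, is_homolog in enumerate(ranked_labels, start=1):
        if is_homolog:
            true_positives += 1
        if true_positives >= needed:
            return true_positives / rank
    return None


# Toy ranking for a database assumed to contain 6 true homologs.
ranked = [True, True, False, True, False, False, True, False]
print(precision_at_coverage(ranked, total_homologs=6))  # 0.75
```

    In this toy case half-coverage (3 of 6 homologs) is reached at rank 4, so the precision is 3/4.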

  4. Sequence and comparative genomic analysis of actin-related proteins.

    PubMed

    Muller, Jean; Oma, Yukako; Vallar, Laurent; Friederich, Evelyne; Poch, Olivier; Winsor, Barbara

    2005-12-01

    Actin-related proteins (ARPs) are key players in cytoskeleton activities and nuclear functions. Two complexes, ARP2/3 and ARP1/11, also known as dynactin, are implicated in actin dynamics and in microtubule-based trafficking, respectively. ARP4 to ARP9 are components of many chromatin-modulating complexes. Conventional actins and ARPs codefine a large family of homologous proteins, the actin superfamily, with a tertiary structure known as the actin fold. Because ARPs and actin share high sequence conservation, clear family definition requires distinct features to easily and systematically identify each subfamily. In this study we performed an in-depth sequence and comparative genomic analysis of ARP subfamilies. A high-quality multiple alignment of approximately 700 complete protein sequences homologous to actin, including 148 ARP sequences, allowed us to extend the ARP classification to new organisms. Sequence alignments revealed conserved residues, motifs, and inserted sequence signatures to define each ARP subfamily. These discriminative characteristics allowed us to develop ARPAnno (http://bips.u-strasbg.fr/ARPAnno), a new web server dedicated to the annotation of ARP sequences. Analyses of sequence conservation among actins and ARPs highlight part of the actin fold and suggest interactions between ARPs and actin-binding proteins. Finally, analysis of ARP distribution across eukaryotic phyla emphasizes the central importance of nuclear ARPs, particularly the multifunctional ARP4.

  5. Understanding the sequence determinants of conformational switching using protein design.

    PubMed Central

    Dalal, S.; Regan, L.

    2000-01-01

    An important goal of protein design is to understand the forces that stabilize a particular fold in preference to alternative folds. Here, we describe an extension of earlier studies in which we successfully designed a stable, native-like helical protein that is 50% identical in sequence to a predominantly beta-sheet protein, the B1 domain of Streptococcal IgG-binding protein G. We report the characteristics of a series of variants of our original design that have even higher sequence identity to the B1 domain. Their properties illustrate the extent to which protein stability and conformation can be modulated through careful manipulation of key amino acid residues. Our results have implications for understanding conformational change phenomena of central biological importance and in probing the malleability of the sequence/structure relationship. PMID:11045612

  6. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm

    PubMed Central

    Kumar, Manish

    2015-01-01

    One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic problem of multiple sequence alignment is to determine the most biologically plausible alignment of protein or DNA sequences. In this paper, an alignment method using a genetic algorithm for multiple sequence alignment is proposed. Two genetic operators, namely crossover and mutation, were defined and implemented with the proposed method in order to track the evolution of the population and the quality of the aligned sequences. The proposed method is assessed with a protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8. Experiments on a wide range of data have shown that the proposed algorithm is much better (in terms of score) than previously proposed algorithms in its ability to achieve high alignment quality. PMID:27065770
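
    A genetic algorithm for alignment needs a fitness function to rank candidate alignments before crossover and mutation are applied. The sketch below uses a sum-of-pairs score as that fitness. The abstract does not state which scoring scheme the paper uses, so the scheme, the match/mismatch/gap values and the toy sequences are all assumptions, and the genetic operators themselves are omitted.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment given as equal-length gapped strings.

    Every pair of symbols in every column contributes to the score.
    The scoring values are illustrative assumptions.
    """
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # a pair of gaps carries no information
            if a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score


# Two candidate alignments of the same three toy sequences.
population = [
    ["MKVL", "MVL-", "MKL-"],  # gaps pushed to the end
    ["MKVL", "M-VL", "MK-L"],  # gaps placed inside
]
for candidate in population:
    print(candidate, sum_of_pairs(candidate))  # scores -3 and 0

# Selection step: keep the fittest candidate.
best = max(population, key=sum_of_pairs)
print(best)  # ['MKVL', 'M-VL', 'MK-L']
```

    In a full genetic algorithm this score would drive selection in each generation, with crossover and mutation producing the new candidates.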

  7. Evolutionary bridges to new protein folds: design of C-terminal Cro protein chameleon sequences.

    PubMed

    Anderson, William J; Van Dorn, Laura O; Ingram, Wendy M; Cordes, Matthew H J

    2011-09-01

    Regions of amino-acid sequence that are compatible with multiple folds may facilitate evolutionary transitions in protein structure. In a previous study, we described a heuristically designed chameleon sequence (SASF1, structurally ambivalent sequence fragment 1) that could adopt either of two naturally occurring conformations (α-helical or β-sheet) when incorporated as part of the C-terminal dimerization subdomain of two structurally divergent transcription factors, P22 Cro and λ Cro. Here we describe longer chameleon designs (SASF2 and SASF3) that in the case of SASF3 correspond to the full C-terminal half of the ordered region of a P22 Cro/λ Cro sequence alignment (residues 34-57). P22-SASF2 and λ(WDD)-SASF2 show moderate thermal stability in denaturation curves monitored by circular dichroism (T(m) values of 46 and 55°C, respectively), while P22-SASF3 and λ(WDD)-SASF3 have somewhat reduced stability (T(m) values of 33 and 49°C, respectively). (13)C and (1)H NMR secondary chemical shift analysis confirms two C-terminal α-helices for P22-SASF2 (residues 36-45 and 54-57) and two C-terminal β-strands for λ(WDD)-SASF2 (residues 40-45 and 50-52), corresponding to secondary structure locations in the two parent sequences. Backbone relaxation data show that both chameleon sequences have a relatively well-ordered structure. Comparisons of (15)N-(1)H correlation spectra for SASF2 and SASF3-containing proteins strongly suggest that SASF3 retains the chameleonism of SASF2. Both Cro C-terminal conformations can be encoded in a single sequence, showing the plausibility of linking different Cro folds by smooth evolutionary transitions. The N-terminal subdomain, though largely conserved in structure, also exerts an important contextual influence on the structure of the C-terminal region.

  8. 3D structures of membrane proteins from genomic sequencing

    PubMed Central

    Hopf, Thomas A.; Colwell, Lucy J.; Sheridan, Robert; Rost, Burkhard; Sander, Chris; Marks, Debora S.

    2012-01-01

    We show that amino acid co-variation in proteins, extracted from the evolutionary sequence record, can be used to fold transmembrane proteins. We use this technique to predict previously unknown, 3D structures for 11 transmembrane proteins (with up to 14 helices) from their sequences alone. The prediction method (EVfold_membrane) applies a maximum entropy approach to infer evolutionary co-variation in pairs of sequence positions within a protein family and then generates all-atom models with the derived pairwise distance constraints. We benchmark the approach with blinded, de novo computation of known transmembrane protein structures from 23 families, demonstrating unprecedented accuracy of the method for large transmembrane proteins. We show how the method can predict oligomerization, functional sites, and conformational changes in transmembrane proteins. With the rapid rise in large-scale sequencing, more accurate and more comprehensive information on evolutionary constraints can be decoded from genetic variation, greatly expanding the repertoire of transmembrane proteins amenable to modelling by this method. PMID:22579045

  9. Affinity Purification of Sequence-Specific DNA Binding Proteins

    NASA Astrophysics Data System (ADS)

    Kadonaga, James T.; Tjian, Robert

    1986-08-01

    We describe a method for affinity purification of sequence-specific DNA binding proteins that is fast and effective. Complementary chemically synthesized oligodeoxynucleotides that contain a recognition site for a sequence-specific DNA binding protein are annealed and ligated to give oligomers. This DNA is then covalently coupled to Sepharose CL-2B with cyanogen bromide to yield the affinity resin. A partially purified protein fraction is combined with competitor DNA and subsequently passed through the DNA-Sepharose resin. The desired sequence-specific DNA binding protein is purified because it preferentially binds to the recognition sites in the affinity resin rather than to the nonspecific competitor DNA in solution. For example, a protein fraction that is enriched for transcription factor Sp1 can be further purified 500- to 1000-fold by two sequential affinity chromatography steps to give Sp1 of an estimated 90% homogeneity with 30% yield. In addition, the use of tandem affinity columns containing different protein binding sites allows the simultaneous purification of multiple DNA binding proteins from the same extract. This method provides a means for the purification of rare sequence-specific DNA binding proteins, such as Sp1 and CAAT-binding transcription factor.

  10. Successful Recovery of Nuclear Protein-Coding Genes from Small Insects in Museums Using Illumina Sequencing.

    PubMed

    Kanda, Kojun; Pflug, James M; Sproul, John S; Dasenko, Mark A; Maddison, David R

    2015-01-01

    In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles

  11. Molecular sled sequences are common in mammalian proteins

    PubMed Central

    Xiong, Kan; Blainey, Paul C.

    2016-01-01

    Recent work revealed a new class of molecular machines called molecular sleds, which are small basic molecules that bind and slide along DNA with the ability to carry cargo along DNA. Here, we performed biochemical and single-molecule flow stretching assays to investigate the basis of sliding activity in molecular sleds. In particular, we identified the functional core of pVIc, the first molecular sled characterized; peptide functional groups that control sliding activity; and propose a model for the sliding activity of molecular sleds. We also observed widespread DNA binding and sliding activity among basic polypeptide sequences that implicate mammalian nuclear localization sequences and many cell penetrating peptides as molecular sleds. These basic protein motifs exhibit weak but physiologically relevant sequence-nonspecific DNA affinity. Our findings indicate that many mammalian proteins contain molecular sled sequences and suggest the possibility that substantial undiscovered sliding activity exists among nuclear mammalian proteins. PMID:26857546

  12. Molecular sled sequences are common in mammalian proteins.

    PubMed

    Xiong, Kan; Blainey, Paul C

    2016-03-18

    Recent work revealed a new class of molecular machines called molecular sleds, which are small basic molecules that bind and slide along DNA with the ability to carry cargo along DNA. Here, we performed biochemical and single-molecule flow stretching assays to investigate the basis of sliding activity in molecular sleds. In particular, we identified the functional core of pVIc, the first molecular sled characterized; peptide functional groups that control sliding activity; and propose a model for the sliding activity of molecular sleds. We also observed widespread DNA binding and sliding activity among basic polypeptide sequences that implicate mammalian nuclear localization sequences and many cell penetrating peptides as molecular sleds. These basic protein motifs exhibit weak but physiologically relevant sequence-nonspecific DNA affinity. Our findings indicate that many mammalian proteins contain molecular sled sequences and suggest the possibility that substantial undiscovered sliding activity exists among nuclear mammalian proteins. PMID:26857546

  13. Genomic 3' terminal sequence comparison of three isolates of rabbit haemorrhagic disease virus.

    PubMed

    Milton, I D; Vlasak, R; Nowotny, N; Rodak, L; Carter, M J

    1992-05-15

    Comparison of sequence data is necessary in order to investigate virus origins, identify features common to virulent strains, and characterize genomic organization within virus families. A virulent caliciviral disease of rabbits recently emerged in China. We have sequenced 1100 bases from the 3' ends of two independent European isolates of this virus, and compared these with previously determined calicivirus sequences. Rabbit caliciviruses were closely related, despite the different countries in which isolation was made. This supports the rapid spread of a new virus across Europe. The capsid protein sequences of these rabbit viruses differ markedly from those determined for feline calicivirus, but a hypothetical 3' open reading frame is relatively well conserved between the caliciviruses of these two different hosts and argues for a functional role.

  14. Nucleotide sequence of Bacillus phage Nf terminal protein gene.

    PubMed Central

    Leavitt, M C; Ito, J

    1987-01-01

    The nucleotide sequence of Bacillus phage Nf gene E has been determined. Gene E codes for phage terminal protein which is the primer necessary for the initiation of DNA replication. The deduced amino acid sequence of Nf terminal protein is approximately 66% homologous with the terminal proteins of Bacillus phages PZA and φ29, and shows similar hydropathy and secondary structure predictions. A serine which has been identified as the residue which covalently links the protein to the 5' end of the genome in φ29, is conserved in all three phages. The hydropathic and secondary structural environment of this serine is similar in these phage terminal proteins and also similar to the linking serine of adenovirus terminal protein. PMID:3601672

  15. In silico comparative analysis of DNA and amino acid sequences for prion protein gene.

    PubMed

    Kim, Y; Lee, J; Lee, C

    2008-01-01

    Genetic variability might contribute to species specificity of prion diseases in various organisms. In this study, structures of the prion protein gene (PRNP) and its amino acids were compared among species for which sequence data were available. Comparisons of PRNP DNA sequences among 12 species including human, chimpanzee, monkey, bovine, ovine, dog, mouse, rat, wallaby, opossum, chicken and zebrafish allowed us to identify candidate regulatory regions in intron 1 and the 3'-untranslated region (UTR) in addition to the coding region. Highly conserved putative binding sites for transcription factors, such as heat shock factor 2 (HSF2) and myocyte enhancer factor 2 (MEF2), were discovered in intron 1. In the 3'-UTR, the functional sequence (ATTAAA) for nucleus-specific polyadenylation was found in all the analysed species. The functional sequence (TTTTTAT) for maturation-specific polyadenylation was observed without mismatch only in ovine; the other species showed one or two nucleotide mismatches. A comparison of the amino acid sequences in 53 species revealed a large sequence identity. In particular, the octapeptide repeat region was observed in all the species except frog and zebrafish. Functional changes and susceptibility to prion diseases with various isoforms of prion protein could be caused by numeric variability and conformational changes discovered in the repeat sequences.
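
    The search for polyadenylation signals, exact or with one or two mismatches, can be sketched as a sliding-window comparison. This is an illustrative sketch only, not the tool used in the study. The function name `find_motif` is hypothetical and the 3'-UTR fragment is an invented toy string, not real PRNP sequence.

```python
def find_motif(sequence, motif, max_mismatches=0):
    """Return (position, mismatches) for each window close to the motif.

    Positions are 0-based. A window is reported when it differs from
    the motif at no more than `max_mismatches` positions.
    """
    hits = []
    width = len(motif)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Invented toy fragment for illustration.
utr = "GCCATTAAAGGCTTTTAATCC"

print(find_motif(utr, "ATTAAA"))                     # [(3, 0)]
print(find_motif(utr, "TTTTTAT", max_mismatches=2))  # [(11, 2), (12, 1)]
```

    Here the first signal is found exactly, while the second only appears with one or two mismatches, which is the kind of pattern the abstract reports for the non-ovine species.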

  16. Nucleotide sequence of the gene encoding the nitrogenase iron protein of Thiobacillus ferrooxidans

    SciTech Connect

    Pretorius, I.M.; Rawlings, D.E.; O'Neill, E.G.; Jones, W.A.; Kirby, R.; Woods, D.R.

    1987-01-01

    The DNA sequence was determined for the cloned Thiobacillus ferrooxidans nifH and part of the nifD genes. The DNA chains were radiolabeled with [α-³²P]dCTP (3000 Ci/mmol) or [α-³⁵S]dCTP (400 Ci/mmol). A putative T. ferrooxidans nifH promoter was identified whose sequences showed perfect consensus with those of the Klebsiella pneumoniae nif promoter. Two putative consensus upstream activator sequences were also identified. The amino acid sequence was deduced from the DNA sequence. In a comparison of nifH DNA sequences from T. ferrooxidans and eight other nitrogen-fixing microbes, a Rhizobium sp. isolated from Parasponia andersonii showed the greatest homology (74%) and Clostridium pasteurianum (nifH1) showed the least homology (54%). In the comparison of the amino acid sequences of the Fe proteins, the Rhizobium sp. and Rhizobium japonicum showed the greatest homology (both 86%) and C. pasteurianum (nifH1 gene product) demonstrated the least homology (56%) to the T. ferrooxidans Fe protein.
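
    Percentages such as those above are commonly computed as the share of identical positions in a pairwise alignment. The sketch below assumes that reading of "homology" (the abstract does not state the formula) and assumes gapped columns are excluded from the denominator; other conventions exist. The function name and example strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over alignment columns without gaps.

    Both inputs must be rows of the same pairwise alignment, with '-' marking
    gaps. Assumption: gapped columns are ignored in the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGT-A", "ACCTTA"))
```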

  17. Increasing Sequence Diversity with Flexible Backbone Protein Design: The Complete Redesign of a Protein Hydrophobic Core

    SciTech Connect

    Murphy, Grant S.; Mills, Jeffrey L.; Miley, Michael J.; Machius, Mischa; Szyperski, Thomas; Kuhlman, Brian

    2015-10-15

    Protein design tests our understanding of protein stability and structure. Successful design methods should allow the exploration of sequence space not found in nature. However, when redesigning naturally occurring protein structures, most fixed backbone design algorithms return amino acid sequences that share strong sequence identity with wild-type sequences, especially in the protein core. This behavior places a restriction on functional space that can be explored and is not consistent with observations from nature, where sequences of low identity have similar structures. Here, we allow backbone flexibility during design to mutate every position in the core (38 residues) of a four-helix bundle protein. Only small perturbations to the backbone, 1-2 Å, were needed to entirely mutate the core. The redesigned protein, DRNN, is exceptionally stable (melting point >140 °C). An NMR and X-ray crystal structure show that the side chains and backbone were accurately modeled (all-atom RMSD = 1.3 Å).

  18. Protein sequences bound to mineral surfaces persist into deep time

    PubMed Central

    Demarchi, Beatrice; Hall, Shaun; Roncal-Herrero, Teresa; Freeman, Colin L; Woolley, Jos; Crisp, Molly K; Wilson, Julie; Fotakis, Anna; Fischer, Roman; Kessler, Benedikt M; Rakownikow Jersie-Christensen, Rosa; Olsen, Jesper V; Haile, James; Thomas, Jessica; Marean, Curtis W; Parkington, John; Presslee, Samantha; Lee-Thorp, Julia; Ditchfield, Peter; Hamilton, Jacqueline F; Ward, Martyn W; Wang, Chunting Michelle; Shaw, Marvin D; Harrison, Terry; Domínguez-Rodrigo, Manuel; MacPhee, Ross DE; Kwekason, Amandus; Ecker, Michaela; Kolska Horwitz, Liora; Chazan, Michael; Kröger, Roland; Thomas-Oates, Jane; Harding, John H; Cappellini, Enrico; Penkman, Kirsty; Collins, Matthew J

    2016-01-01

    Proteins persist longer in the fossil record than DNA, but the longevity, survival mechanisms and substrates remain contested. Here, we demonstrate the role of mineral binding in preserving the protein sequence in ostrich (Struthionidae) eggshell, including from the palaeontological sites of Laetoli (3.8 Ma) and Olduvai Gorge (1.3 Ma) in Tanzania. By tracking protein diagenesis back in time we find consistent patterns of preservation, demonstrating authenticity of the surviving sequences. Molecular dynamics simulations of struthiocalcin-1 and -2, the dominant proteins within the eggshell, reveal that distinct domains bind to the mineral surface. It is the domain with the strongest calculated binding energy to the calcite surface that is selectively preserved. Thermal age calculations demonstrate that the Laetoli and Olduvai peptides are 50 times older than any previously authenticated sequence (equivalent to ~16 Ma at a constant 10°C). DOI: http://dx.doi.org/10.7554/eLife.17092.001 PMID:27668515

  19. High-resolution mapping of protein sequence-function relationships.

    PubMed

    Fowler, Douglas M; Araya, Carlos L; Fleishman, Sarel J; Kellogg, Elizabeth H; Stephany, Jason J; Baker, David; Fields, Stanley

    2010-09-01

    We present a large-scale approach to investigate the functional consequences of sequence variation in a protein. The approach entails the display of hundreds of thousands of protein variants, moderate selection for activity and high-throughput DNA sequencing to quantify the performance of each variant. Using this strategy, we tracked the performance of >600,000 variants of a human WW domain after three and six rounds of selection by phage display for binding to its peptide ligand. Binding properties of these variants defined a high-resolution map of mutational preference across the WW domain; each position had unique features that could not be captured by a few representative mutations. Our approach could be applied to many in vitro or in vivo protein assays, providing a general means for understanding how protein function relates to sequence.
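
    Quantifying "the performance of each variant" from sequencing counts is often done with an enrichment ratio: the variant's frequency after selection divided by its frequency in the input library. The abstract does not give the exact statistic, so the log2 ratio and the pseudocount below are assumptions, and the counts are invented.

```python
import math


def enrichment_scores(input_counts, selected_counts, pseudocount=0.5):
    """Log2 enrichment of each variant after selection relative to the input.

    Counts are read counts per variant. The pseudocount (an assumption of this
    sketch) avoids taking the log of zero for variants lost during selection.
    """
    total_in = sum(input_counts.values())
    total_sel = sum(selected_counts.values())
    scores = {}
    for variant, n_in in input_counts.items():
        n_sel = selected_counts.get(variant, 0)
        freq_in = (n_in + pseudocount) / total_in
        freq_sel = (n_sel + pseudocount) / total_sel
        scores[variant] = math.log2(freq_sel / freq_in)
    return scores


# Invented read counts for a wild type and two variants.
before = {"WT": 100, "A20G": 100, "W17F": 100}
after = {"WT": 200, "A20G": 90, "W17F": 10}
print(enrichment_scores(before, after))
```

    Positive scores indicate variants that became more frequent under selection; negative scores indicate depletion.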

  20. A Fractal Dimension and Wavelet Transform Based Method for Protein Sequence Similarity Analysis.

    PubMed

    Yang, Lina; Tang, Yuan Yan; Lu, Yang; Luo, Huiwu

    2015-01-01

    One of the key tasks related to proteins is the similarity comparison of protein sequences in the area of bioinformatics and molecular biology, which helps the prediction and classification of protein structure and function. Finding similar proteins efficiently in a large-scale protein database remains a significant open issue. This paper presents a new distance-based protein similarity analysis using a new encoding method for protein sequences that is based on fractal dimension. The protein sequences are first represented as 1-dimensional feature vectors by their biochemical quantities. A hybrid method combining the discrete wavelet transform and fractal dimension calculation (HWF) with a sliding window is then applied to form the feature vector. Finally, through the similarity calculation, we obtain the distance matrix, from which the phylogenetic tree can be constructed. We apply this approach by analyzing the ND5 (NADH dehydrogenase subunit 5) protein cluster data set. The experimental results show that the proposed model is more accurate than the existing ones such as Su's model, Zhang's model, Yao's model and MEGA software, and it is consistent with some known biological facts. PMID:26357222

  2. Using homology relations within a database markedly boosts protein sequence similarity search

    PubMed Central

    Tong, Jing; Sadreyev, Ruslan I.; Pei, Jimin; Kinch, Lisa N.; Grishin, Nick V.

    2015-01-01

    Inference of homology from protein sequences provides an essential tool for analyzing protein structure, function, and evolution. Current sequence-based homology search methods are still unable to detect many similarities evident from protein spatial structures. In computer science a search engine can be improved by considering networks of known relationships within the search database. Here, we apply this idea to protein-sequence–based homology search and show that it dramatically enhances the search accuracy. Our new method, COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query and hit’s known homologs. This approach increases detection quality, boosting the precision rate from 18% to 83% at half-coverage of all database homologs. The increased precision rate allows detection of a large fraction of protein structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities. The web server is available at prodata.swmed.edu/compadre. PMID:26038555
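
    The core idea, judging a query-hit pair by how the query also scores against the hit's known homologs, can be sketched with precomputed similarity scores. The averaging rule, the function name and the data below are all hypothetical; this is not COMPADRE's actual scoring scheme, which the abstract does not specify.

```python
def neighborhood_score(query_scores, homologs, hit):
    """Average the query's similarity to a hit and to the hit's known homologs.

    query_scores: similarity of the query to each database sequence.
    homologs:     known homologs of each database sequence.
    Sequences the query was not scored against count as 0.0.
    """
    members = [hit] + list(homologs.get(hit, []))
    return sum(query_scores.get(m, 0.0) for m in members) / len(members)


# Invented scores: the query hits A and B weakly, but A's relatives strongly.
query_scores = {"hitA": 20.0, "a1": 60.0, "a2": 70.0, "hitB": 25.0}
homologs = {"hitA": ["a1", "a2"], "hitB": ["b1"]}

for hit in ("hitA", "hitB"):
    print(hit, neighborhood_score(query_scores, homologs, hit))
```

    In this toy case hitB has the higher direct score, yet hitA ranks first once its homologs are taken into account, which is the kind of re-ranking the abstract describes.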

  3. Can natural proteins designed with 'inverted' peptide sequences adopt native-like protein folds?

    PubMed

    Sridhar, Settu; Guruprasad, Kunchur

    2014-01-01

    We have carried out a systematic computational analysis on a representative dataset of proteins of known three-dimensional structure, in order to evaluate whether it would be possible to 'swap' certain short peptide sequences in naturally occurring proteins with their corresponding 'inverted' peptides and generate 'artificial' proteins that are predicted to retain a native-like protein fold. The analysis of 3,967 representative proteins from the Protein Data Bank revealed 102,677 unique identical inverted peptide sequence pairs that vary in sequence length between 5-12 and 18 amino acid residues. Our analysis illustrates with examples that such 'artificial' proteins may be generated by identifying peptides with 'similar structural environment' and by using comparative protein modeling and validation studies. Our analysis suggests that natural proteins may be tolerant to accommodating such peptides.
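
    Finding "inverted" peptide pairs amounts to looking for a peptide whose reversed sequence also occurs somewhere in the dataset. The sketch below indexes all peptides of one fixed length and reports each pair once; the treatment of palindromic peptides (skipped here) and the toy sequences are assumptions, not details from the study.

```python
def inverted_peptide_pairs(proteins, k):
    """Return sorted (peptide, reversed_peptide) pairs of length k.

    proteins: mapping of protein name to sequence.
    A pair is reported when both a peptide and its reverse occur in the
    dataset. Palindromic peptides are skipped (an assumption of this sketch).
    """
    index = {}
    for name, seq in proteins.items():
        for start in range(len(seq) - k + 1):
            index.setdefault(seq[start:start + k], []).append((name, start))
    pairs = set()
    for peptide in index:
        reverse = peptide[::-1]
        # `peptide < reverse` reports each pair once and excludes palindromes.
        if reverse in index and peptide < reverse:
            pairs.add((peptide, reverse))
    return sorted(pairs)


# Toy sequences: P2 contains the reverse of the segment TAYIAKQR from P1.
toy = {"P1": "MKTAYIAKQR", "P2": "GGRQKAIYAT"}
print(inverted_peptide_pairs(toy, 5))
```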

  4. Predicting protein disorder by analyzing amino acid sequence

    PubMed Central

    Yang, Jack Y; Yang, Mary Qu

    2008-01-01

    Background Many protein regions and some entire proteins have no definite tertiary structure, presenting instead as dynamic, disordered ensembles under different physicochemical circumstances. These proteins and regions are known as Intrinsically Unstructured Proteins (IUP). IUP have been associated with a wide range of protein functions, along with roles in diseases characterized by protein misfolding and aggregation. Results Identifying IUP is an important task in structural and functional genomics. We extract useful features from sequences and develop machine learning algorithms for the above task. We compare our IUP predictor with PONDRs (mainly neural-network-based predictors), disEMBL (also based on neural networks) and Globplot (based on disorder propensity). Conclusion We find that augmenting features derived from physicochemical properties of amino acids (such as hydrophobicity, complexity etc.) and using an ensemble method proved beneficial. The IUP predictor is a viable alternative software tool for identifying IUP protein regions and proteins. PMID:18831799

  5. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.

    PubMed

    Horwege, Sebastian; Lindner, Sebastian; Boden, Marcus; Hatje, Klas; Kollmar, Martin; Leimeister, Chris-André; Morgenstern, Burkhard

    2014-07-01

    In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing 'don't care' or 'wildcard' symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating for each position in the first sequence the length of the longest substring at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at 'Göttingen Bioinformatics Compute Server (GOBICS)': http://spaced.gobics.de http://kmacs.gobics.de and the source codes can be downloaded.
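
    Both ideas can be sketched in a few lines. The first function counts spaced words under a pattern in which '1' marks a position that must match and '0' a don't-care position; the second computes, by brute force, the length of the longest substring starting at a given position that occurs in the other sequence with up to k mismatches. The pattern "1101", the Euclidean distance and the naive scan are assumptions chosen for illustration; the published programs use their own distance measures and faster algorithms.

```python
import math
from collections import Counter


def spaced_word_freqs(seq, pattern):
    """Relative frequencies of spaced words ('1' = match position, '0' = don't care)."""
    care = [i for i, c in enumerate(pattern) if c == "1"]
    words = Counter(
        "".join(seq[start + i] for i in care)
        for start in range(len(seq) - len(pattern) + 1)
    )
    total = sum(words.values())
    return {word: n / total for word, n in words.items()}


def spaced_word_distance(seq_a, seq_b, pattern="1101"):
    """Euclidean distance between the spaced-word frequency vectors."""
    fa = spaced_word_freqs(seq_a, pattern)
    fb = spaced_word_freqs(seq_b, pattern)
    return math.sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in set(fa) | set(fb)))


def longest_k_mismatch_match(seq_a, i, seq_b, k):
    """Longest substring of seq_a starting at i found in seq_b with <= k mismatches."""
    best = 0
    for j in range(len(seq_b)):
        length = mismatches = 0
        while i + length < len(seq_a) and j + length < len(seq_b):
            if seq_a[i + length] != seq_b[j + length]:
                mismatches += 1
                if mismatches > k:
                    break
            length += 1
        best = max(best, length)
    return best


print(spaced_word_distance("ACGTACGT", "ACGTTCGT"))
print(longest_k_mismatch_match("ACGTT", 0, "GGACGAT", 1))
```

    Either quantity, computed for every pair of input sequences, yields the kind of pairwise distance matrix the abstract mentions as input to clustering or distance-based phylogeny reconstruction.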

  6. Correlated mutations in protein sequences: Phylogenetic and structural effects

    SciTech Connect

    Lapedes, A.S.; Giraud, B.G.; Stormo, G.D.

    1998-12-01

    Covariation analysis of sets of aligned sequences for RNA molecules is relatively successful in elucidating RNA secondary structure, as well as some aspects of tertiary structure. Covariation analysis of sets of aligned sequences for protein molecules is successful in certain instances in elucidating certain structural and functional links, but in general, pairs of sites displaying highly covarying mutations in protein sequences do not necessarily correspond to sites that are spatially close in the protein structure. In this paper the authors identify two reasons why naive use of covariation analysis for protein sequences fails to reliably indicate sequence positions that are spatially proximate. The first reason involves the bias introduced in calculation of covariation measures due to the fact that biological sequences are generally related by a non-trivial phylogenetic tree. The authors present a null-model approach to solve this problem. The second reason involves linked chains of covariation which can result in pairs of sites displaying significant covariation even though they are not spatially proximate. They present a maximum entropy solution to this classic problem of causation versus correlation. The methodologies are validated in simulation.
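
    A common covariation measure for two alignment columns is their mutual information. The sketch below computes only this naive measure; it does not implement the authors' phylogenetic null model or their maximum entropy treatment of linked chains, and the toy alignment is invented.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    alignment: list of equal-length aligned sequences.
    This is the naive measure; it is biased when sequences are related by a
    phylogenetic tree, which is the first problem the abstract describes.
    """
    n = len(alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)
    counts_ij = Counter((seq[i], seq[j]) for seq in alignment)
    return sum(
        (c / n) * math.log2((c / n) / ((counts_i[a] / n) * (counts_j[b] / n)))
        for (a, b), c in counts_ij.items()
    )


# Toy alignment: columns 0 and 1 change together, column 2 is invariant.
toy_alignment = ["AKL", "AKL", "GEL", "GEL"]
print(column_mutual_information(toy_alignment, 0, 1))
print(column_mutual_information(toy_alignment, 0, 2))
```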

  7. Sequence and structural analysis of BTB domain proteins

    PubMed Central

    Stogios, Peter J; Downs, Gregory S; Jauhal, Jimmy JS; Nandra, Sukhjeen K; Privé, Gilbert G

    2005-01-01

    Background The BTB domain (also known as the POZ domain) is a versatile protein-protein interaction motif that participates in a wide range of cellular functions, including transcriptional regulation, cytoskeleton dynamics, ion channel assembly and gating, and targeting proteins for ubiquitination. Several BTB domain structures have been experimentally determined, revealing a highly conserved core structure. Results We surveyed the protein architecture, genomic distribution and sequence conservation of BTB domain proteins in 17 fully sequenced eukaryotes. The BTB domain is typically found as a single copy in proteins that contain only one or two other types of domain, and this defines the BTB-zinc finger (BTB-ZF), BTB-BACK-kelch (BBK), voltage-gated potassium channel T1 (T1-Kv), MATH-BTB, BTB-NPH3 and BTB-BACK-PHR (BBP) families of proteins, among others. In contrast, the Skp1 and ElonginC proteins consist almost exclusively of the core BTB fold. There are numerous lineage-specific expansions of BTB proteins, as seen by the relatively large number of BTB-ZF and BBK proteins in vertebrates, MATH-BTB proteins in Caenorhabditis elegans, and BTB-NPH3 proteins in Arabidopsis thaliana. Using the structural homology between Skp1 and the PLZF BTB homodimer, we present a model of a BTB-Cul3 SCF-like E3 ubiquitin ligase complex that shows that the BTB dimer or the T1 tetramer is compatible in this complex. Conclusion Despite widely divergent sequences, the BTB fold is structurally well conserved. The fold has adapted to several different modes of self-association and interactions with non-BTB proteins. PMID:16207353

  8. Single-molecule protein sequencing through fingerprinting: computational assessment

    NASA Astrophysics Data System (ADS)

    Yao, Yao; Docter, Margreet; van Ginkel, Jetty; de Ridder, Dick; Joo, Chirlmin

    2015-10-01

    Proteins are vital in all biological systems as they constitute the main structural and functional components of cells. Recent advances in mass spectrometry have brought the promise of complete proteomics by helping draft the human proteome. Yet, this commonly used protein sequencing technique has fundamental limitations in sensitivity. Here we propose a method for single-molecule (SM) protein sequencing. A major challenge lies in the fact that proteins are composed of 20 different amino acids, which demands 20 molecular reporters. We computationally demonstrate that it suffices to measure only two types of amino acids to identify proteins and suggest an experimental scheme using SM fluorescence. When achieved, this highly sensitive approach will result in a paradigm shift in proteomics, with major impact in the biological and medical sciences.
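
    The computational claim can be illustrated by reducing each protein to the ordered string of its labeled residues and checking which reduced strings are unique in a database. The choice of cysteine and lysine ("CK") as the two labeled residue types is an assumption of this sketch, as are the toy sequences; the abstract only says that two types suffice.

```python
from collections import Counter


def fingerprint(seq, labeled="CK"):
    """Keep only the labeled residue types, in sequence order."""
    return "".join(residue for residue in seq if residue in labeled)


def uniquely_identified(proteins, labeled="CK"):
    """Names of proteins whose fingerprint is shared with no other protein."""
    prints = {name: fingerprint(seq, labeled) for name, seq in proteins.items()}
    counts = Counter(prints.values())
    return sorted(name for name, fp in prints.items() if counts[fp] == 1)


# Toy database: p1 and p3 collapse to the same fingerprint, p2 is unique.
toy_db = {"p1": "MKCAK", "p2": "KACCK", "p3": "AKGCK"}
print(uniquely_identified(toy_db))
```

    Applied to a full proteome, the fraction of uniquely identified proteins gives a rough sense of how much identifying power a two-residue readout retains.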

  9. Internal organization of large protein families: relationship between the sequence, structure and function based clustering

    PubMed Central

    Cai, Xiao-hui; Jaroszewski, Lukasz; Wooley, John; Godzik, Adam

    2011-01-01

    The protein universe can be organized in families that group proteins sharing common ancestry. Such families display variable levels of structural and functional divergence, from homogenous families, where all members have the same function and very similar structure, to very divergent families, where large variations in function and structure are observed. For practical purposes of structure and function prediction, it would be beneficial to identify sub-groups of proteins with highly similar structures (iso-structural) and/or functions (iso-functional) within divergent protein families. We compared three algorithms in their ability to cluster large protein families and discuss whether any of these methods could reliably identify such iso-structural or iso-functional groups. We show that clustering using profile-sequence and profile-profile comparison methods closely reproduces clusters based on similarities between 3D structures or clusters of proteins with similar biological functions. In contrast, the still commonly used sequence-based methods with fixed thresholds result in vast overestimates of structural and functional diversity in protein families. As a result, these methods also overestimate the number of protein structures that have to be determined to fully characterize structural space of such families. The fact that one can build reliable models based on apparently distantly related templates is crucial for extracting maximal amount of information from new sequencing projects. PMID:21671455

  10. Comprehensive analysis of sequences of a protein switch.

    PubMed

    Chen, Szu-Hua; Meller, Jaroslaw; Elber, Ron

    2016-01-01

    Switches form a special class of proteins that dramatically change their three-dimensional structures upon a small perturbation. One possible perturbation that we explore is that of a single point mutation. Building on the pioneering experimental work of Alexander et al. (Alexander et al. PNAS, 2007; 104,11963-11968) that determines switch sequences between α and α+β folds we conduct a comprehensive sequence sampling by a Markov Chain with multiple fitness criteria to identify new switches given the experimental folds. We screen for switch sequences using a combination of contact potential, secondary structure prediction, and finally molecular dynamics simulations. Statistical properties of switch sequences are discussed and illustrated to be most sensitive to mutation at the N- and C- termini of the switch protein. Based on this analysis, a particularly stable putative switch pair is identified and proposed for further experimental analysis. PMID:26073558

  11. Structure and Sequence Search on Aptamer-Protein Docking

    NASA Astrophysics Data System (ADS)

    Xiao, Jiajie; Bonin, Keith; Guthold, Martin; Salsbury, Freddie

    2015-03-01

    Interactions between proteins and deoxyribonucleic acid (DNA) play a significant role in living systems, especially through gene regulation. However, short nucleic acid sequences (aptamers) with specific binding affinity to specific proteins exhibit clinical potential as therapeutics. Our capillary and gel electrophoresis selection experiments show that specific sequences of aptamers can be selected that bind specific proteins. Computationally, given the experimentally-determined structure and sequence of a thrombin-binding aptamer, we can successfully dock the aptamer onto thrombin in agreement with experimental structures of the complex. In order to further study the conformational flexibility of this thrombin-binding aptamer and to potentially develop a predictive computational model of aptamer-binding, we use GPU-enabled molecular dynamics simulations to both examine the conformational flexibility of the aptamer in the absence of binding to thrombin, and to determine our ability to fold an aptamer. This study should help further de-novo predictions of aptamer sequences by enabling the study of structural and sequence-dependent effects on aptamer-protein docking specificity.

  12. Bioinformatics comparison of sulfate-reducing metabolism nucleotide sequences

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Nguyen, A.; Cheung, E.; Sullivan, R.; Holden, T.; Lieberman, D.; Cheung, T.

    2015-09-01

    The sulfate-reducing bacteria can be traced back to 3.5 billion years ago. The thermodynamic details of the sulfur cycle have been well documented. A recent sulfate-reducing bacteria report (Robator, Jungbluth, et al., 2015 Jan, Front. Microbiol) with GenBank nucleotide data has been analyzed in terms of the sulfite reductase (dsrAB) via fractal dimension and entropy values. Comparison to oil field sulfate-reducing sequences was included. The AUCG translational mass fractal dimension versus ATCG transcriptional mass fractal dimension for the low temperature dsrB and dsrA sequences reported in Reference Thirteen shows correlation R-sq ~ 0.79, with a probability of about 3% in simulation. A recent report of using Cystathionine gamma-lyase sequence to produce CdS quantum dot in a biological method, where the sulfur is reduced just like in the H2S production process, was included for comparison. The AUCG mass fractal dimension versus ATCG mass fractal dimension for the Cystathionine gamma-lyase sequences was found to have R-sq of 0.72, similar to the low temperature dissimilatory sulfite reductase dsr group with 3% probability, in contrast to the oil field group having R-sq ~ 0.94, a highly probable outcome in the simulation. The other two simulation histograms, namely, fractal dimension versus entropy R-sq outcome values, and di-nucleotide entropy versus mono-nucleotide entropy R-sq outcome values are also discussed in the data analysis focusing on low probability outcomes.

  13. Extracting protein alignment models from the sequence database.

    PubMed Central

    Neuwald, A F; Liu, J S; Lipman, D J; Lawrence, C E

    1997-01-01

    Biologists often gain structural and functional insights into a protein sequence by constructing a multiple alignment model of the family. Here a program called Probe fully automates this process of model construction starting from a single sequence. Central to this program is a powerful new method to locate and align only those, often subtly, conserved patterns essential to the family as a whole. When applied to randomly chosen proteins, Probe found on average about four times as many relationships as a pairwise search and yielded many new discoveries. These include: an obscure subfamily of globins in the roundworm Caenorhabditis elegans; two new superfamilies of metallohydrolases; a lipoyl/biotin swinging arm domain in bacterial membrane fusion proteins; and a DH domain in the yeast Bud3 and Fus2 proteins. By identifying distant relationships and merging families into superfamilies in this way, this analysis further confirms the notion that proteins evolved from relatively few ancient sequences. Moreover, this method automatically generates models of these ancient conserved regions for rapid and sensitive screening of sequences. PMID:9108146

  14. Determinants of the rate of protein sequence evolution

    PubMed Central

    Zhang, Jianzhi; Yang, Jian-Rong

    2015-01-01

    The rate and mechanism of protein sequence evolution have been central questions in evolutionary biology since the 1960s. Although the rate of protein sequence evolution depends primarily on the level of functional constraint, exactly what constitutes functional constraint has remained unclear. The increasing availability of genomic data has allowed for much needed empirical examinations on the nature of functional constraint. These studies found that the evolutionary rate of a protein is predominantly influenced by its expression level rather than functional importance. A combination of theoretical and empirical analyses has identified multiple mechanisms behind these observations and demonstrated a prominent role that selection against errors in molecular and cellular processes plays in protein evolution. PMID:26055156

  15. Purification and sequencing of the active site tryptic peptide from penicillin-binding protein 1b of Escherichia coli

    SciTech Connect

    Nicholas, R.A.; Suzuki, H.; Hirota, Y.; Strominger, J.L.

    1985-07-02

    This paper reports the sequence of the active site peptide of penicillin-binding protein 1b from Escherichia coli. Purified penicillin-binding protein 1b was labeled with [¹⁴C]penicillin G, digested with trypsin, and partially purified by gel filtration. Upon further purification by high-pressure liquid chromatography, two radioactive peaks were observed, and the major peak, representing over 75% of the applied radioactivity, was submitted to amino acid analysis and sequencing. The sequence Ser-Ile-Gly-Ser-Leu-Ala-Lys was obtained. The active site nucleophile was identified by digesting the purified peptide with aminopeptidase M and separating the radioactive products on high-pressure liquid chromatography. Amino acid analysis confirmed that the serine residue in the middle of the sequence was covalently bonded to the [¹⁴C]penicilloyl moiety. A comparison of this sequence to active site sequences of other penicillin-binding proteins and beta-lactamases is presented.

  16. Co-evolution of metabolism and protein sequences.

    PubMed

    Schütte, Moritz; Klitgord, Niels; Segrè, Daniel; Ebenhöh, Oliver

    2010-01-01

    The set of chemicals producible and usable by metabolic pathways must have evolved in parallel with the enzymes that catalyze them. One implication of this common historical path should be a correspondence between the innovation steps that gradually added new metabolic reactions to the biosphere-level biochemical toolkit, and the gradual sequence changes that must have slowly shaped the corresponding enzyme structures. However, global signatures of a long-term co-evolution have not been identified. Here we search for such signatures by computing correlations between inter-reaction distances on a metabolic network, and sequence distances of the corresponding enzyme proteins. We perform our calculations using the set of all known metabolic reactions, available from the KEGG database. Reaction-reaction distance on the metabolic network is computed as the length of the shortest path on a projection of the metabolic network, in which nodes are reactions and edges indicate whether two reactions share a common metabolite, after removal of cofactors. Estimating the distance between enzyme sequences in a meaningful way requires some special care: for each enzyme commission (EC) number, we select from KEGG a consensus set of protein sequences using the cluster of orthologous groups of proteins (COG) database. We define the evolutionary distance between protein sequences as an asymmetric transition probability between two enzymes, derived from the corresponding pair-wise BLAST scores. By comparing the distances between sequences to the minimal distances on the metabolic reaction graph, we find a small but statistically significant correlation between the two measures. This suggests that the evolutionary walk in enzyme sequence space has locally mirrored, to some extent, the gradual expansion of metabolism. PMID:20238426
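
    The reaction-reaction distance described above (shortest path on a graph whose nodes are reactions and whose edges link reactions sharing a metabolite, after cofactor removal) can be sketched with a breadth-first search. The three toy reactions and the single cofactor are invented for illustration and are not drawn from KEGG.

```python
from collections import deque


def reaction_graph(reactions, cofactors=frozenset()):
    """Link two reactions if they share a metabolite that is not a cofactor.

    reactions: mapping of reaction name to the set of its metabolites.
    """
    edges = {name: set() for name in reactions}
    names = list(reactions)
    for idx, r1 in enumerate(names):
        for r2 in names[idx + 1:]:
            if (reactions[r1] & reactions[r2]) - cofactors:
                edges[r1].add(r2)
                edges[r2].add(r1)
    return edges


def shortest_path_length(edges, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for neighbor in edges[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return None


# Toy reactions: R1 and R3 share only the cofactor ATP.
toy = {
    "R1": {"glucose", "ATP", "G6P"},
    "R2": {"G6P", "F6P"},
    "R3": {"F6P", "ATP", "FBP"},
}
print(shortest_path_length(reaction_graph(toy, {"ATP"}), "R1", "R3"))
```

    Without cofactor removal R1 and R3 would be direct neighbors through ATP, which shows why the abstract removes cofactors before measuring distances.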

  17. EST2Prot: Mapping EST sequences to proteins

    PubMed Central

    Shafer, Paul; Lin, David M; Yona, Golan

    2006-01-01

    Background EST libraries are used in various biological studies, from microarray experiments to proteomic and genetic screens. These libraries usually contain many uncharacterized ESTs that are typically ignored since they cannot be mapped to known genes. Consequently, new discoveries are possibly overlooked. Results We describe a system (EST2Prot) that uses multiple elements to map EST sequences to their corresponding protein products. EST2Prot uses UniGene clusters, substring analysis, information about protein coding regions in existing DNA sequences and protein database searches to detect protein products related to a query EST sequence. Gene Ontology terms, Swiss-Prot keywords, and protein similarity data are used to map the ESTs to functional descriptors. Conclusion EST2Prot extends and significantly enriches the popular UniGene mapping by utilizing multiple relations between known biological entities. It produces a mapping between ESTs and proteins in real-time through a simple web-interface. The system is part of the Biozon database and is accessible at . PMID:16515706

  18. Cloning and sequence analysis of the major outer membrane protein genes of two Chlamydia psittaci strains.

    PubMed

    Zhang, Y X; Morrison, S G; Caldwell, H D; Baehr, W

    1989-05-01

    We cloned and sequenced the gene encoding the major outer membrane protein (MOMP) of two Chlamydia psittaci strains, guinea pig inclusion conjunctivitis (GPIC) strain 1, and meningopneumonitis (Mn) strain Cal-10. Intraspecies alignment of the two C. psittaci MOMP genes revealed 80.6% similarity, and interspecies comparison of C. trachomatis and C. psittaci MOMP genes yielded about 68% similarity. As found previously for C. trachomatis MOMP sequences, stretches of predominantly conserved sequences of GPIC and Mn MOMPs were interrupted by four variable domains whose locations were identical to those of C. trachomatis MOMPs. Seven of eight cysteine residues were found at precisely the same positions in GPIC, Mn, and C. trachomatis MOMPs, emphasizing their importance in structure and function of the protein. Collectively, these results indicate that C. psittaci and C. trachomatis MOMP genes diverged from a common ancestor.

  19. Identification of staphylococcal species based on variations in protein sequences (mass spectrometry) and DNA sequence (sodA microarray).

    PubMed

    Kooken, Jennifer; Fox, Karen; Fox, Alvin; Altomare, Diego; Creek, Kim; Wunschel, David; Pajares-Merino, Sara; Martínez-Ballesteros, Ilargi; Garaizar, Javier; Oyarzabal, Omar; Samadpour, Mansour

    2014-02-01

    This report is among the first using sequence variation in newly discovered protein markers for staphylococcal (or indeed any other bacterial) speciation. Variation, at the DNA sequence level, in the sodA gene (commonly used for staphylococcal speciation) provided excellent correlation. Relatedness among strains was also assessed by protein profiling using microcapillary electrophoresis and pulsed-field electrophoresis. A total of 64 strains were analyzed, including reference strains representing the 11 staphylococcal species most commonly isolated from man (Staphylococcus aureus and 10 coagulase-negative species [CoNS]). Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI TOF MS) and liquid chromatography-electrospray ionization tandem mass spectrometry (LC ESI MS/MS) were used for peptide analysis of proteins isolated from gel bands. Comparison of experimental spectra of unknowns versus spectra of peptides derived from reference strains allowed bacterial identification after MALDI TOF MS analysis. After LC-MS/MS analysis of gel bands, bacterial speciation was performed by comparing experimental spectra versus virtual spectra using the software X!Tandem. Finally, LC-MS/MS was performed on whole proteomes, with data analysis also employing X!Tandem. Aconitate hydratase and oxoglutarate dehydrogenase served as marker proteins on focused analysis after gel separation. Alternatively, on full proteomics analysis, elongation factor Tu generally provided the highest confidence in staphylococcal speciation.

  20. ANTHEPROT: an integrated protein sequence analysis software with client/server capabilities.

    PubMed

    Deléage, G; Combet, C; Blanchet, C; Geourjon, C

    2001-07-01

    Programs devoted to the analysis of protein sequences exist either as stand-alone programs or as Web servers. However, stand-alone programs can hardly accommodate analyses that involve comparisons against databanks, which require regular updates. Moreover, Web servers cannot be as efficient as stand-alone programs when dealing with real-time graphic display. We describe here a stand-alone software program called ANTHEPROT, which is intended to perform protein sequence analysis with a high integration level and client/server capabilities. It is an interactive program with a graphical user interface that allows convenient handling of protein sequences and data. It provides many methods and tools, which are integrated into the graphical user interface. ANTHEPROT is available for Windows-based systems. It is able to connect to a Web server in order to perform large-scale sequence comparison on up-to-date databanks. ANTHEPROT is freely available to academic users and may be downloaded at http://pbil.ibcp.fr/ANTHEPROT.

  1. Rapid Evolution of Virus Sequences in Intrinsically Disordered Protein Regions

    PubMed Central

    Gitlin, Leonid; Hagai, Tzachi; LaBarbera, Anthony; Solovey, Mark; Andino, Raul

    2014-01-01

    Nodamura Virus (NoV) is a nodavirus originally isolated from insects that can replicate in a wide variety of hosts, including mammals. Because of their simplicity and ability to replicate in many diverse hosts, NoV, and the Nodaviridae in general, provide a unique window into the evolution of viruses and host-virus interactions. Here we show that the C-terminus of the viral polymerase exhibits extreme structural and evolutionary flexibility. Indeed, fewer than 10 positively charged residues from the 110 amino acid-long C-terminal region of protein A are required to support RNA1 replication. Strikingly, this region can be replaced by completely unrelated protein sequences, yet still produce a functional replicase. Structure predictions, as well as evolutionary and mutational analyses, indicate that the C-terminal region is structurally disordered and evolves faster than the rest of the viral proteome. Thus, the function of an intrinsically unstructured protein region can be independent of most of its primary sequence, conferring both functional robustness and sequence plasticity on the protein. Our results provide an experimental explanation for rapid evolution of unstructured regions, which enables an effective exploration of the sequence space, and likely function space, available to the virus. PMID:25502394

  2. Rapid evolution of virus sequences in intrinsically disordered protein regions.

    PubMed

    Gitlin, Leonid; Hagai, Tzachi; LaBarbera, Anthony; Solovey, Mark; Andino, Raul

    2014-12-01

    Nodamura Virus (NoV) is a nodavirus originally isolated from insects that can replicate in a wide variety of hosts, including mammals. Because of their simplicity and ability to replicate in many diverse hosts, NoV, and the Nodaviridae in general, provide a unique window into the evolution of viruses and host-virus interactions. Here we show that the C-terminus of the viral polymerase exhibits extreme structural and evolutionary flexibility. Indeed, fewer than 10 positively charged residues from the 110 amino acid-long C-terminal region of protein A are required to support RNA1 replication. Strikingly, this region can be replaced by completely unrelated protein sequences, yet still produce a functional replicase. Structure predictions, as well as evolutionary and mutational analyses, indicate that the C-terminal region is structurally disordered and evolves faster than the rest of the viral proteome. Thus, the function of an intrinsically unstructured protein region can be independent of most of its primary sequence, conferring both functional robustness and sequence plasticity on the protein. Our results provide an experimental explanation for rapid evolution of unstructured regions, which enables an effective exploration of the sequence space, and likely function space, available to the virus. PMID:25502394

  3. Educational Software for the Analysis of DNA and Protein Sequences.

    ERIC Educational Resources Information Center

    Maloy, Stanley; Olson, Sue

    1989-01-01

    Describes the development of the microcomputer-based educational software, DNAzoom, which was designed to introduce undergraduates in molecular biology to computer analysis of DNA and protein sequences. Highlights include graphical presentation of data, the functional use of color, a menu-oriented interface, and students' evaluations of the software.…

  4. VISTAS: a package for VIsualizing STructures and sequences of proteins.

    PubMed

    Perkins, D N; Attwood, T K

    1995-02-01

    VISTAS is a suite of programs for protein sequence and structure analysis. The system allows the simultaneous display, in separate windows, of multiple sequence alignments, of known or model 3D structures, and of 2D graphic representations of sequence and/or alignment properties. The displays are fully integrated, and therefore manipulations in one window can be reflected in each of the others. Beyond its display facilities, VISTAS brings together a number of existing tools under a single, user-friendly umbrella: these include a fully functional interactive color alignment procedure, conserved motif selection, a range of database-scanning routines, and interactive access to the OWL composite sequence database and to the PRINTS protein fingerprint database. Exploration of the sequence database is thus straightforward, and predefined structural motifs from the fingerprint database may be readily visualized. Of particular note is the ability to calculate conservation criteria from sequence alignments and to display the information in a 3D context: this renders VISTAS a powerful tool for aiding mutagenesis studies and for facilitating refinement of molecular models.

  5. Comparison of DNA Quantification Methods for Next Generation Sequencing

    PubMed Central

    Robin, Jérôme D.; Ludlow, Andrew T.; LaRanger, Ryan; Wright, Woodring E.; Shay, Jerry W.

    2016-01-01

    Next Generation Sequencing (NGS) is a powerful tool that depends on loading a precise amount of DNA onto a flowcell. NGS strategies have expanded our ability to investigate genomic phenomena by referencing mutations in cancer and diseases through large-scale genotyping, developing methods to map rare chromatin interactions (4C; 5C and Hi-C) and identifying chromatin features associated with regulatory elements (ChIP-seq, Bis-Seq, ChiA-PET). While many methods are available for DNA library quantification, there is no unambiguous gold standard. Most techniques use PCR to amplify DNA libraries to obtain sufficient quantities for optical density measurement. However, increased PCR cycles can distort the library’s heterogeneity and prevent the detection of rare variants. In this analysis, we compared new digital PCR technologies (droplet digital PCR; ddPCR, ddPCR-Tail) with standard methods for the titration of NGS libraries. DdPCR-Tail is comparable to qPCR and fluorometry (QuBit) and allows sensitive quantification by analysis of barcode repartition after sequencing of multiplexed samples. This study provides a direct comparison between quantification methods throughout a complete sequencing experiment and provides the impetus to use ddPCR-based quantification for improvement of NGS quality. PMID:27048884

  6. Biophysical and structural considerations for protein sequence evolution

    PubMed Central

    2011-01-01

    Background Protein sequence evolution is constrained by the biophysics of folding and function, causing interdependence between interacting sites in the sequence. However, current site-independent models of sequence evolution do not take this into account. Recent attempts to integrate the influence of structure and biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected improvements in model performance. This suggests that further innovations are needed for progress in this field. Results Here we develop a coarse-grained physics-based model of protein folding and binding function, and compare it to a popular informational model. We find that both models violate the assumption of the native sequence being close to a thermodynamic optimum, causing directional selection away from the native state. Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that vary less among residue types. The informational model diffuses further in sequence space with fewer barriers and tends to provide less support for an invariant sites model, although amino acid substitutions are generally conservative. Both approaches produce sequences with natural features like dN/dS < 1 and gamma-distributed rates across sites. Conclusions Simple coarse-grained models of protein folding can describe some natural features of evolving proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding energy, and binding function, as regards both native and non-native conformations, and describe a large number of possible applications for such a model. PMID:22171550

  7. nWayComp: a genome-wide sequence comparison tool for multiple strains/species of phylogenetically related microorganisms.

    PubMed

    Yao, Jiqiang; Lin, Hong; Doddapaneni, Harshavardhan; Civerolo, Edwin L

    2007-01-01

    The increasing number of whole genomic sequences of microorganisms has led to the complexity of genome-wide annotation and gene sequence comparison among multiple microorganisms. To address this problem, we have developed nWayComp software that compares DNA and protein sequences of phylogenetically-related microorganisms. This package integrates a series of bioinformatics tools such as BLAST, ClustalW, ALIGN, PHYLIP and PRIMER3 for sequence comparison. It searches for homologous sequences among multiple organisms and identifies genes that are unique to a particular organism. The homologous gene sets are then ranked in the descending order of the sequence similarity. For each set of homologous sequences, a table of sequence identity among homologous genes along with sequence variations such as SNPs and INDELS is developed, and a phylogenetic tree is constructed. In addition, a common set of primers that can amplify all the homologous sequences are generated. The nWayComp package provides users with a quick and convenient tool to compare genomic sequences among multiple organisms at the whole-genome level. PMID:17688445

  8. Sequence heterogeneity accelerates protein search for targets on DNA

    NASA Astrophysics Data System (ADS)

    Shvets, Alexey A.; Kolomeisky, Anatoly B.

    2015-12-01

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome.

  9. Sequence heterogeneity accelerates protein search for targets on DNA

    SciTech Connect

    Shvets, Alexey A.; Kolomeisky, Anatoly B.

    2015-12-28

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity, and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry, and heterogeneity of a genome.

  10. Cladistic analysis of iridoviruses based on protein and DNA sequences.

    PubMed

    Wang, J W; Deng, R Q; Wang, X Z; Huang, Y S; Xing, K; Feng, J H; He, J G; Long, Q X

    2003-11-01

    Cladograms of iridoviruses were inferred from bootstrap analysis of molecular data sets comprising all published protein and DNA sequences of the major capsid protein, ATPase and DNA polymerase genes of members of the Iridoviridae family Iridovirus. All data sets yielded cladograms supporting the separation of the Iridovirus, Ranavirus and Lymphocystivirus genera, and the cladogram based on data derived from major capsid proteins further divided both the Iridovirus and Ranavirus genera into two groups. Tests of alternative hypotheses of topological constraints were also performed to further investigate relationships between infectious spleen and kidney necrosis virus (ISKNV), an unclassified fish iridovirus for which the complete genome sequence data is available, and other iridoviruses. Cladograms inferred and results of Shimodaira-Hasegawa tests indicated that ISKNV is more closely related to the Ranavirus genus than it is to the other genera of the family.

  11. CRYSTALP2: sequence-based protein crystallization propensity prediction

    PubMed Central

    Kurgan, Lukasz; Razib, Ali A; Aghakhani, Sara; Dick, Scott; Mizianty, Marcin; Jahandideh, Samad

    2009-01-01

    Background Current protocols yield crystals for <30% of known proteins, indicating that automatically identifying crystallizable proteins may improve high-throughput structural genomics efforts. We introduce CRYSTALP2, a kernel-based method that predicts the propensity of a given protein sequence to produce diffraction-quality crystals. This method utilizes the composition and collocation of amino acids, isoelectric point, and hydrophobicity, as estimated from the primary sequence, to generate predictions. CRYSTALP2 extends its predecessor, CRYSTALP, by enabling predictions for sequences of unrestricted size and provides improved prediction quality. Results A significant majority of the collocations used by CRYSTALP2 include residues with high conformational entropy, or low entropy and high potential to mediate crystal contacts; notably, such residues are utilized by surface entropy reduction methods. We show that the collocations provide complementary information to the hydrophobicity and isoelectric point. Tests on four datasets show that CRYSTALP2 outperforms several existing sequence-based predictors (CRYSTALP, OB-score, and SECRET). CRYSTALP2's accuracy, MCC, and AROC range between 69.3 and 77.5%, 0.39 and 0.55, and 0.72 and 0.79, respectively. Our predictions are similar in quality and are complementary to the predictions of the most recent ParCrys and XtalPred methods. Our results also suggest that, as work in protein crystallization continues (thereby enlarging the population of proteins with known crystallization propensities), the prediction quality of the CRYSTALP2 method should increase. The prediction model and the datasets used in this contribution can be downloaded from . Conclusion CRYSTALP2 provides relatively accurate crystallization propensity predictions for a given protein chain that either outperform or complement the existing approaches. 
The proposed method can be used to support current efforts towards improving the success rate in obtaining

  12. A novel randomized iterative strategy for aligning multiple protein sequences.

    PubMed

    Berger, M P; Munson, P J

    1991-10-01

    The rigorous alignment of multiple protein sequences becomes impractical even with a modest number of sequences, since computer memory and time requirements increase as the product of the lengths of the sequences. We have devised a strategy to approach such an optimal alignment, which modifies the intensive computer storage and time requirements of dynamic programming. Our algorithm randomly divides a group of unaligned sequences into two subgroups, between which an optimal alignment is then obtained by a Needleman-Wunsch style of algorithm. Our algorithm uses a matrix with dimensions corresponding to the lengths of the two aligned sequence subgroups. The pairwise alignment process is repeated using different random divisions of the whole group into two subgroups. Compared with the rigorous approach of solving the n-dimensional lattice by dynamic programming, our iterative algorithm results in alignments that match or are close to the optimal solution, on a limited set of test problems. We have implemented this algorithm in a computer program that runs on the IBM PC class of machines, together with a user-friendly environment for interactively selecting sequences or groups of sequences to be aligned either simultaneously or progressively.
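
    The Needleman-Wunsch step mentioned in the abstract above can be illustrated with a minimal pairwise sketch. This is not the authors' program: their method aligns two subgroups of sequences and repeats the process over random divisions, whereas the code below aligns only two single sequences. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences by dynamic programming.

    Scoring values are illustrative assumptions (linear gap penalty).
    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for the residue pair ending at (i, j)
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

    The matrix has one dimension per sequence, so memory and time grow with the product of the two lengths; this is the growth that makes the rigorous n-sequence version impractical. With the default scores, needleman_wunsch("ACGT", "AGT") returns (1, "ACGT", "A-GT").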

  13. Molecular evolution of streptococcal M protein: cloning and nucleotide sequence of the type 24 M protein gene and relation to other genes of Streptococcus pyogenes.

    PubMed Central

    Mouw, A R; Beachey, E H; Burdett, V

    1988-01-01

    The structural gene for the type 24 M protein of group A streptococci has been cloned and expressed in Escherichia coli. The complete nucleotide sequence of the gene and the 3' and 5' flanking regions was determined. The sequence includes an open reading frame of 1,617 base pairs encoding a pre-M24 protein of 539 amino acids and a predicted Mr of 58,738. The structural gene contains two distinct tandemly reiterated elements. The first repeated element consists of 5.3 units, and the second contains 2.7 units. Each element shows little variation of the basic 35-amino-acid unit. Comparison of the sequence of the M24 protein with the sequence of the M6 protein (S. K. Hollingshead, V. A. Fischetti, and J. R. Scott, J. Biol. Chem. 261:1677-1686, 1986) indicates that these molecules are conserved except in the regions coding for the antigenic (type-specific) determinant and that they have three regions of homology within the structural genes: 38 of 42 amino acids within the amino-terminal signal sequence; the second repeated element of the M24 protein, which is found in the M6 molecule at the same position in the protein; and the carboxy-terminal 164 amino acids, including a membrane anchor sequence, which are conserved in both proteins. In addition, the sequences flanking the two genes are strongly conserved. PMID:3276665

  14. Protein landscape at Drosophila melanogaster telomere-associated sequence repeats.

    PubMed

    Antão, José M; Mason, James M; Déjardin, Jérôme; Kingston, Robert E

    2012-06-01

    The specific set of proteins bound at each genomic locus contributes decisively to regulatory processes and to the identity of a cell. Understanding of the function of a particular locus requires the knowledge of what factors interact with that locus and how the protein composition changes in different cell types or during the response to internal and external signals. Proteomic analysis of isolated chromatin segments (PICh) was developed as a tool to target, purify, and identify proteins associated with a defined locus and was shown to allow the purification of human telomeric chromatin. Here we have developed this method to identify proteins that interact with the Drosophila telomere-associated sequence (TAS) repeats. Several of the purified factors were validated as novel TAS-bound proteins by chromatin immunoprecipitation, and the Brahma complex was confirmed as a dominant modifier of telomeric position effect through the use of a genetic test. These results offer information on the efficacy of applying the PICh protocol to loci with sequence more complex than that found at human telomeres and identify proteins that bind to the TAS repeats, which might contribute to TAS biology and chromatin silencing. PMID:22493064

  15. Sequence Recognition of DNA by Protein-Induced Conformational Transitions

    SciTech Connect

    Watkins, Derrick; Mohan, Srividya; Koudelka, Gerald B.; Williams, Loren Dean

    2010-11-09

    The binding of proteins to specific sequences of DNA is an important feature of virtually all DNA transactions. Proteins recognize specific DNA sequences using both direct readout (sensing types and positions of DNA functional groups) and indirect readout (sensing DNA conformation and deformability). Previously we showed that the P22 c2 repressor N-terminal domain (P22R NTD) forces the central non-contacted 5′-ATAT-3′ sequence of the DNA operator into the B′ state, a state known to affect DNA hydration, rigidity and bending. Usually the B′ state, with a narrow minor groove and a spine of hydration, is reserved for A-tract DNA (TpA steps disrupt A-tracts). Here, we have co-crystallized P22R NTD with an operator containing a central 5′-ACGT-3′ sequence in the non-contacted region. C·G base pairs have not previously been observed in the B′ state and are thought to prevent it. However, P22R NTD induces a narrow minor groove and a spine of hydration to 5′-ACGT-3′. We observe that C·G base pairs have distinctive destabilizing and disordering effects on the spine of hydration. It appears that the reduced stability of the spine results in a higher energy cost for the B to B′ transition. The differential effect of DNA sequence on the barrier to this transition allows the protein to sense the non-contacted DNA sequence.

  16. Vesicular stomatitis virus NS proteins: structural similarity without extensive sequence homology.

    PubMed Central

    Gill, D S; Banerjee, A K

    1985-01-01

    The complete nucleotide sequence of the NS mRNA of vesicular stomatitis virus (New Jersey serotype) was established from two cDNA clones spanning the entire coding region of the mRNA. The gene is 856 nucleotides long and can code for a polypeptide of 274 amino acids. Comparison with the nucleotide sequence of the NS gene of the Indiana serotype revealed only 41% sequence homology. The deduced amino acid sequences of the NS proteins were only 32% homologous, with no identical stretches of more than five amino acids. However, at the C-terminal domain there was a conserved region of 21 amino acids with greater than 90% homology. Surprisingly, relative hydropathicity plots also demonstrated the presence of a large number of hydrophilic amino acids sequestered similarly over the N-terminal half of the protein. In addition, the total number of serine and threonine residues, presumptive phosphorylation sites, was similar and included seven serine and three threonine residues located at identical positions. It appears that during divergent evolution of these two vesicular stomatitis virus serotypes from a common ancestor, considerable mutation occurred in the main body of the gene but the overall structure of the protein was retained. The function of the NS protein in relation to the evolution of the two viruses is discussed. PMID:2989560
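
    Two of the measures quoted in the abstract above, the percentage of identical positions and the longest stretch of identical residues, can be computed from a pairwise alignment as in the sketch below. The function names are hypothetical, and the choice of denominator (columns where neither sequence has a gap) is an assumption; the record does not state how its percentages were calculated.

```python
def percent_identity(aln_a, aln_b):
    """Percent of gap-free alignment columns holding identical residues.

    Assumption: columns with a gap ('-') in either sequence are excluded
    from the denominator. Other conventions exist.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


def longest_identical_run(aln_a, aln_b):
    """Length of the longest run of consecutive identical, non-gap columns."""
    best = run = 0
    for x, y in zip(aln_a, aln_b):
        if x == y and x != "-":
            run += 1
            best = max(best, run)
        else:
            run = 0  # a mismatch or a gap ends the current run
    return best
```

    For the toy alignment "MKT-LV" against "MRTALV", percent_identity gives 80.0 (four identical residues in five gap-free columns) and longest_identical_run gives 2.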

  17. Sequence analysis and structural implications of rotavirus capsid proteins.

    PubMed

    Parbhoo, N; Dewar, J B; Gildenhuys, S

    2016-01-01

    Rotavirus is the major cause of severe virus-associated gastroenteritis worldwide in children aged 5 and younger. Many children lose their lives annually due to this infection and the impact is particularly pronounced in developing countries. The mature rotavirus is a non-enveloped triple-layered nucleocapsid containing 11 double stranded RNA segments. Here a global view on the sequence and structure of the three main capsid proteins, VP2, VP6 and VP7 is shown by generating a consensus sequence for each of these rotavirus proteins, for each species obtained from published data of representative rotavirus genotypes from across the world and across species. Degree of conservation between species was represented on homology models for each of the proteins. VP7 shows the highest level of variation with 14-45 amino acids showing conservation of less than 60%. These changes are localised to the outer surface alluding to a possible mechanism in evading the immune system. The middle layer, VP6 shows lower variability with only 14-32 sites having lower than 70% conservation. The inner structural layer made up of VP2 showed the lowest variability with only 1-16 sites having less than 70% conservation across species. The results correlate with each protein's multiple structural roles in the infection cycle. Thus, although the nucleotide sequences vary due to the error-prone nature of replication and lack of proof reading, the corresponding amino acid sequences of VP2, VP6 and VP7 remain relatively conserved. Benefits of this knowledge about the conservation include the ability to target proteins at sites that cannot undergo mutational changes without influencing viral fitness, as well as the possibility of studying systems that are highly evolved for structure and function in order to determine how to generate and manipulate such systems for use in various biotechnological applications. PMID:27640436
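
    A consensus sequence and per-site conservation of the kind described above can be derived from a multiple alignment roughly as follows. This is a sketch under stated assumptions, not the authors' pipeline: conservation is taken to be the fraction of sequences sharing the most common symbol in a column, gaps are counted like any other symbol, and the function names and default threshold are illustrative.

```python
from collections import Counter


def column_conservation(alignment):
    """Return a list of (consensus_symbol, fraction) for each alignment column.

    Assumption: 'fraction' is the share of sequences carrying the most
    common symbol in that column; gaps count as ordinary symbols.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    result = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / len(column)))
    return result


def consensus(alignment):
    """Consensus sequence: the most common symbol at every column."""
    return "".join(symbol for symbol, _ in column_conservation(alignment))


def variable_sites(alignment, threshold=0.7):
    """1-based positions whose conservation falls below the threshold."""
    return [
        pos + 1
        for pos, (_, fraction) in enumerate(column_conservation(alignment))
        if fraction < threshold
    ]
```

    For the toy alignment ["MKV", "MKV", "MRV", "MQL"], consensus returns "MKV" and variable_sites with the default threshold returns [2], since only the second column has its most common residue in fewer than 70% of the sequences.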

  18. FAB overlapping: a strategy for sequencing homologous proteins

    NASA Astrophysics Data System (ADS)

    Ferranti, P.; Malorni, A.; Marino, G.; Pucci, P.; di Luccia, A.; Ferrara, L.

    1991-12-01

    Extensive similarity has been shown to exist between the primary structures of closely related proteins from different species, the only differences being restricted to a few amino acid variations. A new mass spectrometric procedure, which has been called FAB-overlapping, has been developed for sequencing highly homologous proteins based on the detection of these small differences as compared with a known protein used as a reference. Several complementary peptide maps are constructed using fast atom bombardment mass spectrometry (FAB-MS) analysis of different proteolytic digests of the unknown protein and the mass values are related to those expected on the basis of the sequence of the reference protein. The mass signals exhibiting unusual mass values identify those regions where variations have taken place; fine location of the mutations can be obtained by coupling simple protein chemistry methodologies with FAB-MS. Using the FAB-overlapping procedure, it was possible to determine the sequence of α1, α3 and β globins from water buffalo (Bubalus bubalis) hemoglobins (phenotype AA). Two amino acid substitutions were detected in the buffalo β chain (Lys16 → His and Asn118 → His), whereas the α1 and α3 chains were found to contain four amino acid replacements, three of which were identical (Glu23 → Asp, Glu71 → Gly, Phe117 → Cys), and the insertion of an alanine residue in position 124. The only differences between α1 and α3 globins were identified in the C-terminal region; α1 contains a Phe residue at position 130 whereas α3 shows serine at position 132.
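
    The comparison at the heart of the procedure described above, relating observed peptide masses to those expected from a reference sequence, might be sketched as follows. Several things here are assumptions rather than details from the record: trypsin-like cleavage after K or R (the abstract says only "different proteolytic digests"), neutral monoisotopic masses rather than the protonated ions a FAB spectrum would show, the 0.5 Da tolerance, and all function names.

```python
# Monoisotopic residue masses in daltons (standard values)
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per free peptide (N-terminal H, C-terminal OH)


def tryptic_peptides(seq):
    """Cleave after every K or R (assumed rule; proline exception ignored)."""
    peptides, start = [], 0
    for i, residue in enumerate(seq):
        if residue in "KR":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(MONO[residue] for residue in peptide) + WATER


def compare_to_reference(reference_seq, observed_masses, tol=0.5):
    """Flag where the observed peptide map departs from the reference.

    Returns (missing, unexplained):
      missing     - reference peptides with no observed mass within tol
      unexplained - observed masses matching no reference peptide
    """
    expected = [(p, peptide_mass(p)) for p in tryptic_peptides(reference_seq)]
    missing = [
        (p, m) for p, m in expected
        if all(abs(m - obs) > tol for obs in observed_masses)
    ]
    unexplained = [
        obs for obs in observed_masses
        if all(abs(obs - m) > tol for _, m in expected)
    ]
    return missing, unexplained
```

    With a hypothetical reference "GAKSEFR" and masses observed from a variant "GAKSDFR", the peptide SEFR is reported as missing and one unexplained mass appears about 14.016 Da lower, which is the shift expected for a Glu to Asp replacement. A real analysis would combine several digests so that the peptide maps overlap, as the abstract describes.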

  19. PROFESS: a PROtein function, evolution, structure and sequence database.

    PubMed

    Triplet, Thomas; Shortridge, Matthew D; Griep, Mark A; Stark, Jaime L; Powers, Robert; Revesz, Peter

    2010-07-06

    The proliferation of biological databases and the easy access enabled by the Internet is having a beneficial impact on biological sciences and transforming the way research is conducted. There are approximately 1100 molecular biology databases dispersed throughout the Internet. To assist in the functional, structural and evolutionary analysis of the abundant number of novel proteins continually identified from whole-genome sequencing, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships. A fundamental component of this approach is the development of an intuitive query system that incorporates a variety of similarity functions capable of generating data relationships not conceived during the creation of the database. The utility of PROFESS is demonstrated by the analysis of the structural drift of homologous proteins and the identification of potential pancreatic cancer therapeutic targets based on the observation of protein-protein interaction networks. Database URL: http://cse.unl.edu/~profess/

  20. Computational identification of MoRFs in protein sequences

    PubMed Central

    Malhis, Nawar; Gsponer, Jörg

    2015-01-01

    Motivation: Intrinsically disordered regions of proteins play an essential role in the regulation of various biological processes. Key to their regulatory function is the binding of molecular recognition features (MoRFs) to globular protein domains in a process known as a disorder-to-order transition. Predicting the location of MoRFs in protein sequences with high accuracy remains an important computational challenge. Method: In this study, we introduce MoRFCHiBi, a new computational approach for fast and accurate prediction of MoRFs in protein sequences. MoRFCHiBi combines the outcomes of two support vector machine (SVM) models that take advantage of two different kernels with high noise tolerance. The first, SVMS, is designed to extract maximal information from the general contrast in amino acid compositions between MoRFs, their surrounding regions (Flanks), and the remainders of the sequences. The second, SVMT, is used to identify similarities between regions in a query sequence and MoRFs of the training set. Results: We evaluated the performance of our predictor by comparing its results with those of two currently available MoRF predictors, MoRFpred and ANCHOR. Using three test sets that have previously been collected and used to evaluate MoRFpred and ANCHOR, we demonstrate that MoRFCHiBi outperforms the other predictors with respect to different evaluation metrics. In addition, MoRFCHiBi is downloadable and fast, which makes it useful as a component in other computational prediction tools. Availability and implementation: http://www.chibi.ubc.ca/morf/. Contact: gsponer@chibi.ubc.ca. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25637562

  2. Successful Recovery of Nuclear Protein-Coding Genes from Small Insects in Museums Using Illumina Sequencing.

    PubMed

    Kanda, Kojun; Pflug, James M; Sproul, John S; Dasenko, Mark A; Maddison, David R

    2015-01-01

    In this paper we explore high-throughput Illumina sequencing of nuclear protein-coding, ribosomal, and mitochondrial genes in small, dried insects stored in natural history collections. We sequenced one tenebrionid beetle and 12 carabid beetles ranging in size from 3.7 to 9.7 mm in length that have been stored in various museums for 4 to 84 years. Although we chose a number of old, small specimens for which we expected low sequence recovery, we successfully recovered at least some low-copy nuclear protein-coding genes from all specimens. For example, in one 56-year-old beetle, 4.4 mm in length, our de novo assembly recovered about 63% of approximately 41,900 nucleotides in a target suite of 67 nuclear protein-coding gene fragments, and 70% using a reference-based assembly. Even in the least successfully sequenced carabid specimen, reference-based assembly yielded fragments that were at least 50% of the target length for 34 of 67 nuclear protein-coding gene fragments. Exploration of alternative references for reference-based assembly revealed few signs of bias created by the reference. For all specimens we recovered almost complete copies of ribosomal and mitochondrial genes. We verified the general accuracy of the sequences through comparisons with sequences obtained from PCR and Sanger sequencing, including of conspecific, fresh specimens, and through phylogenetic analysis that tested the placement of sequences in predicted regions. A few possible inaccuracies in the sequences were detected, but these rarely affected the phylogenetic placement of the samples. Although our sample sizes are low, an exploratory regression study suggests that the dominant factor in predicting success at recovering nuclear protein-coding genes is a high number of Illumina reads, with success at PCR of COI and killing by immersion in ethanol being secondary factors; in analyses of only high-read samples, the primary significant explanatory variable was body length, with small beetles

  3. Functional analysis of bipartite begomovirus coat protein promoter sequences

    SciTech Connect

    Lacatus, Gabriela; Sunter, Garry

    2008-06-20

    We demonstrate that the AL2 gene of Cabbage leaf curl virus (CaLCuV) activates the CP promoter in mesophyll and acts to derepress the promoter in vascular tissue, similar to that observed for Tomato golden mosaic virus (TGMV). Binding studies indicate that sequences mediating repression and activation of the TGMV and CaLCuV CP promoter specifically bind different nuclear factors common to Nicotiana benthamiana, spinach and tomato. However, chromatin immunoprecipitation demonstrates that TGMV AL2 can interact with both sequences independently. Binding of nuclear protein(s) from different crop species to viral sequences conserved in both bipartite and monopartite begomoviruses, including TGMV, CaLCuV, Pepper golden mosaic virus and Tomato yellow leaf curl virus suggests that bipartite begomoviruses bind common host factors to regulate the CP promoter. This is consistent with a model in which AL2 interacts with different components of the cellular transcription machinery that bind viral sequences important for repression and activation of begomovirus CP promoters.

  4. Quantifying sequence and structural features of protein-RNA interactions.

    PubMed

    Li, Songling; Yamashita, Kazuo; Amada, Karlou Mar; Standley, Daron M

    2014-09-01

    Increasing awareness of the importance of protein-RNA interactions has motivated many approaches to predict residue-level RNA binding sites in proteins based on sequence or structural characteristics. Sequence-based predictors are usually high in sensitivity but low in specificity; conversely structure-based predictors tend to have high specificity, but lower sensitivity. Here we quantified the contribution of both sequence- and structure-based features as indicators of RNA-binding propensity using a machine-learning approach. In order to capture structural information for proteins without a known structure, we used homology modeling to extract the relevant structural features. Several novel and modified features enhanced the accuracy of residue-level RNA-binding propensity beyond what has been reported previously, including by meta-prediction servers. These features include: hidden Markov model-based evolutionary conservation, surface deformations based on the Laplacian norm formalism, and relative solvent accessibility partitioned into backbone and side chain contributions. We constructed a web server called aaRNA that implements the proposed method and demonstrate its use in identifying putative RNA binding sites. PMID:25063293

  5. Substrate-Driven Mapping of the Degradome by Comparison of Sequence Logos

    PubMed Central

    Fuchs, Julian E.; von Grafenstein, Susanne; Huber, Roland G.; Kramer, Christian; Liedl, Klaus R.

    2013-01-01

    Sequence logos are frequently used to illustrate substrate preferences and specificity of proteases. Here, we employed the compiled substrates of the MEROPS database to introduce a novel metric for comparison of protease substrate preferences. The constructed similarity matrix of 62 proteases can be used to intuitively visualize similarities in protease substrate readout via principal component analysis and construction of protease specificity trees. Since our new metric is solely based on substrate data, we can engraft the protease tree including proteolytic enzymes of different evolutionary origin. Thereby, our analyses confirm pronounced overlaps in substrate recognition not only between proteases closely related on a sequence basis but also between proteolytic enzymes of different evolutionary origin and catalytic type. To illustrate the applicability of our approach we analyze the distribution of targets of small molecules from the ChEMBL database in our substrate-based protease specificity trees. We observe a striking clustering of annotated targets in tree branches even though these grouped targets do not necessarily share similarity at the protein sequence level. This highlights the value and applicability of knowledge acquired from peptide substrates in drug design of small molecules, e.g., for the prediction of off-target effects or drug repurposing. Consequently, our similarity metric allows mapping of the degradome and its associated drug target network via comparison of known substrate peptides. The substrate-driven view of protein-protein interfaces is not limited to the field of proteases but can be applied to any target class where a sufficient amount of known substrate data is available. PMID:24244149
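
    For readers unfamiliar with sequence logos, the sketch below shows how the height of each logo column is conventionally computed: the maximum information for a 20-letter alphabet minus the Shannon entropy of the column, here without any small-sample correction. It illustrates the logo itself and not the comparison metric introduced in the paper, which the abstract does not specify. The function name and the four peptides are invented for illustration.

```python
import math
from collections import Counter


def column_information(peptides, alphabet_size=20):
    """Information content (bits) at each position of equal-length peptides."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("peptides must have equal length")
    max_bits = math.log2(alphabet_size)
    info = []
    for column in zip(*peptides):
        total = len(column)
        # Shannon entropy of the residues observed in this column.
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(column).values())
        info.append(max_bits - entropy)
    return info


# Invented substrate peptides, not taken from MEROPS.
substrates = ["AKRG", "AKLG", "ARRG", "AKKG"]
bits = column_information(substrates)
```

    Fully conserved positions reach the maximum of log2(20) bits, while variable positions score lower, which is what makes logos a compact summary of substrate preference.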

  6. Isolation and characterization of adrenoleukodystrophy protein (ALDP) related sequences in the human genome

    SciTech Connect

    Geraghty, M.T.; Stetten, G.; Kearns, W.

    1994-09-01

    X-linked adrenoleukodystrophy (ALD) is a disorder of peroxisomal β-oxidation of very long chain fatty acids. It presents either as progressive dementia in childhood or as progressive paraparesis in later years. Adrenal insufficiency occurs in both phenotypes. The gene of the ALD protein has been mapped to Xq28 and has recently been cloned and characterized. The ALD protein has significant homology to the peroxisomal membrane protein, PMP70, and belongs to the ATP binding cassette superfamily of transporters. We screened a human genomic library with an ALDP cDNA and isolated 5 different but highly similar clones containing sequences corresponding to the 3′ end of the ALDP gene. Comparison of the sequences over the region corresponding to exon 9 through the 3′ end of the ALDP gene reveals ~96% nucleotide identity in both exonic and intronic regions. Splice sites and open reading frames are maintained. Using both FISH and human-rodent DNA mapping panels, we positively assign these ALDP-related sequences to chromosomes 2, 16 and 22, and provisionally to 1 and 20. Southern blot of primate DNA probed with a partial ALDP cDNA (exon 2-10) shows that expansion of ALDP-related sequences occurred in higher primates (chimp, gorilla and human). Although Northern blots show multiple ALDP-hybridizing transcripts in certain tissues, we have no evidence to date for expression of these ALDP-related sequences. In conclusion, our data show there has been an unusual and recent dispersal to multiple chromosomes of structural gene sequences related to the ALDP gene. The functional significance of these sequences remains to be determined but their existence complicates PCR and mutation analysis of the ALDP gene.

  7. Properties of Sequence Conservation in Upstream Regulatory and Protein Coding Sequences among Paralogs in Arabidopsis thaliana

    NASA Astrophysics Data System (ADS)

    Richardson, Dale N.; Wiehe, Thomas

    Whole genome duplication (WGD) has catalyzed the formation of new species, genes with novel functions, altered expression patterns, complexified signaling pathways and has provided organisms a level of genetic robustness. We studied the long-term evolution and interrelationships of 5’ upstream regulatory sequences (URSs), protein coding sequences (CDSs) and expression correlations (EC) of duplicated gene pairs in Arabidopsis. Three distinct methods revealed significant evolutionary conservation between paralogous URSs and were highly correlated with microarray-based expression correlation of the respective gene pairs. Positional information on exact matches between sequences unveiled the contribution of micro-chromosomal rearrangements on expression divergence. A three-way rank analysis of URS similarity, CDS divergence and EC uncovered specific gene functional biases. Transcription factor activity was associated with gene pairs exhibiting conserved URSs and divergent CDSs, whereas a broad array of metabolic enzymes was found to be associated with gene pairs showing diverged URSs but conserved CDSs.

  8. Primary sequence analysis of Clostridium cellulovorans cellulose binding protein A.

    PubMed Central

    Shoseyov, O; Takagi, M; Goldstein, M A; Doi, R H

    1992-01-01

    The cbpA gene for the Clostridium cellulovorans cellulose binding protein (CbpA), which is part of the multisubunit cellulase complex, has been cloned and sequenced. When cbpA was expressed in Escherichia coli, proteins capable of binding to crystalline cellulose and of interacting with anti-CbpA were observed. The cbpA gene consists of 5544 base pairs and encodes a protein containing 1848 amino acids with a molecular mass of 189,036 Da. The open reading frame is preceded by a Gram-positive-type ribosome binding site. A signal peptide sequence of 28 amino acids is present at its N terminus. The encoded protein is highly hydrophobic with extremely high levels of threonine and valine residues. There are two types of putative cellulose binding domains of approximately 100 amino acids that are slightly hydrophilic and eight conserved, highly hydrophobic beta-sheet regions of approximately 140 amino acids. These latter hydrophobic regions may be the CbpA domains that interact with the different enzymatic subunits of the cellulase complex. PMID:1565642
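
    The figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below assumes that the 5544 base pairs cover sense codons only, since the abstract does not say whether a stop codon is counted. The variable names are illustrative.

```python
# Values quoted in the abstract.
orf_length_bp = 5544          # length of the cbpA coding sequence
protein_length_aa = 1848      # residues in the encoded protein
molecular_mass_da = 189_036   # reported molecular mass in daltons

# Three nucleotides encode one amino acid.
codons, remainder = divmod(orf_length_bp, 3)

# Average mass per residue implied by the reported totals.
mean_residue_mass_da = molecular_mass_da / protein_length_aa

print(codons, remainder)
print(round(mean_residue_mass_da, 1))
```

    The codon count matches the stated protein length exactly, and the implied mean residue mass is a little over 102 Da.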

  10. Evaluation of intra- and interspecific divergence of satellite DNA sequences by nucleotide frequency calculation and pairwise sequence comparison.

    PubMed

    Kato, Mikio

    2003-01-01

    Satellite DNA sequences are known to be highly variable and to have been subjected to concerted evolution that homogenizes member sequences within species. We have analyzed the mode of evolution of satellite DNA sequences in four fishes from the genus Diplodus by calculating the nucleotide frequency of the sequence array and the phylogenetic distances between member sequences. Calculation of nucleotide frequency and pairwise sequence comparison enabled us to characterize the divergence among member sequences in this satellite DNA family. The results suggest that the evolutionary rate of satellite DNA in D. bellottii is about two-fold greater than the average of the other three fishes, and that the sequence homogenization event occurred in D. puntazzo more recently than in the others. The procedures described here are effective for characterizing the mode of evolution of satellite DNA. PMID:12734555
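
    The two calculations named in this abstract, nucleotide frequency along the array and pairwise comparison of member sequences, can be sketched as follows. This is a minimal illustration under stated assumptions: the sequences are already aligned to equal length, and the plain proportion of differing sites (p-distance) stands in for the phylogenetic distances used in the paper, whose exact model the abstract does not give. The function names and toy monomers are invented.

```python
from collections import Counter
from itertools import combinations


def column_frequencies(aligned):
    """Per-column nucleotide frequencies for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    freqs = []
    for column in zip(*aligned):
        total = len(column)
        freqs.append({base: n / total for base, n in Counter(column).items()})
    return freqs


def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


# Toy monomers invented for illustration; not data from the paper.
monomers = ["ACGTAC", "ACGTTC", "ACCTAC"]
freqs = column_frequencies(monomers)
distances = {(i, j): p_distance(monomers[i], monomers[j])
             for i, j in combinations(range(len(monomers)), 2)}
```

    Columns where one base dominates indicate homogenized positions, while the spread of pairwise distances gives a rough picture of divergence among member sequences.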

  11. Identification of Sequences Encoding Symbiodinium minutum Mitochondrial Proteins

    PubMed Central

    Butterfield, Erin R.; Howe, Christopher J.; Nisbet, R. Ellen R.

    2016-01-01

    The dinoflagellates are an extremely diverse group of algae closely related to the Apicomplexa and the ciliates. Much work has previously been undertaken to determine the presence of various biochemical pathways within dinoflagellate mitochondria. However, these studies were unable to identify several key transcripts including those encoding proteins involved in the pyruvate dehydrogenase complex, iron–sulfur cluster biosynthesis, and protein import. Here, we analyze the draft nuclear genome of the dinoflagellate Symbiodinium minutum, as well as RNAseq data to identify nuclear genes encoding mitochondrial proteins. The results confirm the presence of a complete tricarboxylic acid cycle in the dinoflagellates. Results also demonstrate the difficulties in using the genome sequence for the identification of genes due to the large number of introns, but show that it is highly useful for the determination of gene duplication events. PMID:26798115

  12. Identification of Sequences Encoding Symbiodinium minutum Mitochondrial Proteins.

    PubMed

    Butterfield, Erin R; Howe, Christopher J; Nisbet, R Ellen R

    2016-01-21

    The dinoflagellates are an extremely diverse group of algae closely related to the Apicomplexa and the ciliates. Much work has previously been undertaken to determine the presence of various biochemical pathways within dinoflagellate mitochondria. However, these studies were unable to identify several key transcripts including those encoding proteins involved in the pyruvate dehydrogenase complex, iron-sulfur cluster biosynthesis, and protein import. Here, we analyze the draft nuclear genome of the dinoflagellate Symbiodinium minutum, as well as RNAseq data to identify nuclear genes encoding mitochondrial proteins. The results confirm the presence of a complete tricarboxylic acid cycle in the dinoflagellates. Results also demonstrate the difficulties in using the genome sequence for the identification of genes due to the large number of introns, but show that it is highly useful for the determination of gene duplication events.

  13. DNA topology confers sequence specificity to nonspecific architectural proteins.

    PubMed

    Wei, Juan; Czapla, Luke; Grosner, Michael A; Swigon, David; Olson, Wilma K

    2014-11-25

    Topological constraints placed on short fragments of DNA change the disorder found in chain molecules randomly decorated by nonspecific, architectural proteins into tightly organized 3D structures. The bacterial heat-unstable (HU) protein builds up, counter to expectations, in greater quantities and at particular sites along simulated DNA minicircles and loops. Moreover, the placement of HU along loops with the "wild-type" spacing found in the Escherichia coli lactose (lac) and galactose (gal) operons precludes access to key recognition elements on DNA. The HU protein introduces a unique spatial pathway in the DNA upon closure. The many ways in which the protein induces nearly the same closed circular configuration point to the statistical advantage of its nonspecificity. The rotational settings imposed on DNA by the repressor proteins, by contrast, introduce sequential specificity in HU placement, with the nonspecific protein accumulating at particular loci on the constrained duplex. Thus, an architectural protein with no discernible DNA sequence-recognizing features becomes site-specific and potentially assumes a functional role upon loop formation. The locations of HU on the closed DNA reflect long-range mechanical correlations. The protein responds to DNA shape and deformability—the stiff, naturally straight double-helical structure—rather than to the unique features of the constituent base pairs. The structures of the simulated loops suggest that HU architecture, like nucleosomal architecture, which modulates the ability of regulatory proteins to recognize their binding sites in the context of chromatin, may influence repressor-operator interactions in the context of the bacterial nucleoid. PMID:25385626

  14. DNA topology confers sequence specificity to nonspecific architectural proteins

    PubMed Central

    Wei, Juan; Czapla, Luke; Grosner, Michael A.; Swigon, David; Olson, Wilma K.

    2014-01-01

    Topological constraints placed on short fragments of DNA change the disorder found in chain molecules randomly decorated by nonspecific, architectural proteins into tightly organized 3D structures. The bacterial heat-unstable (HU) protein builds up, counter to expectations, in greater quantities and at particular sites along simulated DNA minicircles and loops. Moreover, the placement of HU along loops with the “wild-type” spacing found in the Escherichia coli lactose (lac) and galactose (gal) operons precludes access to key recognition elements on DNA. The HU protein introduces a unique spatial pathway in the DNA upon closure. The many ways in which the protein induces nearly the same closed circular configuration point to the statistical advantage of its nonspecificity. The rotational settings imposed on DNA by the repressor proteins, by contrast, introduce sequential specificity in HU placement, with the nonspecific protein accumulating at particular loci on the constrained duplex. Thus, an architectural protein with no discernible DNA sequence-recognizing features becomes site-specific and potentially assumes a functional role upon loop formation. The locations of HU on the closed DNA reflect long-range mechanical correlations. The protein responds to DNA shape and deformability—the stiff, naturally straight double-helical structure—rather than to the unique features of the constituent base pairs. The structures of the simulated loops suggest that HU architecture, like nucleosomal architecture, which modulates the ability of regulatory proteins to recognize their binding sites in the context of chromatin, may influence repressor–operator interactions in the context of the bacterial nucleoid. PMID:25385626

  15. Comparison of the Folding Mechanism of Highly Homologous Proteins in the Lipid-binding Protein Family

    EPA Science Inventory

    The folding mechanisms of two closely related proteins in the intracellular lipid binding protein family, human bile acid binding protein (hBABP) and rat bile acid binding protein (rBABP), were examined. These proteins are 77% identical (93% similar) in sequence. Both of these singl...

  16. Phosphatidylinositol transfer proteins: sequence motifs in structural and evolutionary analyses

    PubMed Central

    Wyckoff, Gerald J.; Solidar, Ada; Yoden, Marilyn D.

    2016-01-01

    Phosphatidylinositol transfer proteins (PITP) are a family of monomeric proteins that bind and transfer phosphatidylinositol and phosphatidylcholine between membrane compartments. They are required for production of inositol and diacylglycerol second messengers, and are found in most metazoan organisms. While PITPs are known to carry out crucial cell-signaling roles in many organisms, the structure, function and evolution of the majority of family members remain unexplored, primarily because the ubiquity and diversity of the family thwarts traditional methods of global alignment. To surmount this obstacle, we instead took a novel approach, using MEME and a parsimony-based analysis to create a cladogram of conserved sequence motifs in 56 PITP family proteins from 26 species. In keeping with previous functional annotations, three clades were supported within our evolutionary analysis: two classes of soluble proteins and a class of membrane-associated proteins. By focusing on conserved regions, the analysis allowed for in-depth queries regarding possible functional roles of PITP proteins in both intra- and extracellular signaling. PMID:27429707

  17. Size and sequence and the volume change of protein folding.

    PubMed

    Rouget, Jean-Baptiste; Aksel, Tural; Roche, Julien; Saldana, Jean-Louis; Garcia, Angel E; Barrick, Doug; Royer, Catherine A

    2011-04-20

    The application of hydrostatic pressure generally leads to protein unfolding, implying, in accordance with Le Chatelier's principle, that the unfolded state has a smaller molar volume than the folded state. However, the origin of the volume change upon unfolding, ΔV(u), has yet to be determined. We have examined systematically the effects of protein size and sequence on the value of ΔV(u) using as a model system a series of deletion variants of the ankyrin repeat domain of the Notch receptor. The results provide strong evidence in support of the notion that the major contributing factor to pressure effects on proteins is their imperfect internal packing in the folded state. These packing defects appear to be specifically localized in the 3D structure, in contrast to the uniformly distributed effects of temperature and denaturants that depend upon hydration of exposed surface area upon unfolding. Given its local nature, the extent to which pressure globally affects protein structure can inform on the degree of cooperativity and long-range coupling intrinsic to the folded state. We also show that the energetics of the protein's conformations can significantly modulate their volumetric properties, providing further insight into protein stability. PMID:21446709

  18. Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure.

    PubMed

    Capra, John A; Laskowski, Roman A; Thornton, Janet M; Singh, Mona; Funkhouser, Thomas A

    2009-12-01

    Identifying a protein's functional sites is an important step towards characterizing its molecular function. Numerous structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone. We also demonstrate that ConCavity has state-of-the-art performance in predicting catalytic sites and drug binding pockets. Overall, the algorithms and analysis presented here significantly improve our ability to identify ligand binding sites and further advance our understanding of the relationship between evolutionary sequence conservation and structural and functional attributes of proteins. Data, source code, and prediction visualizations are available on the ConCavity web site (http://compbio.cs.princeton.edu/concavity/).

  19. Sequence comparisons in the aminoacyl-tRNA synthetases with emphasis on regions of likely homology with sequences in the Rossmann fold in the methionyl and tyrosyl enzymes.

    PubMed

    Walker, E J; Jeffrey, P D

    1988-02-01

    Amino acid sequences of aminoacyl-tRNA synthetases specific for 12 different amino acids have now been published. Differences in origin at the species and organelle level result in 20 distinct sequences being available for comparison. Some of these were compared in small groups as they were determined and, although some homologies were detected, it was generally concluded that there was surprisingly little sequence homology in this functionally related group of enzymes. We have made comparisons of all of the available sequences by using a combination of computer and manual alignment methods and knowledge of the sequences in the Rossmann fold region of methionyl-tRNA synthetase from E. coli and tyrosyl-tRNA synthetase from B. stearothermophilus, enzymes whose three-dimensional structures have been described. It emerges that all of the aminoacyl-tRNA synthetase sequences thus examined show considerable homology with each other over at least parts of this region, some over virtually all of it. We conclude that a great deal more similarity than had previously been suspected exists in these proteins. In particular, the alignments we have made strongly imply the existence of a mononucleotide binding site of the Rossmann fold configuration in all of the synthetases compared. PMID:3283733

  20. Prediction of neddylation sites from protein sequences and sequence-derived properties

    PubMed Central

    2015-01-01

    Background: Neddylation is a reversible post-translational modification that plays a vital role in maintaining cellular machinery. It is shown to affect localization, binding partners and structure of target proteins. Disruption of protein neddylation was observed in various diseases such as Alzheimer's and cancer. Therefore, understanding the neddylation mechanism and determining neddylation targets is potentially of great importance for further understanding cellular processes. This study is the first attempt to predict neddylated sites from protein sequences by using several sequence and sequence-based structural features. Results: We have developed a neddylation site prediction method using a support vector machine based on various sequence properties, position-specific scoring matrices, and disorder. Using 21 amino acid long lysine-centred windows, our model was able to predict neddylation sites successfully, with an average 5-fold stratified cross-validation performance of 0.91, 0.91, 0.75, 0.44, 0.95 for accuracy, specificity, sensitivity, Matthews correlation coefficient and area under curve, respectively. Independent test set results validated the robustness of the reported method. Additionally, we observed that neddylation sites are commonly flexible and that there is a significant presence of positively charged amino acids in neddylation sites. Conclusions: In this study, a neddylation site prediction method was developed for the first time in the literature. Common characteristics of neddylation sites and their discriminative properties were explored for further in silico studies on neddylation. Lastly, an up-to-date neddylation dataset was provided for researchers working on post-translational modifications in the accompanying supplementary material of this article. PMID:26679222
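
    Two ingredients of this abstract lend themselves to a short illustration: cutting 21-residue windows centred on each lysine, and computing the Matthews correlation coefficient from confusion-matrix counts. The sketch below is not the authors' pipeline. Padding the termini with "X" is an assumption (the abstract does not say how windows near the ends were handled), and the function names and example sequence are invented.

```python
import math


def lysine_windows(sequence, flank=10, pad="X"):
    """Return (1-based position, window) for each lysine; windows span 2*flank+1 residues."""
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of the sequence sits at index i + flank of the padded
            # string, so the window centred on it starts at index i.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented sequence, used only to show the window layout.
windows = lysine_windows("MKTAYIAKQR")
```

    Each window could then be encoded as features for a classifier. The Matthews coefficient is useful here because modified lysines are far rarer than unmodified ones, so accuracy alone can look high even when sensitivity is modest.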

  1. Identification of Ciliary Localization Sequences within the Third Intracellular Loop of G Protein-coupled Receptors

    PubMed Central

    Berbari, Nicolas F.; Johnson, Andrew D.; Lewis, Jacqueline S.; Askwith, Candice C.

    2008-01-01

    Primary cilia are sensory organelles present on most mammalian cells. The functions of cilia are defined by the signaling proteins localized to the ciliary membrane. Certain G protein–coupled receptors (GPCRs), including somatostatin receptor 3 (Sstr3) and serotonin receptor 6 (Htr6), localize to cilia. As Sstr3 and Htr6 are the only somatostatin and serotonin receptor subtypes that localize to cilia, we hypothesized they contain ciliary localization sequences. To test this hypothesis we expressed chimeric receptors containing fragments of Sstr3 and Htr6 in the nonciliary receptors Sstr5 and Htr7, respectively, in ciliated cells. We found the third intracellular loop of Sstr3 or Htr6 is sufficient for ciliary localization. Comparison of these loops revealed a loose consensus sequence. To determine whether this consensus sequence predicts ciliary localization of other GPCRs, we compared it with the third intracellular loop of all human GPCRs. We identified the consensus sequence in melanin-concentrating hormone receptor 1 (Mchr1) and confirmed Mchr1 localizes to primary cilia in vitro and in vivo. Thus, we have identified a putative GPCR ciliary localization sequence and used this sequence to identify a novel ciliary GPCR. As Mchr1 mediates feeding behavior and metabolism, our results implicate ciliary signaling in the regulation of body weight. PMID:18256283

  2. Sequence Heterogeneity Accelerates Protein Search for Targets on DNA

    NASA Astrophysics Data System (ADS)

    Shvets, Alexey; Kolomeisky, Anatoly

    The process of protein search for specific binding sites on DNA is fundamentally important since it marks the beginning of all major biological processes. We present a theoretical investigation that probes the role of DNA sequence symmetry, heterogeneity and chemical composition in the protein search dynamics. Using a discrete-state stochastic approach with a first-passage events analysis, which takes into account the most relevant physical-chemical processes, a full analytical description of the search dynamics is obtained. It is found that, contrary to existing views, the protein search is generally faster on DNA with more heterogeneous sequences. In addition, the search dynamics might be affected by the chemical composition near the target site. The physical origins of these phenomena are discussed. Our results suggest that biological processes might be effectively regulated by modifying chemical composition, symmetry and heterogeneity of a genome. The work was supported by the Welch Foundation (Grant C-1559), by the NSF (Grant CHE-1360979), and by the Center for Theoretical Biological Physics sponsored by the NSF (Grant PHY-1427654).

  3. A minimal sequence code for switching protein structure and function.

    PubMed

    Alexander, Patrick A; He, Yanan; Chen, Yihong; Orban, John; Bryan, Philip N

    2009-12-15

    We present here a structural and mechanistic description of how a protein changes its fold and function, mutation by mutation. Our approach was to create 2 proteins that (i) are stably folded into 2 different folds, (ii) have 2 different functions, and (iii) are very similar in sequence. In this simplified sequence space we explore the mutational path from one fold to another. We show that an IgG-binding, 4beta+alpha fold can be transformed into an albumin-binding, 3-alpha fold via a mutational pathway in which neither function nor native structure is completely lost. The stabilities of all mutants along the pathway are evaluated, key high-resolution structures are determined by NMR, and an explanation of the switching mechanism is provided. We show that the conformational switch from 4beta+alpha to 3-alpha structure can occur via a single amino acid substitution. On one side of the switch point, the 4beta+alpha fold is >90% populated (pH 7.2, 20 degrees C). A single mutation switches the conformation to the 3-alpha fold, which is >90% populated (pH 7.2, 20 degrees C). We further show that a bifunctional protein exists at the switch point with affinity for both IgG and albumin. PMID:19923431

  4. Complete genome sequence of the hyperthermophilic archaeon Thermococcus kodakaraensis KOD1 and comparison with Pyrococcus genomes

    PubMed Central

    Fukui, Toshiaki; Atomi, Haruyuki; Kanai, Tamotsu; Matsumi, Rie; Fujiwara, Shinsuke; Imanaka, Tadayuki

    2005-01-01

    The genus Thermococcus, comprised of sulfur-reducing hyperthermophilic archaea, belongs to the order Thermococcales in Euryarchaeota along with the closely related genus Pyrococcus. The members of Thermococcus are ubiquitously present in natural high-temperature environments, and are therefore considered to play a major role in the ecology and metabolic activity of microbial consortia within hot-water ecosystems. To obtain insight into this important genus, we have determined and annotated the complete 2,088,737-base genome of Thermococcus kodakaraensis strain KOD1, followed by a comparison with the three complete genomes of Pyrococcus spp. A total of 2306 coding DNA sequences (CDSs) have been identified, among which half (1165 CDSs) are annotatable, whereas the functions of 41% (936 CDSs) cannot be predicted from the primary structures. The genome contains seven genes for probable transposases and four virus-related regions. Several proteins within these genetic elements show high similarities to those in Pyrococcus spp., implying the natural occurrence of horizontal gene transfer of such mobile elements among the order Thermococcales. Comparative genomics clarified that 1204 proteins, including those for information processing and basic metabolisms, are shared among T. kodakaraensis and the three Pyrococcus spp. On the other hand, among the set of 689 proteins unique to T. kodakaraensis, there are several intriguing proteins that might be responsible for the specific trait of the genus Thermococcus, such as proteins involved in additional pyruvate oxidation, nucleotide metabolisms, unique or additional metal ion transporters, improved stress response system, and a distinct restriction system. PMID:15710748

  6. Phenotypic comparisons of consensus variants versus laboratory resurrections of Precambrian proteins.

    PubMed

    Risso, Valeria A; Gavira, Jose A; Gaucher, Eric A; Sanchez-Ruiz, Jose M

    2014-06-01

    Consensus-sequence engineering has generated protein variants with enhanced stability, and sometimes, with modulated biological function. Consensus mutations are often interpreted as the introduction of ancestral amino acid residues. However, the precise relationship between consensus engineering and ancestral protein resurrection is not fully understood. Here, we report the properties of proteins encoded by consensus sequences derived from a multiple sequence alignment of extant, class A β-lactamases, as compared with the properties of ancient Precambrian β-lactamases resurrected in the laboratory. These comparisons considered primary sequence, secondary, and tertiary structure, as well as stability and catalysis against different antibiotics. Out of the three consensus variants generated, one could not be expressed and purified (likely due to misfolding and/or low stability) and only one displayed substantial stability having substrate promiscuity, although to a lower extent than ancient β-lactamases. These results: (i) highlight the phenotypic differences between consensus variants and laboratory resurrections of ancestral proteins; (ii) question interpretations of consensus proteins as phenotypic proxies of ancestral proteins; and (iii) support the notion that ancient proteins provide a robust approach toward the preparation of protein variants having large numbers of mutational changes while possessing unique biomolecular properties. PMID:24710963

  7. MutaBind estimates and interprets the effects of sequence variants on protein-protein interactions.

    PubMed

    Li, Minghui; Simonetti, Franco L; Goncearenco, Alexander; Panchenko, Anna R

    2016-07-01

    Proteins engage in highly selective interactions with their macromolecular partners. Sequence variants that alter protein binding affinity may cause significant perturbations or complete abolishment of function, potentially leading to diseases. There exists a persistent need to develop a mechanistic understanding of impacts of variants on proteins. To address this need we introduce a new computational method MutaBind to evaluate the effects of sequence variants and disease mutations on protein interactions and calculate the quantitative changes in binding affinity. The MutaBind method uses molecular mechanics force fields, statistical potentials and fast side-chain optimization algorithms. The MutaBind server maps mutations on a structural protein complex, calculates the associated changes in binding affinity, determines the deleterious effect of a mutation, estimates the confidence of this prediction and produces a mutant structural model for download. MutaBind can be applied to a large number of problems, including determination of potential driver mutations in cancer and other diseases, elucidation of the effects of sequence variants on protein fitness in evolution and protein design. MutaBind is available at http://www.ncbi.nlm.nih.gov/projects/mutabind/. PMID:27150810

  8. Sequence comparison on a cluster of workstations using the PVM system

    SciTech Connect

    Guan, X.; Mural, R.J.; Uberbacher, E.C.

    1995-02-01

    We have implemented a distributed sequence comparison algorithm on a cluster of workstations using the PVM paradigm. This implementation has achieved similar performance to the intel iPSC/860 Hypercube, a massively parallel computer. The distributed sequence comparison algorithm serves as a search tool for two Internet servers GRAIL and GENQUEST. This paper describes the implementation and the performance of the algorithm.

  9. Role of sequence and membrane composition in structure of transmembrane domain of Amyloid Precursor Protein

    NASA Astrophysics Data System (ADS)

    Straub, John

    2013-03-01

    Aggregation of proteins of known sequence is linked to a variety of neurodegenerative disorders. The amyloid β (Aβ) protein associated with Alzheimer's Disease (AD) is derived from cleavage of the 99 amino acid C-terminal fragment of Amyloid Precursor Protein (APP-C99) by γ-secretase. Certain familial mutations of APP-C99 have been shown to lead to altered production of Aβ protein and the early onset of AD. We describe simulation studies exploring the structure of APP-C99 in micelle and membrane environments. Our studies explore how changes in sequence and membrane composition influence (1) the structure of monomeric APP-C99 and (2) APP-C99 homodimer structure and stability. Comparison of simulation results with recent NMR studies of APP-C99 monomers and dimers in micelle and bicelle environments provides insight into how critical aspects of APP-C99 structure and dimerization correlate with secretase processing, an essential component of the Aβ protein aggregation pathway and AD.

  10. Bioinformatic tools for DNA/protein sequence analysis, functional assignment of genes and protein classification.

    PubMed

    Rehm, B H

    2001-12-01

    The development of efficient DNA sequencing methods has led to the achievement of the DNA sequence of entire genomes from (to date) 55 prokaryotes, 5 eukaryotic organisms and 10 eukaryotic chromosomes. Thus, an enormous amount of DNA sequence data is available and even more will be forthcoming in the near future. Analysis of this overwhelming amount of data requires bioinformatic tools in order to identify genes that encode functional proteins or RNA. This is an important task, considering that even in the well-studied Escherichia coli more than 30% of the identified open reading frames are hypothetical genes. Future challenges of genome sequence analysis will include the understanding of gene regulation and metabolic pathway reconstruction including DNA chip technology, which holds tremendous potential for biomedicine and the biotechnological production of valuable compounds. The overwhelming volume of information often confuses scientists. This review intends to provide a guide to choosing the most efficient way to analyze a new sequence or to collect information on a gene or protein of interest by applying current publicly available databases and Web services. Recently developed tools that allow functional assignment of genes, mainly based on sequence similarity of the deduced amino acid sequence, using the currently available and increasing biological databases will be discussed.

  11. Evolution of Protein-binding DNA Sequences through Competitive Binding

    NASA Astrophysics Data System (ADS)

    Peng, Weiqun; Gerland, Ulrich; Hwa, Terence; Levine, Herbert

    2002-03-01

    The dynamics of in vitro DNA evolution controlled via competitive binding of DNA sequences to proteins has been explored in a recent serial transfer experiment (B. Dubertret, S. Liu, Q. Ouyang, A. Libchaber, Phys. Rev. Lett. 86, 6022 (2001)). Motivated by the experiment, we investigate a continuum model for this evolution process in various parameter regimes. We establish a self-consistent mean-field evolution equation, determine its dynamical properties and finite population size corrections. In addition, we discuss the experimental implications of our results.

  12. No genome-wide protein sequence convergence for echolocation.

    PubMed

    Zou, Zhengting; Zhang, Jianzhi

    2015-05-01

    Toothed whales and two groups of bats independently acquired echolocation, the ability to locate and identify objects by reflected sound. Echolocation requires physiologically complex and coordinated vocal, auditory, and neural functions, but the molecular basis of the capacity for echolocation is not well understood. A recent study suggested that convergent amino acid substitutions widespread in the proteins of echolocators underlay the convergent origins of mammalian echolocation. Here, we show that genomic signatures of molecular convergence between echolocating lineages are generally no stronger than those between echolocating and comparable nonecholocating lineages. The same is true for the group of 29 hearing-related proteins claimed to be enriched with molecular convergence. Reexamining the previous selection test reveals several flaws and invalidates the asserted evidence for adaptive convergence. Together, these findings indicate that the reported genomic signatures of convergence largely reflect the background level of sequence convergence unrelated to the origins of echolocation. PMID:25631925

  13. 3-d structure-based amino acid sequence alignment of esterases, lipases and related proteins

    SciTech Connect

    Gentry, M.K.; Doctor, B.P.; Cygler, M.; Schrag, J.D.; Sussman, J.L.

    1993-05-13

    Acetylcholinesterase and butyrylcholinesterase, enzymes with potential as pretreatment drugs for organophosphate toxicity, are members of a larger family of homologous proteins that includes carboxylesterases, cholesterol esterases, lipases, and several nonhydrolytic proteins. A computer-generated alignment of 18 of the proteins, the acetylcholinesterases, butyrylcholinesterases, carboxylesterases, some esterases, and the nonenzymatic proteins has been previously presented. More recently, the three-dimensional structures of two enzymes in this group, acetylcholinesterase from Torpedo californica and lipase from Geotrichum candidum, have been determined. Based on the X-ray structures and the superposition of these two enzymes, it was possible to obtain an improved amino acid sequence alignment of 32 members of this family of proteins. Examination of this alignment reveals that 24 amino acids are invariant in all of the hydrolytic proteins, and an additional 49 are well conserved. Conserved amino acids include those of the active site, the disulfide bridges, the salt bridges, in the core of the proteins, and at the edges of secondary structural elements. Comparison of the three-dimensional structures makes it possible to find a well-defined structural basis for the conservation of many of these amino acids.

  14. Characterization of Mapuera virus: structure, proteins and nucleotide sequence of the gene encoding the nucleocapsid protein.

    PubMed

    Henderson, G W; Laird, C; Dermott, E; Rima, B K

    1995-10-01

    The molecular biology of Mapuera virus was studied at both the protein and nucleic acid levels. Seven virus-encoded proteins were detected in infected Vero cells. The sizes and characteristics of each of the proteins determined from various radiolabelling experiments allowed preliminary identification of the proteins as the large (L; 190 kDa), haemagglutinin neuraminidase (HN; 74 kDa), nucleocapsid (N; 66 kDa), fusion (F0; 63 kDa), phosphoprotein (P; 49 kDa), matrix (M; 43 kDa) and non-structural (V; 35 kDa) proteins. Western blot analysis showed that the HN, N and P proteins were major antigens recognized in the mouse. A cDNA library of total virus-infected cellular mRNA was created and screening of the library resulted in the detection of cDNA sequences representing the N mRNA transcript of Mapuera virus. The N mRNA sequence determined from the clones was 1731 nt in length and contained an ORF that encoded 537 amino acids, the complete 3' untranslated region and part of the 5' non-coding region. The calculated M(r) of the N protein was 59 kDa, which is close to the 66 kDa protein observed by SDS-PAGE. PMID:7595354
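
    As a rough plausibility check on the calculated M(r) quoted above, the mass of a 537-residue protein can be estimated with the common rule of thumb of about 110 Da per amino acid residue. This Python sketch is only an approximation under that assumption; the authors presumably computed the value from the actual amino acid composition, and the constant and function name here are illustrative.

```python
# Rule-of-thumb average mass of one amino acid residue in a protein (assumption).
AVERAGE_RESIDUE_MASS_DA = 110.0


def approx_mass_kda(n_residues):
    """Estimate protein mass in kDa from the number of residues."""
    return n_residues * AVERAGE_RESIDUE_MASS_DA / 1000.0


if __name__ == "__main__":
    # 537 residues * 110 Da = 59,070 Da, i.e. about 59 kDa
    print(round(approx_mass_kda(537), 1))
```

    The estimate agrees with the calculated 59 kDa; the higher apparent mass of 66 kDa is the value observed by SDS-PAGE.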

  15. A novel statistical measure for sequence comparison on the basis of k-word counts.

    PubMed

    Yang, Xiwu; Wang, Tianming

    2013-02-01

    Numerous efficient methods based on word counts for sequence analysis have been proposed to characterize DNA sequences to help in comparison, retrieval from the databases and reconstructing evolutionary relations. However, most of them seem unrelated to any intrinsic characteristics of DNA. In this paper, we proposed a novel statistical measure for sequence comparison on the basis of k-word counts. This new measure removed the influence of sequences' lengths and uncovered bulk property of DNA sequences. The proposed measure was tested by similarity search and phylogenetic analysis. The experimental assessment demonstrated that our similarity measure was efficient.
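
    To make the idea of comparison by k-word counts concrete, the following Python sketch counts overlapping k-words, converts the counts to frequencies so that sequence length does not dominate, and compares two sequences by Euclidean distance. This is a generic word-frequency distance chosen for illustration, not the statistical measure proposed in the paper, which the abstract does not specify; all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kword_frequencies(seq, k):
    """Relative frequencies of all overlapping k-words in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kword_distance(a, b, k=3):
    """Euclidean distance between the k-word frequency vectors of a and b."""
    fa, fb = kword_frequencies(a, k), kword_frequencies(b, k)
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


if __name__ == "__main__":
    print(kword_distance("ACGTACGTACGT", "ACGTTGCAACGT", k=2))
```

    A matrix of such pairwise distances could serve as input to a similarity search or a distance-based tree-building method.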

  16. Sequence similarity network reveals common ancestry of multidomain proteins.

    PubMed

    Song, Nan; Joseph, Jacob M; Davis, George B; Durand, Dannie

    2008-04-01

    We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain

  17. Gleditsia sinensis: Transcriptome Sequencing, Construction, and Application of Its Protein-Protein Interaction Network

    PubMed Central

    Zhu, Liucun; Zhang, Ying; Guo, Wenna; Wang, Qiang

    2014-01-01

    Gleditsia sinensis is a genus of deciduous tree in the family Caesalpinioideae, native to China, and is of great economic importance. However, despite its economic value, gene sequence information is strongly lacking. In the present study, transcriptome sequencing of G. sinensis was performed, resulting in approximately 75.5 million clean reads assembled into 142155 unique transcripts and yielding 58583 unigenes. The average length of the unigenes was 900 bp, with an N50 of 549 bp. The obtained unigene sequences were then compared to four protein databases: the NCBI nonredundant protein database (NRDB), Swiss-Prot, the Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Cluster of Orthologous Groups (COG). Using BLAST, 31385 unigenes (53.6%) were assigned functional annotations. Additionally, sequence homologies between identified unigenes and genes of known species in a protein-protein interaction (PPI) network facilitated construction of a G. sinensis PPI network. Based on this network, new stress resistance genes (including cold, drought, and high salinity) were predicted. The present study is the first investigation of genome-wide gene expression in G. sinensis, and its results provide a basis for future functional genomic studies of this species. PMID:24982878
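
    The N50 statistic reported above is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal Python sketch of that standard definition follows; the function name and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest and accumulated until
    the running total reaches half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    # Total is 20; the two longest sequences (6 + 5 = 11) cover half of it.
    print(n50([2, 3, 4, 5, 6]))
```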

  18. Transitive Homology-Guided Structural Studies Lead to Discovery of Cro Proteins With 40% Sequence Identity But Different Folds

    SciTech Connect

    Roessler, C.G.; Hall, B.M.; Anderson, W.J.; Ingram, W.M.; Roberts, S.A.; Montfort, W.R.; Cordes, M.H.J.

    2009-05-27

    Proteins that share common ancestry may differ in structure and function because of divergent evolution of their amino acid sequences. For a typical diverse protein superfamily, the properties of a few scattered members are known from experiment. A satisfying picture of functional and structural evolution in relation to sequence changes, however, may require characterization of a larger, well chosen subset. Here, we employ a 'stepping-stone' method, based on transitive homology, to target sequences intermediate between two related proteins with known divergent properties. We apply the approach to the question of how new protein folds can evolve from preexisting folds and, in particular, to an evolutionary change in secondary structure and oligomeric state in the Cro family of bacteriophage transcription factors, initially identified by sequence-structure comparison of distant homologs from phages P22 and λ. We report crystal structures of two Cro proteins, Xfaso 1 and Pfl 6, with sequences intermediate between those of P22 and λ. The domains show 40% sequence identity but differ by switching of α-helix to β-sheet in a C-terminal region spanning approximately 25 residues. Sedimentation analysis also suggests a correlation between helix-to-sheet conversion and strengthened dimerization.

  19. Generic Comparison of Protein Inference Engines

    PubMed Central

    Claassen, Manfred; Reiter, Lukas; Hengartner, Michael O.; Buhmann, Joachim M.; Aebersold, Ruedi

    2012-01-01

    Protein identifications, instead of peptide-spectrum matches, constitute the biologically relevant result of shotgun proteomics studies. How to appropriately infer and report protein identifications has triggered a still ongoing debate. This debate has so far suffered from the lack of appropriate performance measures that allow us to objectively assess protein inference approaches. This study describes an intuitive, generic and yet formal performance measure and demonstrates how it enables experimentalists to select an optimal protein inference strategy for a given collection of fragment ion spectra. We applied the performance measure to systematically explore the benefit of excluding possibly unreliable protein identifications, such as single-hit wonders. Therefore, we defined a family of protein inference engines by extending a simple inference engine by thousands of pruning variants, each excluding a different specified set of possibly unreliable identifications. We benchmarked these protein inference engines on several data sets representing different proteomes and mass spectrometry platforms. Optimally performing inference engines retained all high confidence spectral evidence, without posterior exclusion of any type of protein identifications. Despite the diversity of studied data sets consistently supporting this rule, other data sets might behave differently. In order to ensure maximal reliable proteome coverage for data sets arising in other studies we advocate abstaining from rigid protein inference rules, such as exclusion of single-hit wonders, and instead consider several protein inference approaches and assess these with respect to the presented performance measure in the specific application context. PMID:22057310

  20. The presynaptic cytomatrix protein Bassoon: sequence and chromosomal localization of the human BSN gene.

    PubMed

    Winter, C; tom Dieck, S; Boeckers, T M; Bockmann, J; Kämpf, U; Sanmartí-Vila, L; Langnaese, K; Altrock, W; Stumm, M; Soyke, A; Wieacker, P; Garner, C C; Gundelfinger, E D

    1999-05-01

    Bassoon is a novel 420-kDa protein recently identified as a component of the cytoskeleton at presynaptic neurotransmitter release sites. Analysis of the rat and mouse sequences revealed a polyglutamine stretch in the C-terminal part of the protein. Since it is known for some proteins that abnormal amplification of such polyglutamine regions can cause late-onset neurodegeneration, we cloned and localized the human BASSOON gene (BSN). Phage clones spanning most of the open reading frame and the 3' untranslated region were isolated from a human genomic library and used for chromosomal localization of BSN to chromosome 3p21 by FISH. The localization was confirmed by PCR on rodent/human somatic cell hybrids; it is consistent with the localization of the murine Bsn gene at chromosome 9F. Sequencing revealed a polyglutamine stretch of only five residues in human, and PCR amplifications from 50 individuals showed no obvious length polymorphism in this region. Analysis of the primary structure of Bassoon and comparison to previous database entries provide evidence for a newly emerging protein family.

  1. CRYSpred: accurate sequence-based protein crystallization propensity prediction using sequence-derived structural characteristics.

    PubMed

    Mizianty, Marcin J; Kurgan, Lukasz A

    2012-01-01

    Relatively low success rates of X-ray crystallography, which is the most popular method for solving protein structures, motivate development of novel methods that support selection of tractable protein targets. This aspect is particularly important in the context of the current structural genomics efforts that allow for a certain degree of flexibility in the target selection. We propose CRYSpred, a novel in-silico crystallization propensity predictor that uses a set of 15 novel features which utilize a broad range of inputs including charge, hydrophobicity, and amino acid composition derived from the protein chain, and the solvent accessibility and disorder predicted from the protein sequence. Our method outperforms seven modern crystallization propensity predictors on three benchmark test datasets that are independent of the training dataset. The strong predictive performance offered by CRYSpred is attributed to the careful design of the features, utilization of the comprehensive set of inputs, and the usage of the Support Vector Machine classifier. The inputs utilized by CRYSpred are well-aligned with the existing rules-of-thumb that are used in structural genomics studies. PMID:21919861

  3. Prediction of Membrane Transport Proteins and Their Substrate Specificities Using Primary Sequence Information

    PubMed Central

    Mishra, Nitish K.; Chang, Junil; Zhao, Patrick X.

    2014-01-01

    Background: Membrane transport proteins (transporters) move hydrophilic substrates across hydrophobic membranes and play vital roles in most cellular functions. Transporters represent a diverse group of proteins that differ in topology, energy coupling mechanism, and substrate specificity as well as sequence similarity. Among the functional annotations of transporters, information about their transporting substrates is especially important. The experimental identification and characterization of transporters is currently costly and time-consuming. The development of robust bioinformatics-based methods for the prediction of membrane transport proteins and their substrate specificities is therefore an important and urgent task. Results: Support vector machine (SVM)-based computational models, which comprehensively utilize integrative protein sequence features such as amino acid composition, dipeptide composition, physico-chemical composition, biochemical composition, and position-specific scoring matrices (PSSM), were developed to predict the substrate specificity of seven transporter classes: amino acid, anion, cation, electron, protein/mRNA, sugar, and other transporters. An additional model to differentiate transporters from non-transporters was also developed. Among the developed models, the biochemical composition and PSSM hybrid model outperformed other models and achieved an overall average prediction accuracy of 76.69% with a Matthews correlation coefficient (MCC) of 0.49 and a receiver operating characteristic area under the curve (AUC) of 0.833 on our main dataset. This model also achieved an overall average prediction accuracy of 78.88% and MCC of 0.41 on an independent dataset. Conclusions: Our analyses suggest that evolutionary information (i.e., the PSSM) and the AAIndex are key features for the substrate specificity prediction of transport proteins. In comparison, similarity-based methods such as BLAST, PSI-BLAST, and hidden Markov models do not provide

  4. Analysis on the preference for sequence matching between mRNA sequences and the corresponding introns in ribosomal protein genes.

    PubMed

    Zhang, Qiang; Li, Hong; Zhao, Xiaoqing; Zheng, Yan; Meng, Hu; Jia, Yun; Xue, Hui; Bo, Sulin

    2016-03-01

    Introns continue to play an important role after splicing. They can contribute to gene expression and regulation by interacting with the corresponding mRNA sequences. Using Smith-Waterman local alignment, we obtained the optimally matched segments between intron sequences and mRNA sequences. Analyzing the distribution of the optimal matching regions on the mRNA sequences of ribosomal protein genes from 27 species, we find a strong interaction between UTR sequences and introns. There are many optimal matching regions as well as low-matching regions, and the latter are presumed to be the binding regions of protein complexes. The optimal matching frequency distributions differ markedly near mRNA functional sites such as translation initiation and termination sites, exon-exon junctions, and EJC regions. This result indicates that intron sequences and mature mRNA sequences have co-evolved and interact to perform their functions. PMID:26707402
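
    The local comparison described in this record can be illustrated with a minimal Smith-Waterman sketch. The scoring values (match, mismatch, linear gap penalty) and the function name are assumptions chosen for illustration; the abstract does not state the parameters the authors used.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not the study's parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences (hypothetical): the shared segment ACGTACG is recovered.
score, seg_a, seg_b = smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG")
```

    Recording the start and end coordinates of each best segment on the mRNA would give the kind of positional distribution the abstract analyzes.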

  5. Folding and function of the myelin proteins from primary sequence data.

    PubMed

    Inouye, H; Kirschner, D A

    1991-01-01

    To explain how the myelin proteins are involved in the organization and function of the myelin sheath requires knowing their molecular structures. Except for P2 basic protein of PNS myelin, however, their structures are not yet known. As an aid to predicting their molecular folding and possible functions, we have developed a FORTRAN program to analyze the primary sequence data for proteins, and have applied this to the myelin proteins in particular. In this program, propensities for the secondary structure conformations as well as physical-chemical parameters are assigned to the amino acids and the pattern of these parameters is examined by calculating their average values, autocorrelation functions and Fourier transforms. To compare two proteins, their sequences are aligned using a unitary scoring matrix, and homologies are searched by plotting a two-dimensional map of the correlation coefficients. Comparison of the corresponding myelin basic proteins (MBP) and P0 glycoproteins (P0) for rodent and shark showed that the conserved residues included most of the amino acids which were predicted to form the alpha or beta conformations, while the altered residues were mainly in the hydrophilic and turn or coil regions. In both rodent and shark the putative extracellular domain of P0 glycoprotein displayed consecutive peaks of beta propensity similar to that for the immunoglobulins, while the cytoplasmic domain showed alpha-beta-alpha folding. To trace the immunoglobulin fold along the P0 sequence, we compared the beta propensity curve of P0 with that of the immunoglobulin M603, whose three-dimensional structure has been determined. We propose that the flat beta-sheets of P0 are orientated parallel to the membrane surface to facilitate their homotypic interaction in the extracellular space. An extra beta-fold in the extracellular domain of shark P0 compared with rodent P0 was found, and this may result in a greater attraction between the apposed extracellular surfaces

  6. Further Examples of Evolution by Gene Duplication Revealed through DNA Sequence Comparisons

    PubMed Central

    Ohta, T.

    1994-01-01

    To test the theory that evolution by gene duplication occurs as a result of positive Darwinian selection that accompanies the acceleration of mutant substitutions, DNA sequences of recent duplication were analyzed by estimating the numbers of synonymous and nonsynonymous substitutions. For the troponin C family, at the period of differentiation of the fast and slow isoforms, amino acid substitutions were shown to have been accelerated relative to synonymous substitutions. Comparison of the first exon of α-actin genes revealed that amino acid substitutions were accelerated when the smooth muscle, skeletal and cardiac isoforms differentiated. Analysis of members of the heat shock protein 70 gene family of mammals indicates that heat shock responsive genes including duplicated copies are evolving rapidly, contrary to the cognate genes which have been evolutionarily conservative. For the α1-antitrypsin reactive center, the acceleration of amino acid substitution has been found for gene pairs of recent duplication. PMID:7896112
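
    The distinction between synonymous and nonsynonymous changes can be made concrete with a small sketch. It is a naive per-codon classification under the standard genetic code, assuming two aligned, gap-free, in-frame coding sequences; it is not the estimation procedure used in the paper, which the abstract does not specify and which would normally also correct for the number of sites and for multiple hits. The function name and example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_differences(seq1, seq2):
    """Count differing codons as (synonymous, nonsynonymous).

    Assumes both sequences are aligned, in frame, and free of gaps.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be equal length and a multiple of 3")
    syn = nonsyn = 0
    for k in range(0, len(seq1), 3):
        c1, c2 = seq1[k:k + 3].upper(), seq2[k:k + 3].upper()
        if c1 == c2:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1       # same amino acid, different codon
        else:
            nonsyn += 1    # amino acid replaced
    return syn, nonsyn


# GCT -> GCC keeps Ala (synonymous); AAA -> AGA changes Lys to Arg.
counts = classify_codon_differences("ATGGCTAAA", "ATGGCCAGA")
```

    An excess of nonsynonymous over synonymous change after duplication is the kind of signal the abstract interprets as accelerated amino acid substitution.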

  7. Direct Chloroplast Sequencing: Comparison of Sequencing Platforms and Analysis Tools for Whole Chloroplast Barcoding

    PubMed Central

    Brozynska, Marta; Furtado, Agnelo; Henry, Robert James

    2014-01-01

    Direct sequencing of total plant DNA using next generation sequencing technologies generates a whole chloroplast genome sequence that has the potential to provide a barcode for use in plant and food identification. Advances in DNA sequencing platforms may make this an attractive approach for routine plant identification. The HiSeq (Illumina) and Ion Torrent (Life Technology) sequencing platforms were used to sequence total DNA from rice to identify polymorphisms in the whole chloroplast genome sequence of a wild rice plant relative to cultivated rice (cv. Nipponbare). Consensus chloroplast sequences were produced by mapping sequence reads to the reference rice chloroplast genome or by de novo assembly and mapping of the resulting contigs to the reference sequence. A total of 122 polymorphisms (SNPs and indels) between the wild and cultivated rice chloroplasts were predicted by these different sequencing and analysis methods. Of these, a total of 102 polymorphisms including 90 SNPs were predicted by both platforms. Indels were more variable with different sequencing methods, with almost all discrepancies found in homopolymers. The Ion Torrent platform gave no apparent false SNP but was less reliable for indels. The methods should be suitable for routine barcoding using appropriate combinations of sequencing platform and data analysis. PMID:25329378

  8. Direct chloroplast sequencing: comparison of sequencing platforms and analysis tools for whole chloroplast barcoding.

    PubMed

    Brozynska, Marta; Furtado, Agnelo; Henry, Robert James

    2014-01-01

    Direct sequencing of total plant DNA using next generation sequencing technologies generates a whole chloroplast genome sequence that has the potential to provide a barcode for use in plant and food identification. Advances in DNA sequencing platforms may make this an attractive approach for routine plant identification. The HiSeq (Illumina) and Ion Torrent (Life Technology) sequencing platforms were used to sequence total DNA from rice to identify polymorphisms in the whole chloroplast genome sequence of a wild rice plant relative to cultivated rice (cv. Nipponbare). Consensus chloroplast sequences were produced by mapping sequence reads to the reference rice chloroplast genome or by de novo assembly and mapping of the resulting contigs to the reference sequence. A total of 122 polymorphisms (SNPs and indels) between the wild and cultivated rice chloroplasts were predicted by these different sequencing and analysis methods. Of these, a total of 102 polymorphisms including 90 SNPs were predicted by both platforms. Indels were more variable with different sequencing methods, with almost all discrepancies found in homopolymers. The Ion Torrent platform gave no apparent false SNP but was less reliable for indels. The methods should be suitable for routine barcoding using appropriate combinations of sequencing platform and data analysis.

  9. Cloning and sequence of the human nuclear protein cyclin: homology with DNA-binding proteins.

    PubMed Central

    Almendral, J M; Huebsch, D; Blundell, P A; Macdonald-Bravo, H; Bravo, R

    1987-01-01

    A full-length cDNA clone for the human nuclear protein cyclin has been isolated by using polyclonal antibodies and sequenced. The sequence predicts a protein of 261 amino acids (Mr 29,261) with a high content of acidic (41, aspartic and glutamic acids) versus basic (24, lysine and arginine) amino acids. The identity of the cDNA clone was confirmed by in vitro hybrid-arrested translation of cyclin mRNA. Blot-hybridization analysis of mouse 3T3 and human MOLT-4 cell RNA revealed a mRNA species of approximately the same size as the cDNA insert. Expression of cyclin mRNA was undetectable or very low in quiescent cells, increasing after 8-10 hr of serum stimulation. Inhibition of DNA synthesis by hydroxyurea in serum-stimulated cells did not affect the increase in cyclin mRNA but inhibited the expression of H3 mRNA by 90%. These results suggest that expression of cyclin and histone mRNAs are controlled by different mechanisms. A region of the cyclin sequence shows a significant homology with the putative DNA binding site of several proteins, especially with the transcriptional-regulator cAMP-binding protein of Escherichia coli, suggesting that cyclin could play a similar role in eukaryotic cells. PMID:2882507

  10. Sequence-Specific Protein Aggregation Generates Defined Protein Knockdowns in Plants

    PubMed Central

    Vuylsteke, Marnik; Aesaert, Stijn; Rombaut, Debbie; De Smet, Frederik; Xu, Jie; Van Lijsebettens, Mieke; Rousseau, Frederic

    2016-01-01

    Protein aggregation is determined by short (5–15 amino acids) aggregation-prone regions (APRs) of the polypeptide sequence that self-associate in a specific manner to form β-structured inclusions. Here, we demonstrate that the sequence specificity of APRs can be exploited to selectively knock down proteins with different localization and function in plants. Synthetic aggregation-prone peptides derived from the APRs of either the negative regulators of the brassinosteroid (BR) signaling, the glycogen synthase kinase 3/Arabidopsis SHAGGY-like kinases (GSK3/ASKs), or the starch-degrading enzyme α-glucan water dikinase were designed. Stable expression of the APRs in Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) induced aggregation of the target proteins, giving rise to plants displaying constitutive BR responses and increased starch content, respectively. Overall, we show that the sequence specificity of APRs can be harnessed to generate aggregation-associated phenotypes in a targeted manner in different subcellular compartments. This study points toward the potential application of induced targeted aggregation as a useful tool to knock down protein functions in plants and, especially, to generate beneficial traits in crops. PMID:27208282

  11. Design of Protein Multi-specificity Using an Independent Sequence Search Reduces the Barrier to Low Energy Sequences.

    PubMed

    Sevy, Alexander M; Jacobs, Tim M; Crowe, James E; Meiler, Jens

    2015-07-01

    Computational protein design has found great success in engineering proteins for thermodynamic stability, binding specificity, or enzymatic activity in a 'single state' design (SSD) paradigm. Multi-specificity design (MSD), on the other hand, involves considering the stability of multiple protein states simultaneously. We have developed a novel MSD algorithm, which we refer to as REstrained CONvergence in multi-specificity design (RECON). The algorithm allows each state to adopt its own sequence throughout the design process rather than enforcing a single sequence on all states. Convergence to a single sequence is encouraged through an incrementally increasing convergence restraint for corresponding positions. Compared to MSD algorithms that enforce (constrain) an identical sequence on all states, the energy landscape is simplified, which accelerates the search drastically. As a result, RECON can readily be used in simulations with a flexible protein backbone. We have benchmarked RECON on two design tasks. First, we designed antibodies derived from a common germline gene against their diverse targets to assess recovery of the germline, polyspecific sequence. Second, we designed "promiscuous", polyspecific proteins against all binding partners and measured recovery of the native sequence. We show that RECON is able to efficiently recover native-like, biologically relevant sequences in this diverse set of protein complexes. PMID:26147100

  12. Next-Generation Sequencing for Binary Protein–Protein Interactions

    PubMed Central

    Suter, Bernhard; Zhang, Xinmin; Pesce, C. Gustavo; Mendelsohn, Andrew R.; Dinesh-Kumar, Savithramma P.; Mao, Jian-Hua

    2015-01-01

    The yeast two-hybrid (Y2H) system exploits host cell genetics in order to display binary protein–protein interactions (PPIs) via defined and selectable phenotypes. Numerous improvements have been made to this method, adapting the screening principle for diverse applications, including drug discovery and the scale-up for proteome wide interaction screens in human and other organisms. Here we discuss a systematic workflow and analysis scheme for screening data generated by Y2H and related assays that includes high-throughput selection procedures, readout of comprehensive results via next-generation sequencing (NGS), and the interpretation of interaction data via quantitative statistics. The novel assays and tools will serve the broader scientific community to harness the power of NGS technology to address PPI networks in health and disease. We discuss examples of how this next-generation platform can be applied to address specific questions in diverse fields of biology and medicine. PMID:26734059

  13. Sequence analysis and expression of the M1 and M2 matrix protein genes of hirame rhabdovirus (HIRRV)

    USGS Publications Warehouse

    Nishizawa, T.; Kurath, G.; Winton, J.R.

    1997-01-01

    We have cloned and sequenced a 2318 nucleotide region of the genomic RNA of hirame rhabdovirus (HIRRV), an important viral pathogen of Japanese flounder Paralichthys olivaceus. This region comprises approximately two-thirds of the 3' end of the nucleocapsid protein (N) gene and the complete matrix protein (M1 and M2) genes with the associated intergenic regions. The partial N gene sequence was 812 nucleotides in length with an open reading frame (ORF) that encoded the carboxyl-terminal 250 amino acids of the N protein. The M1 and M2 genes were 771 and 700 nucleotides in length, respectively, with ORFs encoding proteins of 227 and 193 amino acids. The M1 gene sequence contained an additional small ORF that could encode a highly basic, arginine-rich protein of 25 amino acids. Comparisons of the N, M1, and M2 gene sequences of HIRRV with the corresponding sequences of the fish rhabdoviruses, infectious hematopoietic necrosis virus (IHNV) or viral hemorrhagic septicemia virus (VHSV) indicated that HIRRV was more closely related to IHNV than to VHSV, but was clearly distinct from either. The putative consensus gene termination sequence for IHNV and VHSV, AGAYAG(A)(7), was present in the N-M1, M1-M2, and M2-G intergenic regions of HIRRV as were the putative transcription initiation sequences YGGCAC and AACA. An Escherichia coli expression system was used to produce recombinant proteins from the M1 and M2 genes of HIRRV. These were the same size as the authentic M1 and M2 proteins and reacted with anti-HIRRV rabbit serum in western blots. These reagents can be used for further study of the fish immune response and to test novel control methods.
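
    Locating the consensus signals named in this record is a degenerate-motif search. The sketch below expands IUPAC ambiguity codes into a regular expression and reports every (possibly overlapping) match position. It reads AGAYAG(A)(7) as AGAYAG followed by seven A residues, which is an assumption about the notation; the helper names and the test sequence are hypothetical and are not HIRRV data.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'AGAYAG' into a regex string."""
    return "".join(IUPAC[ch] for ch in motif.upper())


def find_motif(sequence, motif):
    """Return 0-based start positions of all matches, overlaps included."""
    # A lookahead with a capture group lets finditer report overlapping hits.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    dna = sequence.upper().replace("U", "T")
    return [m.start() for m in pattern.finditer(dna)]


# Assumed reading of the termination signal: AGAYAG plus seven A residues.
TERMINATION = "AGAYAG" + "A" * 7
INITIATION = "YGGCAC"

# Made-up test sequence containing two termination signals and one start signal.
toy = "CC" + "AGATAGAAAAAAA" + "TGGCAC" + "AGACAGAAAAAAA" + "GG"
stops = find_motif(toy, TERMINATION)
starts = find_motif(toy, INITIATION)
```

    On a real genome segment, the gaps between consecutive termination and initiation hits would delimit the intergenic regions.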

  14. Comparison of aragonitic molluscan shell proteins.

    PubMed

    Furuhashi, Takeshi; Miksik, Ivan; Smrz, Miloslav; Germann, Bettina; Nebija, Dashnor; Lachmann, Bodo; Noe, Christian

    2010-02-01

    Acidic macromolecules, as a nucleation factor for mollusc shell formation, are a major focus of research. It remains unclear, however, whether acidic macromolecules are present only in calcified shell organic matrices, and which acidic macromolecules are crucial for the nucleation process by binding to chitin as structural components. To clarify these questions, we applied 2D gel electrophoresis and amino acid analysis to soluble shell organic matrices from nacre shell, non-nacre aragonitic shell and non-calcified squid shells. The 2D gel electrophoresis results showed that the acidity of soluble proteins differs even between nacre shells, and some nacre (Haliotis gigantea) showed a basic protein migration pattern. Non-calcified shells also contained some moderately acidic proteins. The results did not support the correlation between the acidity of soluble shell proteins and shell structure.

  15. Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines.

    PubMed

    You, Zhu-Hong; Li, Jianqiang; Gao, Xin; He, Zhou; Zhu, Lin; Lei, Ying-Ke; Ji, Zhiwei

    2015-01-01

    Proteins and their interactions lie at the heart of most underlying biological processes. Consequently, correct detection of protein-protein interactions (PPIs) is of fundamental importance to understand the molecular mechanisms in biological systems. Although advances in high-throughput experimental technology make it possible to detect large numbers of PPIs, the data generated by these methods are unreliable and may not include all possible PPIs. To address this problem, this study develops a novel computational approach to effectively detect protein interactions. The approach is based on a novel matrix-based representation of protein sequence combined with the support vector machine (SVM) algorithm, and fully considers the sequence order and dipeptide information of the protein primary sequence. When applied to yeast PPI datasets, the proposed method reaches 90.06% prediction accuracy with 94.37% specificity at a sensitivity of 85.74%, indicating that this predictor is a useful tool to predict PPIs. The results also demonstrate that our approach can be a helpful supplement to the interactions that have been detected experimentally. PMID:26000305
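
    The performance figures quoted in this record are standard confusion-matrix statistics, and a short sketch shows how they relate. The counts below are hypothetical, chosen only so that sensitivity and specificity match the reported rates on an assumed balanced set of 10,000 positive and 10,000 negative pairs; the abstract does not give the actual dataset sizes.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts: 10,000 interacting and 10,000 non-interacting pairs.
metrics = binary_metrics(tp=8574, fp=563, tn=9437, fn=1426)
```

    With balanced classes, accuracy is the mean of sensitivity and specificity, (85.74% + 94.37%) / 2, which is about 90.06% and is consistent with the accuracy reported in the abstract.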

  16. Sequence-based prediction of protein-protein interaction sites with L1-logreg classifier.

    PubMed

    Dhole, Kaustubh; Singh, Gurdeep; Pai, Priyadarshini P; Mondal, Sukanta

    2014-05-01

    Protein-protein interactions are of central importance for virtually every process in a living cell. Information about the interaction sites in proteins improves our understanding of disease mechanisms and can provide the basis for new therapeutic approaches. Since a multitude of unique residue-residue contacts facilitate the interactions, prediction of protein-protein interaction sites has become one of the most important and challenging problems of computational biology. Although much progress in this field has been reported, this problem is yet to be satisfactorily solved. Here, a novel method (LORIS: L1-regularized LOgistic Regression based protein-protein Interaction Sites predictor) is proposed that identifies interaction residues using sequence features and is implemented via the L1-logreg classifier. Results show that LORIS is not only quite effective but also performs better than existing state-of-the-art methods. LORIS, available as a standalone package, can be useful for facilitating drug-design and targeted mutation related studies, which require a deeper knowledge of protein interaction sites. PMID:24486250

  17. Correlation between sequence, structure and function for trisporoid processing proteins in the model zygomycete Mucor mucedo.

    PubMed

    Ellenberger, Sabrina; Schuster, Stefan; Wöstemeyer, Johannes

    2013-03-01

    Terpenoids, steroids, carotenoids, phytoenes and other chemically related substance groups fulfill multiple functions in all realms of the organismic world. This analysis focuses on trisporoids that operate as pheromones in the phylogenetically ancient fungal group of mucoralean zygomycetes. Trisporoids serve as pheromones for recognizing complementary mating partners and for inducing the differentiation program towards sexual spore formation. Trisporoids are synthesized by oxidative degradation of β-carotene. Structurally, they are related to retinoids in mammals and abscisic acid in vascular plants. In order to evaluate evolutionary relationships between proteins involved in trisporoid binding and also for checking possibilities to recognize functionally related proteins by sequence and structure comparisons, we compared representative proteins of different origins. Towards this goal, we calculated three-dimensional structures for 4-dihydromethyltrisporate dehydrogenase (TSP1) and 4-dihydrotrisporin dehydrogenase (TSP2), the two proteins involved in trisporic acid synthesis that have unequivocally been correlated with their catalytic function for the model zygomycete Mucor mucedo. TSP1 is an aldo-keto reductase with a TIM-barrel structure, TSP2 belongs to short-chain dehydrogenases, characterized by a Rossmann fold. Evidently, functional conservation, even implying very similar substrates and identical cosubstrates of enzymes in a single organism, turns out to be essentially independent of basic protein structure. The binding sites for NADP and trisporoid ligands in the proteins were determined by docking studies, revealing those regions affecting substrate specificity. Despite the pronounced differences in amino acid sequence and tertiary structure, the surfaces around the active sites are comparable between TSP1 and TSP2. Two binding regions were identified, one sterically open and a second closed one. In contrast to TSP1, all docking models for TSP2 place the

  18. Impaired nuclear import of mammalian Dlx4 proteins as a consequence of rapid sequence divergence

    SciTech Connect

    Coubrough, Melissa L.; Bendall, Andrew J. (E-mail: abendall@uoguelph.ca)

    2006-11-15

    Dlx genes encode a developmentally important family of transcription factors with a variety of functions and sites of action during vertebrate embryogenesis. The murine Dlx4 gene is an enigmatic member of the family; little is known about the normal developmental function(s) of Dlx4. Here, we show that Dlx4 is expressed in the murine placenta and in a trophoblast cell line where the protein localizes to both the nucleus and cytoplasm. Despite the presence of several leucine/valine-rich motifs that match known nuclear export sequences, cytoplasmic Dlx4 is not due to CRM-1-mediated nuclear export. Rather, nuclear import of Dlx4 is compromised by specific residues that flank the nuclear localization signal. One of these residues represents a novel conserved feature of the Dlx4 protein in placental mammals, and the second represents novel variation within mouse Dlx4 isoforms. Comparison of orthologous protein sequences reveals a particularly high rate of non-synonymous change in the coding regions of mammalian Dlx4 genes. Since impaired nuclear localization is unlikely to enhance the function of a nuclear transcription factor, these data point to reduced selection pressure as the basis for the rapid divergence of the Dlx4 gene within the mammalian clade.

  19. Protein sequences classification by means of feature extraction with substitution matrices

    PubMed Central

    2010-01-01

    Background This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors so that well-known machine-learning classifiers, which require this format, can be applied. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step. Results In order to demonstrate the efficiency of this approach, we compare several encoding methods using several machine learning classifiers. The experimental results showed that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of varying the substitution matrix on the quality of our method and hence on the classification quality. We noticed that our method enables good classification accuracies with all the substitution matrices and that the variance of the accuracies obtained with the various substitution matrices is slight. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works. Conclusions The outcomes of our comparative experiments confirm the efficiency of our encoding method to represent protein sequences in classification tasks. PMID:20377887

  20. Attenuation of very virulent infectious bursal disease virus and comparison of full sequences of virulent and attenuated strains.

    PubMed

    Lazarus, D; Pasmanik-Chor, M; Gutter, B; Gallili, G; Barbakov, M; Krispel, S; Pitcovski, J

    2008-04-01

    A very virulent strain of infectious bursal disease virus (IBDVks) was isolated from the bursae of Fabricius of IBDV-affected broiler chickens. Following 43 serial passages in specific pathogen-free embryonated eggs, an attenuated strain was established (IBDVmb). Dosages of IBDVmb in the range 10^2 to 10^4 50% embryo infective doses were found to be safe and protective for commercial chicks. Chickens vaccinated with live vaccine containing IBDVmb responded with precipitating and type-specific neutralizing antibodies, and were immune to subsequent challenge with a very virulent IBDV. IBDVmb has been used as an attenuated vaccine throughout the world since 1993. A comparison of the full sequences of the virulent and attenuated strains (IBDVks and IBDVmb, respectively) revealed seven nucleotides that were different, four of them leading to changes in the amino-acid sequence. Comparison of the protein sequences of these strains with published sequences of very virulent and attenuated phenotypes led us to suggest that the novel differences responsible for virulence of the Israeli strains are: residue 272 (VP2, very conserved site) and residue 527 (VP4), both in segment A, and in segment B (VP1) residues 96 and 161 (both conserved). Our study strengthens the possibility that more than one protein is involved in IBDV attenuation. In all reports, including ours, virulence was reduced without affecting antigenicity of the neutralizing epitopes in VP2. This could have practical implications for attenuated-vaccine development.

  1. X-ray sequence and crystal structure of luffaculin 1, a novel type 1 ribosome-inactivating protein

    PubMed Central

    Hou, Xiaomin; Chen, Minghuang; Chen, Liqing; Meehan, Edward J; Xie, Jieming; Huang, Mingdong

    2007-01-01

    Background Protein sequence can be obtained through Edman degradation, mass spectrometry, or cDNA sequencing. High resolution X-ray crystallography can also be used to derive protein sequence information, but has difficulty distinguishing the Asp/Asn, Glu/Gln, and Val/Thr pairs. Luffaculin 1 is a new type 1 ribosome-inactivating protein (RIP) isolated from the seeds of Luffa acutangula. Besides rRNA N-glycosidase activity, luffaculin 1 also demonstrates activities including inhibition of tumor cell proliferation and induction of tumor cell differentiation. Results The crystal structure of luffaculin 1 was determined at 1.4 Å resolution. Its amino-acid sequence was derived from this high resolution structure using the following criteria: 1) high resolution electron density; 2) comparison of electron density between two molecules that exist in the same crystal; 3) evaluation of the chemical environment of residues to resolve the sequence assignment ambiguity in residue pairs Glu/Gln, Asp/Asn, and Val/Thr; 4) comparison with sequences of the homologous proteins. Using criteria 1 and 2, 66% of the residues can be assigned. With criterion 3 incorporated, 86% of the residues were assigned, suggesting the effectiveness of chemical environment evaluation in resolving residue ambiguity. In total, 94% of the luffaculin 1 sequence was assigned with high confidence using this improved X-ray sequencing strategy. Two N-acetylglucosamine moieties, linked respectively to the residues Asn77 and Asn84, can be identified in the structure. Residues Tyr70, Tyr110, Glu159 and Arg162 define the active site of luffaculin 1 as an RNA N-glycosidase. Conclusion The X-ray sequencing method can be effective for deriving sequence information of proteins. The evaluation of the chemical environment of residues is a useful method to resolve the assignment ambiguity in Glu/Gln, Asp/Asn, and Val/Thr pairs. The sequence and the crystal structure confirm that luffaculin 1 is a new

  2. Comparison of simple sequence repeats in 19 Archaea.

    PubMed

    Trivedi, S

    2006-01-01

    All organisms that have been studied until now have been found to have differential distribution of simple sequence repeats (SSRs), with more SSRs in intergenic than in coding sequences. SSR distribution was investigated in Archaea genomes where complete chromosome sequences of 19 Archaea were analyzed with the program SPUTNIK to find di- to penta-nucleotide repeats. The number of repeats was determined for the complete chromosome sequences and for the coding and non-coding sequences. Different from what has been found for other groups of organisms, there is an abundance of SSRs in coding regions of the genome of some Archaea. Dinucleotide repeats were rare and CG repeats were found in only two Archaea. In general, trinucleotide repeats are the most abundant SSR motifs; however, pentanucleotide repeats are abundant in some Archaea. Some of the tetranucleotide and pentanucleotide repeat motifs are organism specific. In general, repeats are short and CG-rich repeats are present in Archaea having a CG-rich genome. Among the 19 Archaea, SSR density was not correlated with genome size or with optimum growth temperature. Pentanucleotide density had an inverse correlation with the CG content of the genome. PMID:17183484
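
    A di- to penta-nucleotide repeat scan of the kind described in this record can be sketched with a back-referencing regular expression. This is not the SPUTNIK algorithm, which uses its own scoring scheme; the minimum repeat count of three and the function name are assumptions for illustration.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=5, min_repeats=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    The lazy quantifier tries the shortest unit first, so ACACACAC is
    reported as (AC)4 rather than (ACAC)2.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # homopolymer run (e.g. AAAAAA), not a di- to penta-repeat
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


# Made-up sequence with one dinucleotide and one trinucleotide repeat.
toy = "GGG" + "CA" * 5 + "TTC" + "AAG" * 4 + "C"
ssrs = find_ssrs(toy)
```

    Running the scan separately over coding and non-coding intervals and dividing the hit counts by interval length would give comparable SSR densities for the two compartments.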

  3. Quantitative comparison between a multiecho sequence and a single-echo sequence for susceptibility-weighted phase imaging.

    PubMed

    Gilbert, Guillaume; Savard, Geneviève; Bard, Céline; Beaudoin, Gilles

    2012-06-01

    The aim of this study was to investigate the benefits arising from the use of a multiecho sequence for susceptibility-weighted phase imaging using a quantitative comparison with a standard single-echo acquisition. Four healthy adult volunteers were imaged on a clinical 3-T system using a protocol comprising two different three-dimensional susceptibility-weighted gradient-echo sequences: a standard single-echo sequence and a multiecho sequence. Both sequences were repeated twice in order to evaluate the local noise contribution by a subtraction of the two acquisitions. For the multiecho sequence, the phase information from each echo was independently unwrapped, and the background field contribution was removed using either homodyne filtering or the projection onto dipole fields method. The phase information from all echoes was then combined using a weighted linear regression. R2 maps were also calculated from the multiecho acquisitions. The noise standard deviation in the reconstructed phase images was evaluated for six manually segmented regions of interest (frontal white matter, posterior white matter, globus pallidus, putamen, caudate nucleus and lateral ventricle). The use of the multiecho sequence for susceptibility-weighted phase imaging led to a reduction of the noise standard deviation for all subjects and all regions of interest investigated in comparison to the reference single-echo acquisition. On average, the noise reduction ranged from 18.4% for the globus pallidus to 47.9% for the lateral ventricle. In addition, the amount of noise reduction was found to be strongly inversely correlated to the estimated R2 value (R=-0.92). In conclusion, the use of a multiecho sequence is an effective way to decrease the noise contribution in susceptibility-weighted phase images, while preserving both contrast and acquisition time. The proposed approach additionally permits the calculation of R2 maps.
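
    The echo-combination step can be illustrated with a weighted least-squares line fit of phase against echo time. The linear phase model, the weights, and all numerical values below are assumptions for illustration; the abstract states only that a weighted linear regression was used and does not specify the weighting.

```python
def weighted_linear_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept); w holds one non-negative weight per sample.
    """
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical single-voxel data: unwrapped phase (rad) at four echo times (s).
echo_times = [0.005, 0.010, 0.015, 0.020]
phases = [0.1 + 20.0 * te for te in echo_times]   # noise-free toy signal
weights = [4.0, 3.0, 2.0, 1.0]                    # assumed: later echoes trusted less
slope, offset = weighted_linear_fit(echo_times, phases, weights)
```

    In this sketch the fitted slope is the combined phase evolution per unit echo time; down-weighting noisier echoes is one plausible reason a multiecho fit would show less noise than a single echo.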

  4. Comparison of immunoturbidimetric and immunonephelometric assays for specific proteins.

    PubMed

    Mali, Bahera; Armbruster, David; Serediak, Ernie; Ottenbreit, Tammy

    2009-10-01

    Immunoturbidimetric assays for specific proteins are available on "open system" clinical chemistry analyzers. The analytical performance of nine immunoturbidimetric specific protein assays (C3, C4, CRP, Haptoglobin, IgA, IgG, IgM, RF, and Transferrin) was compared to immunonephelometry. Testing was performed on the Abbott ARCHITECT ci8200 and the Dade Behring BNII nephelometer and evaluated for precision, linearity, limit of detection, prozone phenomenon, method comparison, workflow, and proficiency testing survey comparison. Immunoturbidimetric assay performance was satisfactory for total precision, linearity, and limit of detection, and the prozone effect was not observed. Method comparison was acceptable for the immunoglobulins, CRP and transferrin but less favorable for the other assays, likely due to methodology and antibody specificity differences. Immunoturbidimetric specific protein assays allow for efficient test consolidation on a general purpose clinical chemistry analyzer.

  5. RNase-mediated protein footprint sequencing reveals protein-binding sites throughout the human transcriptome.

    PubMed

    Silverman, Ian M; Li, Fan; Alexander, Anissa; Goff, Loyal; Trapnell, Cole; Rinn, John L; Gregory, Brian D

    2014-01-07

    Although numerous approaches have been developed to map RNA-binding sites of individual RNA-binding proteins (RBPs), few methods exist that allow assessment of global RBP-RNA interactions. Here, we describe PIP-seq, a universal, high-throughput, ribonuclease-mediated protein footprint sequencing approach that reveals RNA-protein interaction sites throughout a transcriptome of interest. We apply PIP-seq to the HeLa transcriptome and compare binding sites found using different cross-linkers and ribonucleases. From this analysis, we identify numerous putative RBP-binding motifs, reveal novel insights into co-binding by RBPs, and uncover a significant enrichment for disease-associated polymorphisms within RBP interaction sites.

  6. Exhaustive comparison and classification of ligand-binding surfaces in proteins

    PubMed Central

    Murakami, Yoichi; Kinoshita, Kengo; Kinjo, Akira R; Nakamura, Haruki

    2013-01-01

    Many proteins function by interacting with other small molecules (ligands). Identification of ligand-binding sites (LBS) in proteins can therefore help to infer their molecular functions. A comprehensive comparison among local structures of LBSs was previously performed, in order to understand their relationships and to classify their structural motifs. However, a similar exhaustive comparison among local surfaces of LBSs (patches) has never been performed, due to computational complexity. To enhance our understanding of LBSs, it is worth performing such comparisons among patches and classifying them based on similarities of their surface configurations and electrostatic potentials. In this study, we first developed a rapid method to compare two patches. We then clustered patches corresponding to the same PDB chemical component identifier for a ligand, and selected a representative patch from each cluster. We subsequently compared the representative patches exhaustively and clustered them using a similarity score, PatSim. Finally, the resultant PatSim scores were compared with similarities of atomic structures of the LBSs and those of the ligand-binding protein sequences and functions. Consequently, we classified the patches into ∼2000 well-characterized clusters. We found that about 63% of these clusters are used in identical protein folds, although about 25% of the clusters are conserved in distantly related proteins and even in proteins with cross-fold similarity. Furthermore, we showed that patches with higher PatSim scores have the potential to be involved in similar biological processes. PMID:23934772

  7. Phylogenetic relationships of Cryptosporidium determined by ribosomal RNA sequence comparison.

    PubMed

    Johnson, A M; Fielke, R; Lumb, R; Baverstock, P R

    1990-04-01

    Reverse transcription of total cellular RNA was used to obtain a partial sequence of the small subunit ribosomal RNA of Cryptosporidium, a protist currently placed in the phylum Apicomplexa. The semi-conserved regions were aligned with homologous sequences in a range of other eukaryotes, and the evolutionary relationships of Cryptosporidium were determined by two different methods of phylogenetic analysis. The prokaryotes Escherichia coli and Halobacterium cuti were included as outgroups. The results do not show an especially close relationship of Cryptosporidium to other members of the phylum Apicomplexa. PMID:2332273

  8. A potent antimicrobial protein from onion seeds showing sequence homology to plant lipid transfer proteins.

    PubMed

    Cammue, B P; Thevissen, K; Hendriks, M; Eggermont, K; Goderis, I J; Proost, P; Van Damme, J; Osborn, R W; Guerbette, F; Kader, J C

    1995-10-01

    An antimicrobial protein of about 10 kD, called Ace-AMP1, was isolated from onion (Allium cepa L.) seeds. Based on the near-complete amino acid sequence of this protein, oligonucleotides were designed for polymerase chain reaction-based cloning of the corresponding cDNA. The mature protein is homologous to plant nonspecific lipid transfer proteins (nsLTPs), but it shares only 76% of the residues that are conserved among all known plant nsLTPs and is unusually rich in arginine. Ace-AMP1 inhibits all 12 tested plant pathogenic fungi at concentrations below 10 µg/mL. Its antifungal activity is either unaffected or only weakly affected by the presence of different cations at concentrations approximating physiological ionic strength conditions. Ace-AMP1 is also active on two Gram-positive bacteria but is apparently not toxic for Gram-negative bacteria and cultured human cells. In contrast to nsLTPs such as those isolated from radish or maize seeds, Ace-AMP1 was unable to transfer phospholipids from liposomes to mitochondria. On the other hand, lipid transfer proteins from wheat and maize seeds showed little or no antimicrobial activity, whereas the radish lipid transfer protein displayed antifungal activity only in media with low cation concentrations. The relevance of these findings with regard to the function of nsLTPs is discussed. PMID:7480341

  9. A potent antimicrobial protein from onion seeds showing sequence homology to plant lipid transfer proteins.

    PubMed Central

    Cammue, B P; Thevissen, K; Hendriks, M; Eggermont, K; Goderis, I J; Proost, P; Van Damme, J; Osborn, R W; Guerbette, F; Kader, J C

    1995-01-01

    An antimicrobial protein of about 10 kD, called Ace-AMP1, was isolated from onion (Allium cepa L.) seeds. Based on the near-complete amino acid sequence of this protein, oligonucleotides were designed for polymerase chain reaction-based cloning of the corresponding cDNA. The mature protein is homologous to plant nonspecific lipid transfer proteins (nsLTPs), but it shares only 76% of the residues that are conserved among all known plant nsLTPs and is unusually rich in arginine. Ace-AMP1 inhibits all 12 tested plant pathogenic fungi at concentrations below 10 µg/mL. Its antifungal activity is either unaffected or only weakly affected by the presence of different cations at concentrations approximating physiological ionic strength conditions. Ace-AMP1 is also active on two Gram-positive bacteria but is apparently not toxic for Gram-negative bacteria and cultured human cells. In contrast to nsLTPs such as those isolated from radish or maize seeds, Ace-AMP1 was unable to transfer phospholipids from liposomes to mitochondria. On the other hand, lipid transfer proteins from wheat and maize seeds showed little or no antimicrobial activity, whereas the radish lipid transfer protein displayed antifungal activity only in media with low cation concentrations. The relevance of these findings with regard to the function of nsLTPs is discussed. PMID:7480341

  10. A local average distance descriptor for flexible protein structure comparison

    PubMed Central

    2014-01-01

    Background: Protein structures are flexible and often show conformational changes upon binding to other molecules to exert biological functions. As protein structures correlate with characteristic functions, structure comparison allows classification and prediction of proteins of undefined functions. However, most comparison methods treat proteins as rigid bodies and cannot retrieve similarities of proteins with large conformational changes effectively. Results: In this paper, we propose a novel descriptor, local average distance (LAD), based on either the geodesic distances (GDs) or Euclidean distances (EDs) for pairwise flexible protein structure comparison. The proposed method was compared with 7 structural alignment methods and 7 shape descriptors on two datasets comprising hinge bending motions from the MolMovDB, and the results have shown that our method outperformed all other methods regarding retrieving similar structures in terms of precision-recall curve, retrieval success rate, R-precision, mean average precision and F1-measure. Conclusions: Both ED- and GD-based LAD descriptors are effective to search deformed structures and overcome the problems of self-connection caused by a large bending motion. We have also demonstrated that the ED-based LAD is more robust than the GD-based descriptor. The proposed algorithm provides an alternative approach for blasting structure database, discovering previously unknown conformational relationships, and reorganizing protein structure classification. PMID:24694083

  11. Rapid identification of sequences for orphan enzymes to power accurate protein annotation.

    PubMed

    Ramkissoon, Kevin R; Miller, Jennifer K; Ojha, Sunil; Watson, Douglas S; Bomar, Martha G; Galande, Amit K; Shearer, Alexander G

    2013-01-01

    The power of genome sequencing depends on the ability to understand what those genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology: "orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass spectrometry-based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.

  12. Evolution of EF-hand calcium-modulated proteins. III. Exon sequences confirm most dendrograms based on protein sequences: calmodulin dendrograms show significant lack of parallelism

    NASA Technical Reports Server (NTRS)

    Nakayama, S.; Kretsinger, R. H.

    1993-01-01

    In the first report in this series we presented dendrograms based on 152 individual proteins of the EF-hand family. In the second we used sequences from 228 proteins, containing 835 domains, and showed that eight of the 29 subfamilies are congruent and that the EF-hand domains of the remaining 21 subfamilies have diverse evolutionary histories. In this study we have computed dendrograms within and among the EF-hand subfamilies using the encoding DNA sequences. In most instances the dendrograms based on protein and on DNA sequences are very similar. Significant differences between protein and DNA trees for calmodulin remain unexplained. In our fourth report we evaluate the sequences and the distribution of introns within the EF-hand family and conclude that exon shuffling did not play a significant role in its evolution.

  13. Proteomic Analysis of Lyme Disease: Global Protein Comparison of Three Strains of Borrelia burgdorferi

    SciTech Connect

    Jacobs, Jon M.; Yang, Xiaohua; Luft, Benjamin J.; Dunn, John J.; Camp, David G.; Smith, Richard D.

    2005-04-01

    The Borrelia burgdorferi spirochete is the causative agent of Lyme disease, the most common tick-borne disease in the United States. It has been studied extensively to help understand its pathogenicity and how it can persist in different mammalian hosts. We report the proteomic analysis of the archetype B. burgdorferi B31 strain and two other strains (ND40 and JD-1) having different Borrelia pathotypes using strong cation exchange fractionation of proteolytic peptides followed by high-resolution, reversed phase capillary liquid chromatography coupled with ion trap tandem mass spectrometric (LC-MS/MS) analysis. Protein identification was facilitated by the availability of the complete B31 genome sequence. A total of 665 Borrelia proteins were identified, representing ~38% coverage of the theoretical B31 proteome. A significant overlap was observed between the identified proteins in direct comparisons between any two strains (>72%), but distinct differences were observed among identified hypothetical and outer membrane proteins of the three strains. Such a concurrent proteomic overview of three Borrelia strains based upon only the B31 genome sequence is shown to provide significant insights into the presence or absence of specific proteins and a broad overall comparison among strains.

  14. The evolution of proteins from random amino acid sequences: II. Evidence from the statistical distributions of the lengths of modern protein sequences.

    PubMed

    White, S H

    1994-04-01

    This paper continues an examination of the hypothesis that modern proteins evolved from random heteropeptide sequences. In support of the hypothesis, White and Jacobs (1993, J Mol Evol 36:79-95) have shown that any sequence chosen randomly from a large collection of nonhomologous proteins has a 90% or better chance of having a lengthwise distribution of amino acids that is indistinguishable from the random expectation regardless of amino acid type. The goal of the present study was to investigate the possibility that the random-origin hypothesis could explain the lengths of modern protein sequences without invoking specific mechanisms such as gene duplication or exon splicing. The sets of sequences examined were taken from the 1989 PIR database and consisted of 1,792 "super-family" proteins selected to have little sequence identity, 623 E. coli sequences, and 398 human sequences. The length distributions of the proteins could be described with high significance by either of two closely related probability density functions: The gamma distribution with parameter 2 or the distribution for the sum of two exponential random independent variables. A simple theory for the distributions was developed which assumes that (1) protoprotein sequences had exponentially distributed random independent lengths, (2) the length dependence of protein stability determined which of these protoproteins could fold into compact primitive proteins and thereby attain the potential for biochemical activity, (3) the useful protein sequences were preserved by the primitive genome, and (4) the resulting distribution of sequence lengths is reflected by modern proteins. The theory successfully predicts the two observed distributions which can be distinguished by the functional form of the dependence of protein stability on length. The theory leads to three interesting conclusions. First, it predicts that a tetra-nucleotide was the signal for primitive translation termination. 
This prediction is
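
    The relationship between the two length distributions named in the abstract can be illustrated numerically. The sketch below reads "parameter 2" as a gamma shape parameter of 2 and assumes the two exponential variables share the same mean, in which case their sum follows that gamma distribution exactly (with unequal means the two forms differ). The mean length of 150 residues per exponential component and all function names are hypothetical, chosen only for illustration.

```python
import math
import random


def gamma2_pdf(x, scale):
    """Density of a gamma distribution with shape 2 and the given scale."""
    return x * math.exp(-x / scale) / scale ** 2


def sample_sum_of_two_exponentials(n, scale, seed=0):
    """Draw n lengths, each the sum of two independent exponentials with mean `scale`."""
    rng = random.Random(seed)
    rate = 1.0 / scale
    return [rng.expovariate(rate) + rng.expovariate(rate) for _ in range(n)]


def empirical_density(samples, lo, hi):
    """Fraction of samples in [lo, hi) divided by the bin width."""
    count = sum(1 for s in samples if lo <= s < hi)
    return count / (len(samples) * (hi - lo))


if __name__ == "__main__":
    scale = 150.0  # hypothetical mean of each exponential component
    lengths = sample_sum_of_two_exponentials(20000, scale)
    print("sample mean:", sum(lengths) / len(lengths), "expected:", 2 * scale)
    print("density near mode:", empirical_density(lengths, 140.0, 160.0),
          "gamma(2) pdf:", gamma2_pdf(150.0, scale))
```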

  15. β-Glucosidase coding sequences and protein from Orpinomyces PC-2

    DOEpatents

    Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong; Ximenes, Eduardo A.

    2001-02-06

    Provided is a novel β-glucosidase from Orpinomyces sp. PC2, nucleotide sequences encoding the mature protein and the precursor protein, and methods for recombinant production of this β-glucosidase.

  16. Comparison of Metalloproteinase Protein and Activity Profiling

    PubMed Central

    Giricz, Orsi; Lauer, Janelle L.; Fields, Gregg B.

    2010-01-01

    Proteolytic enzymes play fundamental roles in many biological processes. Members of the matrix metalloproteinase (MMP) family have been shown to take part in processes crucial in disease progression. The present study used the ExcelArray Human MMP/TIMP Array to quantify MMP and tissue inhibitor of metalloproteinase (TIMP) production in the lysates and media of 14 cancer and one normal cell line. The overall patterns were very similar in terms of which MMPs and TIMPs were secreted in the media versus associated with the cells in the individual samples. However, more MMP was found in the media, both in amount and in variety. TIMP-1 was produced in all cell lines. MMP activity assays with three different FRET substrates were then utilized to determine if protein production correlated with function for the WM-266-4 and BJ cell lines. Metalloproteinase activity was observed for both cell lines with a general MMP substrate (Knight SSP), consistent with protein production data. However, although both cell lines promoted the hydrolysis of a more selective MMP substrate (NFF-3), metalloproteinase activity was only confirmed in the BJ cell line. The use of inhibitors to confirm metalloproteinase activities pointed to the strengths and weaknesses of in situ FRET substrate assays. PMID:20920458

  17. Molecular cloning and sequencing of the gene encoding the fimbrial subunit protein of Bacteroides gingivalis.

    PubMed Central

    Dickinson, D P; Kubiniec, M A; Yoshimura, F; Genco, R J

    1988-01-01

    The gene encoding the fimbrial subunit protein of Bacteroides gingivalis 381, fimbrillin, has been cloned and sequenced. The gene was present as a single copy on the bacterial chromosome, and the codon usage in the gene conformed closely to that expected for an abundant protein. The predicted size of the mature protein was 35,924 daltons, and the secretory form may have had a 10-amino-acid, hydrophilic leader sequence similar to the leader sequences of the MePhe fimbriae family. The protein sequence had no marked similarity to known fimbrial sequences, and no homologous sequences could be found in other black-pigmented Bacteroides species, suggesting that fimbrillin represents a class of fimbrial subunit protein of limited distribution. PMID:2895100

  18. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and 3-dimensional structural information

    PubMed Central

    Pei, Jimin; Grishin, Nick V.

    2015-01-01

    SUMMARY Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of 3-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D web server and package are available at http://prodata.swmed.edu/PROMALS3D. PMID:24170408

  19. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and three-dimensional structural information.

    PubMed

    Pei, Jimin; Grishin, Nick V

    2014-01-01

    Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of three-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D Web server and package are available at http://prodata.swmed.edu/PROMALS3D. PMID:24170408

  1. Using Weighted Sparse Representation Model Combined with Discrete Cosine Transformation to Predict Protein-Protein Interactions from Protein Sequence

    PubMed Central

    Huang, Yu-An; You, Zhu-Hong; Gao, Xin; Wong, Leon; Wang, Lirong

    2015-01-01

    Increasing demand for knowledge about protein-protein interactions (PPIs) is promoting the development of methods for predicting protein interaction networks. Although high-throughput technologies have generated considerable PPI data for various organisms, they have inevitable drawbacks such as high cost, time consumption, and an inherently high false positive rate. For this reason, computational methods are drawing more and more attention for predicting PPIs. In this study, we report a computational method for predicting PPIs using the information of protein sequences. The main improvements come from adopting a novel protein sequence representation by using discrete cosine transform (DCT) on substitution matrix representation (SMR) and from using a weighted sparse representation based classifier (WSRC). When applied to the PPI datasets of Yeast, Human, and H. pylori, the method achieved average accuracies as high as 96.28%, 96.30%, and 86.74%, respectively, significantly better than previous methods. Promising results obtained have proven that the proposed method is feasible, robust, and powerful. To further evaluate the proposed method, we compared it with the state-of-the-art support vector machine (SVM) classifier. Extensive experiments were also performed in which we used Yeast PPI samples as the training set to predict PPIs in five other species datasets. PMID:26634213

  2. Using Weighted Sparse Representation Model Combined with Discrete Cosine Transformation to Predict Protein-Protein Interactions from Protein Sequence.

    PubMed

    Huang, Yu-An; You, Zhu-Hong; Gao, Xin; Wong, Leon; Wang, Lirong

    2015-01-01

    Increasing demand for knowledge about protein-protein interactions (PPIs) is promoting the development of methods for predicting protein interaction networks. Although high-throughput technologies have generated considerable PPI data for various organisms, they have inevitable drawbacks such as high cost, time consumption, and an inherently high false positive rate. For this reason, computational methods are drawing more and more attention for predicting PPIs. In this study, we report a computational method for predicting PPIs using the information of protein sequences. The main improvements come from adopting a novel protein sequence representation by using discrete cosine transform (DCT) on substitution matrix representation (SMR) and from using a weighted sparse representation based classifier (WSRC). When applied to the PPI datasets of Yeast, Human, and H. pylori, the method achieved average accuracies as high as 96.28%, 96.30%, and 86.74%, respectively, significantly better than previous methods. Promising results obtained have proven that the proposed method is feasible, robust, and powerful. To further evaluate the proposed method, we compared it with the state-of-the-art support vector machine (SVM) classifier. Extensive experiments were also performed in which we used Yeast PPI samples as the training set to predict PPIs in five other species datasets. PMID:26634213

  3. 3D reconstruction software comparison for short sequences

    NASA Astrophysics Data System (ADS)

    Strupczewski, Adam; Czupryński, Błażej

    2014-11-01

    Large-scale multiview reconstruction has recently become a very popular area of research. There are many open source tools that can be downloaded and run on a personal computer. However, there are few, if any, comparisons of the available software in terms of accuracy on small datasets that a single user can create. The typical datasets used for testing the software are archeological sites or cities, comprising thousands of images. This paper presents a comparison of currently available open source multiview reconstruction software for small datasets. It also compares the open source solutions with a simple structure-from-motion pipeline developed by the authors from scratch using the OpenCV and Eigen libraries.

  4. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    PubMed Central

    Sharma, Anuj; Manolakos, Elias S.

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
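
    The two summary figures quoted above follow from standard definitions, sketched here for reference: parallel efficiency is speedup divided by the number of cores, and the F-measure is the harmonic mean of precision and recall. The function names are illustrative, and the precision and recall values in the usage lines are hypothetical since the abstract reports only the resulting F-measure.

```python
def parallel_efficiency(speedup, cores):
    """Efficiency of a parallel run: achieved speedup per core."""
    return speedup / cores


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Speedup of 42 on the 48-core SCC, as stated in the abstract.
    print(parallel_efficiency(42, 48))  # 0.875, which rounds to the quoted 0.9
    # Hypothetical precision/recall pair, for illustration only.
    print(f_measure(1.0, 0.5))
```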

  5. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    PubMed

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  6. Microwave-assisted acid and base hydrolysis of intact proteins containing disulfide bonds for protein sequence analysis by mass spectrometry.

    PubMed

    Reiz, Bela; Li, Liang

    2010-09-01

    Controlled hydrolysis of proteins to generate peptide ladders combined with mass spectrometric analysis of the resultant peptides can be used for protein sequencing. In this paper, two methods of improving the microwave-assisted protein hydrolysis process are described to enable rapid sequencing of proteins containing disulfide bonds and increase sequence coverage, respectively. It was demonstrated that proteins containing disulfide bonds could be sequenced by MS analysis by first performing hydrolysis for less than 2 min, followed by 1 h of reduction to release the peptides originally linked by disulfide bonds. It was shown that a strong base could be used as a catalyst for microwave-assisted protein hydrolysis, producing complementary sequence information to that generated by microwave-assisted acid hydrolysis. However, using either acid or base hydrolysis, amide bond breakages in small regions of the polypeptide chains of the model proteins (e.g., cytochrome c and lysozyme) were not detected. Dynamic light scattering measurement of the proteins solubilized in an acid or base indicated that protein-protein interaction or aggregation was not the cause of the failure to hydrolyze certain amide bonds. It was speculated that there were some unknown local structures that might play a role in preventing an acid or base from reacting with the peptide bonds therein.
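
    The ladder-sequencing idea can be sketched as follows: if consecutive ladder peptides differ by exactly one residue, the mass difference between neighbours identifies that residue. This is a simplified illustration under stated assumptions (singly resolved monoisotopic masses, no modifications, a complete ladder), not the authors' analysis pipeline. The function name, the tolerance, and the example starting mass are hypothetical. Leucine and isoleucine have identical masses and are reported together.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(masses, tol=0.01):
    """Assign a residue to each step between consecutive ladder masses.

    Returns residues ordered from the lightest to the heaviest fragment;
    whether that corresponds to N-to-C or C-to-N order depends on which
    terminus the ladder fragments share. Unmatched steps are reported as '?'.
    """
    ordered = sorted(masses)
    residues = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        delta = heavier - lighter
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(delta - m) <= tol]
        residues.append("/".join(matches) if matches else "?")
    return residues


if __name__ == "__main__":
    start = 500.0  # hypothetical mass of the shortest ladder fragment
    ladder = [start]
    for aa in ("G", "A", "K"):
        ladder.append(ladder[-1] + RESIDUE_MASS[aa])
    print(read_ladder(ladder))
```

    A gap in the ladder, such as the undetected amide bond breakages the abstract mentions, would appear here as a step whose mass difference matches no single residue.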

  7. Conservation of Shannon's redundancy for proteins. [information theory applied to amino acid sequences]

    NASA Technical Reports Server (NTRS)

    Gatlin, L. L.

    1974-01-01

    Concepts of information theory are applied to examine various proteins in terms of their redundancy in natural originators such as animals and plants. The Monte Carlo method is used to derive information parameters for random protein sequences. Real protein sequence parameters are compared with the standard parameters of protein sequences having a specific length. The tendency of a chain to contain some amino acids more frequently than others and the tendency of a chain to contain certain amino acid pairs more frequently than other pairs are used as randomness measures of individual protein sequences. Non-periodic proteins are generally found to have random Shannon redundancies except in cases of constraints due to short chain length and genetic codes. Redundant characteristics of highly periodic proteins are discussed. A degree of periodicity parameter is derived.
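
    The composition-based randomness measure mentioned above can be illustrated with a first-order Shannon entropy. This is a minimal sketch, assuming redundancy is taken as 1 - H/log2(20) over the 20 standard amino acids; it is not necessarily the exact parameter set derived in the report, and the function names are illustrative:

```python
import math
from collections import Counter

ALPHABET_SIZE = 20  # assumption: the 20 standard amino acids


def shannon_entropy(seq: str) -> float:
    """First-order Shannon entropy (bits per residue) of a sequence."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def redundancy(seq: str) -> float:
    """Redundancy relative to a uniform 20-letter alphabet (assumed definition)."""
    return 1.0 - shannon_entropy(seq) / math.log2(ALPHABET_SIZE)
```

    A chain that uses all 20 residues equally often has redundancy near 0; a chain made of a single residue has redundancy 1.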

  8. Identification of Disulfide Bonds in Protein Proteolytic Degradation Products Using de Novo-Protein Unique Sequence Tags Approach

    SciTech Connect

    Shen, Yufeng; Tolic, Nikola; Purvine, Samuel O.; Smith, Richard D.

    2010-08-01

    Disulfide bonds are a form of posttranslational modification that often determines protein structure(s) and function(s). In this work, we report a mass spectrometry method for identification of disulfides in degradation products of proteins, and specifically endogenous peptides in the human blood plasma peptidome. LC-Fourier transform tandem mass spectrometry (FT MS/MS) was used for acquiring mass spectra that were de novo sequenced and then searched against the IPI human protein database. Through the use of unique sequence tags (UStags) we unambiguously correlated the spectra to specific database proteins. Examination of the UStags’ prefix and/or suffix sequences that contain cysteine(s) in conjunction with sequences of the UStags-specified database proteins is shown to enable the unambiguous determination of disulfide bonds. Using this method, we identified the intermolecular and intramolecular disulfides in human blood plasma peptidome peptides that have molecular weights of up to ~10 kDa.

  9. Identification of disulfide bonds in protein proteolytic degradation products using de novo-protein unique sequence tags approach.

    PubMed

    Shen, Yufeng; Tolić, Nikola; Purvine, Samuel O; Smith, Richard D

    2010-08-01

    Disulfide bonds are a form of post-translational modification that often determines protein structure(s) and function(s). In this work, we report a mass spectrometry method for identification of disulfides in degradation products of proteins, specifically endogenous peptides in the human blood plasma peptidome. LC-Fourier transform tandem mass spectrometry (FT MS/MS) was used for acquiring mass spectra that were de novo sequenced and then searched against the IPI human protein database. Through the use of unique sequence tags (UStags), we unambiguously correlated the spectra to specific database proteins. Examination of the UStags' prefix and/or suffix sequences that contain cysteine(s) in conjunction with sequences of the UStags-specified database proteins is shown to enable the unambiguous determination of disulfide bonds. Using this method, we identified the intermolecular and intramolecular disulfides in human blood plasma peptidome peptides that have molecular weights of up to approximately 10 kDa. PMID:20590115

  10. Natural vs. random protein sequences: Discovering combinatorics properties on amino acid words.

    PubMed

    Santoni, Daniele; Felici, Giovanni; Vergni, Davide

    2016-02-21

    Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe at present in nature. Which of the two is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences, whereas selection mechanisms are expected to impose rigid constraints on amino acid sequences in order to preserve correct functionality. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that a well-tuned amino acid sequence could plausibly be indistinguishable from a random one. To study whether random and natural amino acid sequences can be discriminated, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones.

  11. Natural vs. random protein sequences: Discovering combinatorics properties on amino acid words.

    PubMed

    Santoni, Daniele; Felici, Giovanni; Vergni, Davide

    2016-02-21

    Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe at present in nature. Which of the two is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences, whereas selection mechanisms are expected to impose rigid constraints on amino acid sequences in order to preserve correct functionality. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that a well-tuned amino acid sequence could plausibly be indistinguishable from a random one. To study whether random and natural amino acid sequences can be discriminated, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones. PMID:26656109
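
    One simple way to build random controls that keep each natural protein's length and amino acid composition is to shuffle its residues. The sketch below assumes this shuffling approach and ten controls per protein (consistent with 10,470 random versus 1047 natural sequences); the authors' actual generation procedure is not described in the abstract, and the function name is hypothetical:

```python
import random


def shuffled_controls(seq: str, n_controls: int = 10, seed: int = 0) -> list[str]:
    """Return n_controls random permutations of seq.

    Each permutation has exactly the same length and residue counts as seq;
    only the residue order is randomized (assumed control scheme).
    """
    rng = random.Random(seed)  # seeded for reproducibility
    controls = []
    for _ in range(n_controls):
        residues = list(seq)
        rng.shuffle(residues)
        controls.append("".join(residues))
    return controls
```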

  12. Close Sequence Comparisons are Sufficient to Identify Human cis-Regulatory Elements

    SciTech Connect

    Prabhakar, Shyam; Poulin, Francis; Shoukry, Malak; Afzal, Veena; Rubin, Edward M.; Couronne, Olivier; Pennacchio, Len A.

    2005-12-01

    Cross-species DNA sequence comparison is the primary method used to identify functional noncoding elements in human and other large genomes. However, little is known about the relative merits of evolutionarily close and distant sequence comparisons, due to the lack of a universal metric for sequence conservation, and also the paucity of empirically defined benchmark sets of cis-regulatory elements. To address this problem, we developed a general-purpose algorithm (Gumby) that detects slowly-evolving regions in primate, mammalian and more distant comparisons without requiring adjustment of parameters, and ranks conserved elements by P-value using Karlin-Altschul statistics. We benchmarked Gumby predictions against previously identified cis-regulatory elements at diverse genomic loci, and also tested numerous extremely conserved human-rodent sequences for transcriptional enhancer activity using reporter-gene assays in transgenic mice. Human regulatory elements were identified with acceptable sensitivity and specificity by comparison with 1-5 other eutherian mammals or 6 other simian primates. More distant comparisons (marsupial, avian, amphibian and fish) failed to identify many of the empirically defined functional noncoding elements. We derived an intuitive relationship between ancient and recent noncoding sequence conservation from whole genome comparative analysis, which explains some of these findings. Lastly, we determined that, in addition to strength of conservation, genomic location and/or density of surrounding conserved elements must also be considered in selecting candidate enhancers for testing at embryonic time points.
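
    Karlin-Altschul statistics convert a local alignment score S into an expected number of chance hits, E = K·m·n·exp(-λS), and a P-value, P = 1 - exp(-E). A minimal sketch of that conversion follows; the parameter values in the usage lines are arbitrary placeholders, and how Gumby estimates K and λ or defines the search space is not stated in the abstract:

```python
import math


def karlin_altschul_pvalue(score: float, m: int, n: int, lam: float, K: float) -> float:
    """P-value of a local similarity score under Karlin-Altschul statistics.

    m and n are the lengths of the two compared sequences (the search space);
    lam and K are the scoring-system parameters, supplied by the caller.
    """
    e_value = K * m * n * math.exp(-lam * score)
    # 1 - exp(-E), computed with expm1 for accuracy when E is small
    return -math.expm1(-e_value)


# Placeholder parameters (assumptions for illustration only).
p_low = karlin_altschul_pvalue(30, m=1000, n=1000, lam=0.3, K=0.1)
p_high = karlin_altschul_pvalue(50, m=1000, n=1000, lam=0.3, K=0.1)
```

    Higher scores give smaller P-values, which is what allows conserved elements to be ranked on a single scale.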

  13. Quantitative Assessment of RNA-Protein Interactions with High Throughput Sequencing - RNA Affinity Profiling (HiTS-RAP)

    PubMed Central

    Ozer, Abdullah; Tome, Jacob M.; Friedman, Robin C.; Gheba, Dan; Schroth, Gary P.; Lis, John T.

    2016-01-01

    Because RNA-protein interactions play a central role in a wide array of biological processes, methods that enable a quantitative assessment of these interactions in a high-throughput manner are in great demand. Recently, we developed the High Throughput Sequencing-RNA Affinity Profiling (HiTS-RAP) assay, which couples sequencing on an Illumina GAIIx with the quantitative assessment of one or several proteins’ interactions with millions of different RNAs in a single experiment. We have successfully used HiTS-RAP to analyze interactions of EGFP and NELF-E proteins with their corresponding canonical and mutant RNA aptamers. Here, we provide a detailed protocol for HiTS-RAP, which can be completed in about a month (8 days hands-on time), including the preparation and testing of recombinant proteins and DNA templates, clustering DNA templates on a flowcell, high-throughput sequencing and protein binding with GAIIx, and finally data analysis. We also highlight aspects of HiTS-RAP that can be further improved and points of comparison between HiTS-RAP and two other recently developed methods, RNA-MaP and RBNS. A successful HiTS-RAP experiment provides the sequence and binding curves for approximately 200 million RNAs in a single experiment. PMID:26182240

  14. Basal Murphy belt and Chilhowee Group -- Sequence stratigraphic comparison

    SciTech Connect

    Aylor, J.G. Jr. (Dept. of Geology)

    1994-03-01

    The lower Murphy belt in the central western Blue Ridge is interpreted to be correlative to the Early Cambrian Chilhowee Group of the westernmost Blue Ridge and Appalachian fold and thrust belt. Basal Murphy belt depositional sequence stratigraphy represents a second-order, type-2 transgressive systems tract initiated with deposition of lowstand turbidites of the Dean Formation. These transgressive deposits of the Nantahala and Brasstown Formations are interpreted as middle to outer continental shelf deposits. Cyclic and stacked third-order regressive, coarsening-upward sequences of the Nantahala Formation display an overall increase in feldspar content stratigraphically upsection. These transgressive siliciclastic deposits are interpreted to be conformably overlain by a carbonate highstand systems tract of the Murphy Marble. Palinspastic reconstruction indicates that the Nantahala and Brasstown Formations possibly represent a basinward extension of a siliciclastic wedge up to 3 km thick. The wedge tapers to the southwest along the strike of the Murphy belt at 10° and thins northwestward to 2 km in the Tennessee depocenter, where it is represented by the Chilhowee Group. The Murphy belt basin is believed to represent a transitional rift-to-drift facies deposited on the lower plate of the southern Blue Ridge rift zone.

  15. Transporter taxonomy - a comparison of different transport protein classification schemes.

    PubMed

    Viereck, Michael; Gaulton, Anna; Digles, Daniela; Ecker, Gerhard F

    2014-06-01

    Currently, there are more than 800 well characterized human membrane transport proteins (including channels and transporters) and there are estimates that about 10% (approx. 2000) of all human genes are related to transport. Membrane transport proteins are of interest as potential drug targets, for drug delivery, and as a cause of side effects and drug–drug interactions. In light of the development of Open PHACTS, which provides an open pharmacological space, we analyzed selected membrane transport protein classification schemes (Transporter Classification Database, ChEMBL, IUPHAR/BPS Guide to Pharmacology, and Gene Ontology) for their ability to serve as a basis for pharmacology driven protein classification. A comparison of these membrane transport protein classification schemes by using a set of clinically relevant transporters as use-case reveals the strengths and weaknesses of the different taxonomy approaches.

  16. Reconstruction of an ancestral Yersinia pestis genome and comparison with an ancient sequence

    PubMed Central

    2015-01-01

    Background: We propose the computational reconstruction of a whole bacterial ancestral genome at the nucleotide scale, and its validation by a sequence of ancient DNA. This rare possibility is offered by an ancient sequence of the late Middle Ages plague agent. It has been hypothesized to be ancestral to extant Yersinia pestis strains based on the pattern of nucleotide substitutions. But the dynamics of indels, duplications, insertion sequences and rearrangements have impacted all genomes much more than the substitution process, which makes the ancestral reconstruction task challenging. Results: We use a set of gene families from 13 Yersinia species, construct reconciled phylogenies for all of them, and determine gene orders in ancestral species. Gene trees integrate information from the sequence, the species tree and gene order. We reconstruct ancestral sequences for ancestral genic and intergenic regions, providing a nearly complete genome sequence for the ancestor, containing a chromosome and three plasmids. Conclusion: The comparison of the ancestral and ancient sequences provides a unique opportunity to assess the quality of ancestral genome reconstruction methods. But the quality of the sequencing and assembly of the ancient sequence can also be questioned by this comparison. PMID:26450112

  17. Protein sequences from mastodon and Tyrannosaurus rex revealed by mass spectrometry.

    PubMed

    Asara, John M; Schweitzer, Mary H; Freimark, Lisa M; Phillips, Matthew; Cantley, Lewis C

    2007-04-13

    Fossilized bones from extinct taxa harbor the potential for obtaining protein or DNA sequences that could reveal evolutionary links to extant species. We used mass spectrometry to obtain protein sequences from bones of a 160,000- to 600,000-year-old extinct mastodon (Mammut americanum) and a 68-million-year-old dinosaur (Tyrannosaurus rex). The presence of T. rex sequences indicates that their peptide bonds were remarkably stable. Mass spectrometry can thus be used to determine unique sequences from ancient organisms from peptide fragmentation patterns, a valuable tool to study the evolution and adaptation of ancient taxa from which genomic sequences are unlikely to be obtained.

  18. Complete sequence of the genome of the human isolate of Andes virus CHI-7913: comparative sequence and protein structure analysis.

    PubMed

    Tischler, Nicole D; Fernández, Jorge; Müller, Ilse; Martínez, Rodrigo; Galeno, Héctor; Villagra, Eliecer; Mora, Judith; Ramírez, Eugenio; Rosemblatt, Mario; Valenzuela, Pablo D

    2003-01-01

    We report here the complete genomic sequence of the Chilean human isolate of Andes virus CHI-7913. The S, M, and L genome segment sequences of this isolate are 1,802, 3,641 and 6,466 bases in length, with an overall GC content of 38.7%. These genome segments code for a nucleocapsid protein of 428 amino acids, a glycoprotein precursor protein of 1,138 amino acids and an RNA-dependent RNA polymerase of 2,152 amino acids. In addition, the genome also has other ORFs coding for putative proteins of 34 to 103 amino acids. The encoded proteins have greater than 98% overall similarity with the proteins of Andes virus isolates AH-1 and Chile R123. Among other sequenced hantaviruses, CHI-7913 is most closely related to Sin Nombre virus, with an overall protein similarity of 92%. The characteristics of the encoded proteins of this isolate, such as hydrophobic domains, glycosylation sites, and conserved amino acid motifs shared with other hantaviruses and other members of the Bunyaviridae family, are identified and discussed.

  19. PROMALS3D web server for accurate multiple protein sequence and structure alignments.

    PubMed

    Pei, Jimin; Tang, Ming; Grishin, Nick V

    2008-07-01

    Multiple sequence alignments are essential in computational sequence and structural analysis, with applications in homology detection, structure modeling, function prediction and phylogenetic analysis. We report the PROMALS3D web server for constructing alignments for multiple protein sequences and/or structures using information from available 3D structures, database homologs and predicted secondary structures. PROMALS3D shows higher alignment accuracy than a number of other advanced methods. Input to the PROMALS3D web server can be FASTA format protein sequences, PDB format protein structures and/or user-defined alignment constraints. The output page provides alignments in several formats, including a colored alignment augmented with useful information about sequence grouping, predicted secondary structures and consensus sequences. Intermediate results of sequence and structural database searches are also available. The PROMALS3D web server is available at: http://prodata.swmed.edu/promals3d/. PMID:18503087

  20. Application of 2D graphic representation of protein sequence based on Huffman tree method.

    PubMed

    Qi, Zhao-Hui; Feng, Jun; Qi, Xiao-Qin; Li, Ling

    2012-05-01

    Based on the Huffman tree method, we propose a new 2D graphic representation of protein sequences. This representation completely avoids loss of information in the transfer of data from a protein sequence to its graphic representation. The method consists of two parts. The first assigns 0-1 codes to the 20 amino acids using a Huffman tree built from amino acid frequencies, where the frequency of an amino acid is defined as its count in the analyzed protein sequences. The second is the 2D graphic representation of a protein sequence based on the 0-1 codes. The application of the method to ten ND5 genes and seven Escherichia coli strains is then presented in detail. The results show that the proposed model may provide new insights into the evolution patterns determined from protein sequences and complete genomes.
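
    The first part of the method, deriving 0-1 codes from amino acid counts, can be sketched with a standard Huffman construction. This is an illustrative sketch only: the tie-breaking rule and the 0/1 branch labeling are assumptions, and the paper's 2D plotting step is not reproduced here:

```python
import heapq
from collections import Counter


def huffman_codes(sequences: list[str]) -> dict[str, str]:
    """Build 0-1 Huffman codes from residue counts in the given sequences."""
    freq = Counter("".join(sequences))
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # lightest subtree
        w2, _, right = heapq.heappop(heap)  # second lightest subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


codes = huffman_codes(["AAAABBC"])
encoded = "".join(codes[r] for r in "AAAABBC")
```

    Frequent residues receive short codes, and because Huffman codes are prefix-free, the bit string decodes back to the original sequence without loss.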

  1. Rapid removal of unincorporated label and proteins from DNA sequencing reactions.

    PubMed

    Kaczorowski, T; Sektas, M

    1996-04-01

    This article presents a simple and rapid method for removal of unincorporated label and proteins from DNA sequencing reactions by using Wizard purification resin. This method can be successfully applied for preparation of end-labeled oligonucleotides free of unincorporated label, which is important in experiments (including DNA sequencing) when the level of background should be as low as possible. Also, this method is effective in removal of proteins from DNA sequencing reactions. PMID:8734430

  2. Reprint of "Identification of staphylococcal species based on variations in protein sequences (mass spectrometry) and DNA sequence (sodA microarray)".

    PubMed

    Kooken, Jennifer; Fox, Karen; Fox, Alvin; Altomare, Diego; Creek, Kim; Wunschel, David; Pajares-Merino, Sara; Martínez-Ballesteros, Ilargi; Garaizar, Javier; Oyarzabal, Omar; Samadpour, Mansour

    2014-01-01

    This report is among the first using sequence variation in newly discovered protein markers for staphylococcal (or indeed any other bacterial) speciation. Variation, at the DNA sequence level, in the sodA gene (commonly used for staphylococcal speciation) provided excellent correlation. Relatedness among strains was also assessed by protein profiling using microcapillary electrophoresis and pulsed field electrophoresis. A total of 64 strains were analyzed, including reference strains representing the 11 staphylococcal species most commonly isolated from man (Staphylococcus aureus and 10 coagulase negative species [CoNS]). Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI TOF MS) and liquid chromatography-electrospray ionization tandem mass spectrometry (LC ESI MS/MS) were used for peptide analysis of proteins isolated from gel bands. Comparison of experimental spectra of unknowns versus spectra of peptides derived from reference strains allowed bacterial identification after MALDI TOF MS analysis. After LC-MS/MS analysis of gel bands, bacterial speciation was performed by comparing experimental spectra versus virtual spectra using the software X!Tandem. Finally, LC-MS/MS was performed on whole proteomes, with data analysis again employing X!Tandem. Aconitate hydratase and oxoglutarate dehydrogenase served as marker proteins in focused analysis after gel separation. Alternatively, in full proteomics analysis, elongation factor Tu generally provided the highest confidence in staphylococcal speciation.

  3. Development of a protein microarray using sequence-specific DNA binding domain on DNA chip surface

    SciTech Connect

    Choi, Yoo Seong; Pack, Seung Pil; Yoo, Young Je (E-mail: yjyoo@snu.ac.kr)

    2005-04-22

    A protein microarray based on a DNA microarray platform was developed to identify protein-protein interactions in vitro. A conventional DNA chip surface carrying a 156-bp PCR product was prepared as the substrate for the protein microarray. A high-affinity sequence-specific DNA binding domain, the GAL4 DNA binding domain, was introduced to the protein microarray as the fusion partner of a model target protein, enhanced green fluorescent protein. The target protein was immobilized directly on the DNA chip surface in an oriented manner. Finally, a monoclonal antibody against the target protein was used to identify the immobilized protein on the surface. This study shows that a conventional DNA chip can be used to make a protein microarray directly, and that this novel protein microarray is applicable as a tool for identifying protein-protein interactions.

  4. Hydrophobic blocks facilitate lipid compatibility and translocon recognition of transmembrane protein sequences.

    PubMed

    Stone, Tracy A; Schiller, Nina; von Heijne, Gunnar; Deber, Charles M

    2015-02-24

    Biophysical hydrophobicity scales suggest that partitioning of a protein segment from an aqueous phase into a membrane is governed by its perceived segmental hydrophobicity but do not establish specifically (i) how the segment is identified in vivo for translocon-mediated insertion or (ii) whether the destination lipid bilayer is biochemically receptive to the inserted sequence. To examine the congruence between these dual requirements, we designed and synthesized a library of Lys-tagged peptides of a core length sufficient to span a bilayer but with varying patterns of sequence, each composed of nine Leu residues, nine Ser residues, and one (central) Trp residue. We found that peptides containing contiguous Leu residues (Leu-block peptides, e.g., LLLLLLLLLWSSSSSSSSS), in comparison to those containing discontinuous stretches of Leu residues (non-Leu-block peptides, e.g., SLSLLSLSSWSLLSLSLLS), displayed greater helicity (circular dichroism spectroscopy), traveled slower during sodium dodecyl sulfate-polyacrylamide gel electrophoresis, had longer reverse phase high-performance liquid chromatography retention times on a C-18 column, and were helical when reconstituted into 1-palmitoyl-2-oleoylglycero-3-phosphocholine liposomes, each observation indicating superior lipid compatibility when a Leu-block is present. These parameters were largely paralleled in a biological membrane insertion assay using microsomal membranes from dog pancreas endoplasmic reticulum, where we found only the Leu-block sequences successfully inserted; intriguingly, an amphipathic peptide (SLLSSLLSSWLLSSLLSSL; Leu face, Ser face) with biophysical properties similar to those of Leu-block peptides failed to insert. Our overall results identify local sequence lipid compatibility rather than average hydrophobicity as a principal determinant of transmembrane segment potential, while demonstrating that further subtleties of hydrophobic and helical patterning, such as circumferential hydrophobicity in

  5. Hydrophobic Blocks Facilitate Lipid Compatibility and Translocon Recognition of Transmembrane Protein Sequences

    PubMed Central

    2016-01-01

    Biophysical hydrophobicity scales suggest that partitioning of a protein segment from an aqueous phase into a membrane is governed by its perceived segmental hydrophobicity but do not establish specifically (i) how the segment is identified in vivo for translocon-mediated insertion or (ii) whether the destination lipid bilayer is biochemically receptive to the inserted sequence. To examine the congruence between these dual requirements, we designed and synthesized a library of Lys-tagged peptides of a core length sufficient to span a bilayer but with varying patterns of sequence, each composed of nine Leu residues, nine Ser residues, and one (central) Trp residue. We found that peptides containing contiguous Leu residues (Leu-block peptides, e.g., LLLLLLLLLWSSSSSSSSS), in comparison to those containing discontinuous stretches of Leu residues (non-Leu-block peptides, e.g., SLSLLSLSSWSLLSLSLLS), displayed greater helicity (circular dichroism spectroscopy), traveled slower during sodium dodecyl sulfate–polyacrylamide gel electrophoresis, had longer reverse phase high-performance liquid chromatography retention times on a C-18 column, and were helical when reconstituted into 1-palmitoyl-2-oleoylglycero-3-phosphocholine liposomes, each observation indicating superior lipid compatibility when a Leu-block is present. These parameters were largely paralleled in a biological membrane insertion assay using microsomal membranes from dog pancreas endoplasmic reticulum, where we found only the Leu-block sequences successfully inserted; intriguingly, an amphipathic peptide (SLLSSLLSSWLLSSLLSSL; Leu face, Ser face) with biophysical properties similar to those of Leu-block peptides failed to insert. Our overall results identify local sequence lipid compatibility rather than average hydrophobicity as a principal determinant of transmembrane segment potential, while demonstrating that further subtleties of hydrophobic and helical patterning, such as circumferential hydrophobicity

  6. Sequence signatures and the probabilistic identification of proteins in the Myc-Max-Mad network.

    PubMed

    Atchley, William R; Fernandes, Andrew D

    2005-05-01

    Accurate identification of specific groups of proteins by their amino acid sequence is an important goal in genome research. Here we combine information theory with fuzzy logic search procedures to identify sequence signatures or predictive motifs for members of the Myc-Max-Mad transcription factor network. Myc is a well known oncoprotein, and this family is involved in cell proliferation, apoptosis, and differentiation. We describe a small set of amino acid sites from the N-terminal portion of the basic helix-loop-helix (bHLH) domain that provide very accurate sequence signatures for the Myc-Max-Mad transcription factor network and three of its member proteins. A predictive motif involving 28 contiguous bHLH sequence elements found 337 network proteins in the GenBank NR database with no mismatches or misidentifications. This motif also identifies at least one previously unknown fungal protein with strong affinity to the Myc-Max-Mad network. Another motif found 96% of known Myc protein sequences with only a single mismatch, including sequences from genomes previously not thought to contain Myc proteins. The predictive motif for Myc is very similar to the ancestral sequence for the Myc group estimated from phylogenetic analyses. Based on available crystal structure studies, this motif is discussed in terms of its functional consequences. Our results provide insight into evolutionary diversification of DNA binding and dimerization in a well characterized family of regulatory proteins and provide a method of identifying signature motifs in protein families.

  7. Secure distributed genome analysis for GWAS and sequence comparison computation

    PubMed Central

    2015-01-01

    Background: The rapid increase in the availability and volume of genomic data makes significant advances in biomedical research possible, but sharing of genomic data poses challenges due to the highly sensitive nature of such data. To address the challenges, a competition for secure distributed processing of genomic data was organized by the iDASH research center. Methods: In this work we propose techniques for securing computation with real-life genomic data for minor allele frequency and chi-squared statistics computation, as well as distance computation between two genomic sequences, as specified by the iDASH competition tasks. We put forward novel optimizations, including a generalization of a version of mergesort, which might be of independent interest. Results: We provide implementation results of our techniques based on secret sharing that demonstrate the practicality of the suggested protocols and also report on performance improvements due to our optimization techniques. Conclusions: This work describes our techniques, findings, and experimental results developed and obtained as part of the iDASH 2015 research competition to secure real-life genomic computations and shows the feasibility of securely computing with genomic data in practice. PMID:26733307
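
    The idea of computing on secret-shared data can be illustrated with additive secret sharing, where several sites sum their private allele counts without revealing any individual count. This is a toy sketch under stated assumptions (honest parties, an arbitrarily chosen prime modulus, hypothetical function names); it is not the protocol proposed by the authors:

```python
import secrets

PRIME = 2**61 - 1  # assumption: any prime larger than the largest possible sum


def share(value: int, n_parties: int) -> list[int]:
    """Split value into n_parties additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME


def secure_sum(private_values: list[int], n_parties: int = 3) -> int:
    """Sum private values; each party only ever sees one share per value."""
    all_shares = [share(v, n_parties) for v in private_values]
    # Party j locally adds the j-th share of every input.
    partial_sums = [sum(column) % PRIME for column in zip(*all_shares)]
    return reconstruct(partial_sums)


# Hypothetical minor-allele counts held by three data owners.
total_minor_alleles = secure_sum([12, 30, 7])
```

    Dividing such a total by the (similarly summed) number of alleles gives a minor allele frequency without any site disclosing its own counts.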

  8. Investigation of the protein osteocalcin of Camelops hesternus: Sequence, structure and phylogenetic implications

    NASA Astrophysics Data System (ADS)

    Humpula, James F.; Ostrom, Peggy H.; Gandhi, Hasand; Strahler, John R.; Walker, Angela K.; Stafford, Thomas W.; Smith, James J.; Voorhies, Michael R.; George Corner, R.; Andrews, Phillip C.

    2007-12-01

    Ancient DNA sequences offer an extraordinary opportunity to unravel the evolutionary history of ancient organisms. Protein sequences offer another reservoir of genetic information that has recently become tractable through the application of mass spectrometric techniques. The extent to which ancient protein sequences resolve phylogenetic relationships, however, has not been explored. We determined the osteocalcin amino acid sequence from the bone of an extinct camelid (21 ka, Camelops hesternus) excavated from Isleta Cave, New Mexico, and three bones of extant camelids: bactrian camel (Camelus bactrianus), dromedary camel (Camelus dromedarius), and guanaco (Llama guanacoe) for a diagenetic and phylogenetic assessment. There was no difference in sequence among the four taxa. Structural attributes observed in both modern and ancient osteocalcin include a post-translational modification, Hyp 9, deamidation of Gln 35 and Gln 39, and oxidation of Met 36. Carbamylation of the N-terminus in ancient osteocalcin may result in blockage and explain previous difficulties in sequencing ancient proteins via Edman degradation. A phylogenetic analysis using osteocalcin sequences of 25 vertebrate taxa was conducted to explore osteocalcin protein evolution and the utility of osteocalcin sequences for delineating phylogenetic relationships. The maximum likelihood tree closely reflected generally recognized taxonomic relationships. For example, maximum likelihood analysis recovered rodents, birds and, within hominins, the Homo-Pan-Gorilla trichotomy. Within Artiodactyla, character state analysis showed that a substitution of Pro 4 for His 4 defines the Capra-Ovis clade. Homoplasy in our analysis indicated that osteocalcin evolution is not a perfect indicator of species evolution. Limited sequence availability prevented assigning functional significance to sequence changes. Our preliminary analysis of osteocalcin evolution represents an initial step towards a

  9. Structured States of Disordered Proteins from Genomic Sequences.

    PubMed

    Toth-Petroczy, Agnes; Palmedo, Perry; Ingraham, John; Hopf, Thomas A; Berger, Bonnie; Sander, Chris; Marks, Debora S

    2016-09-22

    Protein flexibility ranges from simple hinge movements to functional disorder. Around half of all human proteins contain apparently disordered regions with little 3D or functional information, and many of these proteins are associated with disease. Building on the evolutionary couplings approach previously successful in predicting 3D states of ordered proteins and RNA, we developed a method to predict the potential for ordered states for all apparently disordered proteins with sufficiently rich evolutionary information. The approach is highly accurate (79%) for residue interactions as tested in more than 60 known disordered regions captured in a bound or specific condition. Assessing the potential for structure of more than 1,000 apparently disordered regions of human proteins reveals a continuum of structural order with at least 50% with clear propensity for three- or two-dimensional states. Co-evolutionary constraints reveal hitherto unseen structures of functional importance in apparently disordered proteins. PMID:27662088

  11. Comparison of surface and hydrogel-based protein microchips.

    PubMed

    Zubtsov, D A; Savvateeva, E N; Rubina, A Yu; Pan'kov, S V; Konovalova, E V; Moiseeva, O V; Chechetkin, V R; Zasedatelev, A S

    2007-09-15

    Protein microchips are designed for high-throughput evaluation of the concentrations and activities of various proteins. The rapid advance in microchip technology and a wide variety of existing techniques pose the problem of a unified approach to the assessment and comparison of different platforms. Here we compare the characteristics of protein microchips developed for quantitative immunoassay with those of antibodies immobilized on glass surfaces and in hemispherical gel pads. Spotting concentrations of antibodies used for manufacturing microchips of both types and concentrations of antigen in analyte solution were identical. We compared the efficiency of antibody immobilization, the intensity of fluorescence signals for both direct and sandwich-type immunoassays, and the reaction-diffusion kinetics of the formation of antibody-antigen complexes for surface and gel-based microchips. Our results demonstrate higher capacity and sensitivity for the hydrogel-based protein microchips, while fluorescence saturation kinetics for the two types of microarrays were comparable.

  12. Locating tandem repeats in weighted sequences in proteins.

    PubMed

    Zhang, Hui; Guo, Qing; Iliopoulos, Costas S

    2013-01-01

    A weighted biological sequence is a string in which a set of characters may appear at each position with respective probabilities of occurrence. We attempt to locate all the tandem repeats in a weighted sequence. A repeated substring is called a tandem repeat if each occurrence of the substring is directly adjacent to each other. By introducing the idea of equivalence classes in weighted sequences, we identify the tandem repeats of every possible length using an iterative partitioning technique. We also present the algorithm for recording the tandem repeats, and prove that the problem can be solved in O(n²) time. PMID:23815711
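
    As a rough illustration of the problem statement in this record, the sketch below finds tandem repeats in a weighted sequence by brute force. It is not the authors' equivalence-class algorithm and makes no attempt to reach their O(n²) bound; the representation (a list of dicts mapping characters to probabilities), the function name, and the probability threshold `min_prob` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one dict per position, character -> probability.
WeightedSeq = List[Dict[str, float]]


def tandem_repeats(seq: WeightedSeq, min_prob: float) -> List[Tuple[int, int, str, float]]:
    """Naive search for tandem repeats uu in a weighted sequence.

    For each start and period, the most probable repeated unit is built
    position by position; the repeat is reported when the joint probability
    of both adjacent copies is at least min_prob.
    Returns tuples of (start, period, unit, probability).
    """
    n = len(seq)
    hits = []
    for period in range(1, n // 2 + 1):
        for start in range(0, n - 2 * period + 1):
            prob = 1.0
            unit = []
            for j in range(period):
                left = seq[start + j]
                right = seq[start + period + j]
                # Pick the character most likely to occur in both copies.
                best_char, best_p = None, 0.0
                for char, p in left.items():
                    joint = p * right.get(char, 0.0)
                    if joint > best_p:
                        best_char, best_p = char, joint
                if best_char is None:
                    prob = 0.0
                    break
                unit.append(best_char)
                prob *= best_p
            if prob > 0.0 and prob >= min_prob:
                hits.append((start, period, "".join(unit), prob))
    return hits
```

    Only the single most probable unit is reported for each start and period, so a position with several equally acceptable characters is under-reported; an exhaustive version would enumerate every unit above the threshold.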

  13. Comparison of amino acid sequences of the trypsin inhibitors from taro (Colocasia esculenta), giant taro (Alocasia macrorrhiza) and giant swamp taro (Cyrtosperma chamissonis).

    PubMed

    Peng, L; Bradbury, J H; Hammer, B C; Shaw, D C

    1993-09-01

    The amino acid sequences of the trypsin inhibitors from taro Colocasia esculenta var. esculenta and giant swamp taro Cyrtosperma chamissonis have been determined and are compared with the protein sequence of the trypsin/chymotrypsin inhibitor from giant taro Alocasia macrorrhiza. Both inhibitors display polymorphism and there is evidence of two components in the giant swamp taro. The positional identity between the proteins is highest at 73-75% for the comparison of the giant taro (GT) with the polymorphic forms of the taro (T) inhibitors and lowest at 56-58% for the pairs of taro and giant swamp taro (GST) proteins. The comparisons show that the inhibitors from T and GT are more related to each other than to GST, which supports their taxonomic classification into different tribes. Location of the P1 site for the trypsin inhibitors of aroids is different from that of other Kunitz-type inhibitors and could be at Leu56.
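
    The positional identity figures quoted in this record can be illustrated with a small sketch. It assumes the two sequences are already aligned with "-" as the gap character and that gapped columns are excluded from the denominator; the abstract does not say which convention the authors used, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over ungapped aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        # Assumption: columns with a gap in either sequence are not counted.
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared
```

    Other denominators (alignment length, or the length of the shorter sequence) give different percentages for the same alignment, which is one reason identity values from different studies are not always directly comparable.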

  14. Full validation of therapeutic antibody sequences by middle-up mass measurements and middle-down protein sequencing.

    PubMed

    Resemann, Anja; Jabs, Wolfgang; Wiechmann, Anja; Wagner, Elsa; Colas, Olivier; Evers, Waltraud; Belau, Eckhard; Vorwerg, Lars; Evans, Catherine; Beck, Alain; Suckau, Detlev

    2016-01-01

    The regulatory bodies request full sequence data assessment both for innovator and biosimilar monoclonal antibodies (mAbs). Full sequence coverage is typically used to verify the integrity of the analytical data obtained following the combination of multiple LC-MS/MS datasets from orthogonal protease digests (so-called "bottom-up" approaches). Top-down or middle-down mass spectrometric approaches have the potential to minimize artifacts, reduce overall analysis time and provide orthogonality to this traditional approach. In this work we report a new combined approach involving middle-up LC-QTOF and middle-down LC-MALDI in-source decay (ISD) mass spectrometry. This was applied to cetuximab, panitumumab and natalizumab, selected as representative US Food and Drug Administration- and European Medicines Agency-approved mAbs. The goal was to unambiguously confirm their reference sequences and examine the general applicability of this approach. Furthermore, a new measure for assessing the integrity and validity of results from middle-down approaches is introduced - the "Sequence Validation Percentage." Full sequence data assessment of the 3 antibodies was achieved enabling all 3 sequences to be fully validated by a combination of middle-up molecular weight determination and middle-down protein sequencing. Three errors in the reference amino acid sequence of natalizumab, causing a cumulative mass shift of only -2 Da in the natalizumab Fd domain, were corrected as a result of this work.

  15. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    SciTech Connect

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-12-28

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences.
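
    Of the three parameterization ingredients named in this record, Boltzmann inversion is the simplest to sketch. The example below uses the textbook form U(x) = -k_B T ln P(x) on a histogram of a coordinate sampled from a reference simulation; it is not taken from CAMELOT itself, and the function name, units, and the choice to shift the minimum to zero are assumptions.

```python
import math
from typing import List

# Boltzmann constant in kcal/(mol*K); the unit choice is an assumption.
K_B = 0.0019872041


def boltzmann_invert(histogram: List[float], temperature: float) -> List[float]:
    """Convert sampled bin counts into a potential of mean force.

    Applies U = -k_B * T * ln(P) per bin, where P is the normalized count,
    then shifts the result so the lowest populated bin has U = 0.
    Empty bins are assigned an infinite potential.
    """
    total = sum(histogram)
    if total <= 0:
        raise ValueError("histogram must contain at least one sample")
    potentials = []
    for count in histogram:
        if count <= 0:
            potentials.append(math.inf)
        else:
            potentials.append(-K_B * temperature * math.log(count / total))
    shift = min(u for u in potentials if u != math.inf)
    return [u - shift for u in potentials]
```

    For a radial distance, the counts would normally be divided by the volume element of each bin before inversion; that normalization is left out here to keep the sketch short.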

  16. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    NASA Astrophysics Data System (ADS)

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-12-01

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences.

  17. How Many Proteins are Missed in Quantitative Proteomics Based on MS/MS Sequencing Methods?

    PubMed Central

    Mulvey, Claire; Thur, Bettina; Crawford, Mark; Godovac-Zimmermann, Jasminka

    2014-01-01

    Current bottom-up quantitative proteomics methods based on MS/MS sequencing of peptides are shown to be strongly dependent on sample preparation. Using cytosolic proteins from MCF-7 breast cancer cells, it is shown that protein pre-fractionation based on pI and MW is more effective than pre-fractionation using only MW in increasing the number of observed proteins (947 vs. 704 proteins) and the number of spectral counts per protein. Combination of MS data from the different pre-fractionation methods results in further improvements (1238 proteins). We argue that at present the main limitation on quantitation by MS/MS sequencing is not MS sensitivity and protein abundance, but rather extensive peptide overlap and limited MS/MS sequencing throughput, and that this favors internally calibrated methods such as SILAC, ICAT or iTRAQ over spectral counting methods in attempts to drastically improve proteome coverage of biological samples. PMID:25729266

  18. Comparison and analysis of the nucleotide sequences of pilin genes from Haemophilus influenzae type b strains Eagan and M43.

    PubMed Central

    Forney, L J; Marrs, C F; Bektesh, S L; Gilsdorf, J R

    1991-01-01

    Previous studies have demonstrated antigenic differences among the pili expressed by various strains of Haemophilus influenzae type b (Hib). In order to understand the molecular basis for these differences, the structural gene for pilin was cloned from Hib strain Eagan (p+) and the nucleotide sequence was compared to those of strains M43 (p+) and 770235 b0f+, which had been previously determined. The pilin gene of Hib strain Eagan (p+) had a 648-bp open reading frame that encoded a 20-amino-acid leader sequence followed by the 196 amino acids found in mature pilin. The translated sequence was three amino acids larger than pilins of strains M43 (p+) and 770235 b0f+ and was 78% identical and 95% homologous when conservative amino acid substitutions were considered. Differences between the amino acid sequences were not localized to any one region but rather were distributed throughout the proteins. Comparison of protein hydrophilicity profiles showed several hydrophilic regions with sequences that were conserved between strain Eagan (p+) and pilins of other Hib strains, and these regions represent potentially conserved antigenic domains. Southern blot analyses using an intragenic probe from the pilin gene of strain Eagan (p+) showed that the pilin gene was conserved among all type b and nontypeable strains of H. influenzae examined, and only a single copy was present in these strains. Homologous genes were not present in the phylogenetically related species Pasteurella multocida, Pasteurella haemolytica, and Actinobacillus pleuropneumoniae. These data indicate that the pilin gene was highly conserved among different strains of H. influenzae and that small differences in the pilin amino acid sequences account for the observed antigenic differences of assembled pili from these strains. PMID:2037360

  19. UET: a database of evolutionarily-predicted functional determinants of protein sequences that cluster as functional sites in protein structures.

    PubMed

    Lua, Rhonald C; Wilson, Stephen J; Konecki, Daniel M; Wilkins, Angela D; Venner, Eric; Morgan, Daniel H; Lichtarge, Olivier

    2016-01-01

    The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/.

  20. Comparative sequence analysis of double stranded RNA binding protein encoding gene of parapoxviruses from Indian camels.

    PubMed

    Nagarajan, G; Swami, Shelesh Kumar; Dahiya, Shyam Singh; Sivakumar, G; Tuteja, F C; Narnaware, S D; Mehta, S C; Singh, Raghvendar; Patil, N V

    2014-03-01

    The dsRNA binding protein (RBP) encoding gene of parapoxviruses (PPVs) from dromedary camels inhabiting different geographical regions of Rajasthan, India, was amplified by polymerase chain reaction using the primers of pseudocowpoxvirus (PCPV) from Finnish reindeer and cloned into pGEM-T for sequence analysis. Analysis of the RBP encoding gene revealed that PPV DNA from Bikaner shared 98.3% and 76.6% sequence identity at the amino acid level with Pali and Udaipur PPV DNA, respectively. Reference strains of bovine papular stomatitis virus (BPSV) and PCPV (reindeer PCPV and human PCPV) shared 52.8% and 86.9% amino acid identity with the RBP gene of camel PPVs from Bikaner, respectively. However, strains of orf virus (ORFV) from different geographical areas of the world shared 69.5-71.7% amino acid identity with the RBP gene of camel PPVs from Bikaner. These findings indicate that the camel PPVs described are more closely related to bovine PPV (PCPV) than to caprine and ovine PPV (ORFV). PMID:25685494

  1. In Silico Characterization of Pectate Lyase Protein Sequences from Different Source Organisms

    PubMed Central

    Dubey, Amit Kumar; Yadav, Sangeeta; Kumar, Manish; Singh, Vinay Kumar; Sarangi, Bijaya Ketan; Yadav, Dinesh

    2010-01-01

    A total of 121 protein sequences of pectate lyases were subjected to homology search, multiple sequence alignment, phylogenetic tree construction, and motif analysis. The phylogenetic tree constructed revealed different clusters based on different source organisms representing bacterial, fungal, plant, and nematode pectate lyases. The multiple accessions of bacterial, fungal, nematode, and plant pectate lyase protein sequences were placed closely revealing a sequence level similarity. The multiple sequence alignment of these pectate lyase protein sequences from different source organisms showed conserved regions at different stretches with maximum homology from amino acid residues 439–467, 715–816, and 829–910 which could be used for designing degenerate primers or probes specific for pectate lyases. The motif analysis revealed a conserved Pec_Lyase_C domain uniformly observed in all pectate lyases irrespective of variable sources suggesting its possible role in structural and enzymatic functions. PMID:21048874

  2. Genome-Wide SNP Calling from Genotyping by Sequencing (GBS) Data: A Comparison of Seven Pipelines and Two Sequencing Technologies.

    PubMed

    Torkamaneh, Davoud; Laroche, Jérôme; Belzile, François

    2016-01-01

    Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways including new methods of high throughput genotyping. Genotyping-by-sequencing (GBS) has been demonstrated to be a robust and cost-effective genotyping method capable of producing thousands to millions of SNPs across a wide range of species. Undoubtedly, the greatest barrier to its broader use is the challenge of data analysis. Herein we describe a comprehensive comparison of seven GBS bioinformatics pipelines developed to process raw GBS sequence data into SNP genotypes. We compared five pipelines requiring a reference genome (TASSEL-GBS v1 & v2, Stacks, IGST, and Fast-GBS) and two de novo pipelines that do not require a reference genome (UNEAK and Stacks). Using Illumina sequence data from a set of 24 re-sequenced soybean lines, we performed SNP calling with these pipelines and compared the GBS SNP calls with the re-sequencing data to assess their accuracy. The number of SNPs called without a reference genome was lower (13k to 24k) than with a reference genome (25k to 54k SNPs) while accuracy was high (92.3 to 98.7%) for all but one pipeline (TASSEL-GBSv1, 76.1%). Among pipelines offering a high accuracy (>95%), Fast-GBS called the greatest number of polymorphisms (close to 35,000 SNPs + Indels) and yielded the highest accuracy (98.7%). Using Ion Torrent sequence data for the same 24 lines, we compared the performance of Fast-GBS with that of TASSEL-GBSv2. It again called more polymorphisms (25.8K vs 22.9K) and these proved more accurate (95.2 vs 91.1%). Typically, SNP catalogues called from the same sequencing data using different pipelines resulted in highly overlapping SNP catalogues (79-92% overlap). In contrast, overlap between SNP catalogues obtained using the same pipeline but different sequencing technologies was less extensive (~50-70%). PMID:27547936
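
    The two kinds of figures reported in this record, genotype accuracy against re-sequencing data and overlap between SNP catalogues, can be sketched as follows. The abstract does not define either metric precisely, so the definitions here are assumptions: accuracy is the share of matching genotypes at sites present in both call sets, and overlap is the shared sites as a share of the union. The data layout (a dict keyed by chromosome and position) and function names are also hypothetical.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]            # (chromosome, position)
Calls = Dict[Site, str]           # site -> genotype string, e.g. "A/T"


def genotype_accuracy(calls: Calls, truth: Calls) -> float:
    """Percentage of shared sites where the call matches the reference call."""
    shared = set(calls) & set(truth)
    if not shared:
        return 0.0
    agree = sum(1 for site in shared if calls[site] == truth[site])
    return 100.0 * agree / len(shared)


def catalogue_overlap(sites_a: Set[Site], sites_b: Set[Site]) -> float:
    """Shared sites as a percentage of all sites in either catalogue."""
    union = sites_a | sites_b
    if not union:
        return 0.0
    return 100.0 * len(sites_a & sites_b) / len(union)
```

    Dividing by the smaller catalogue instead of the union would give a higher overlap for the same data, so the denominator should be stated whenever such percentages are compared.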

  3. Genome-Wide SNP Calling from Genotyping by Sequencing (GBS) Data: A Comparison of Seven Pipelines and Two Sequencing Technologies

    PubMed Central

    Torkamaneh, Davoud; Laroche, Jérôme; Belzile, François

    2016-01-01

    Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways including new methods of high throughput genotyping. Genotyping-by-sequencing (GBS) has been demonstrated to be a robust and cost-effective genotyping method capable of producing thousands to millions of SNPs across a wide range of species. Undoubtedly, the greatest barrier to its broader use is the challenge of data analysis. Herein we describe a comprehensive comparison of seven GBS bioinformatics pipelines developed to process raw GBS sequence data into SNP genotypes. We compared five pipelines requiring a reference genome (TASSEL-GBS v1 & v2, Stacks, IGST, and Fast-GBS) and two de novo pipelines that do not require a reference genome (UNEAK and Stacks). Using Illumina sequence data from a set of 24 re-sequenced soybean lines, we performed SNP calling with these pipelines and compared the GBS SNP calls with the re-sequencing data to assess their accuracy. The number of SNPs called without a reference genome was lower (13k to 24k) than with a reference genome (25k to 54k SNPs) while accuracy was high (92.3 to 98.7%) for all but one pipeline (TASSEL-GBSv1, 76.1%). Among pipelines offering a high accuracy (>95%), Fast-GBS called the greatest number of polymorphisms (close to 35,000 SNPs + Indels) and yielded the highest accuracy (98.7%). Using Ion Torrent sequence data for the same 24 lines, we compared the performance of Fast-GBS with that of TASSEL-GBSv2. It again called more polymorphisms (25.8K vs 22.9K) and these proved more accurate (95.2 vs 91.1%). Typically, SNP catalogues called from the same sequencing data using different pipelines resulted in highly overlapping SNP catalogues (79–92% overlap). In contrast, overlap between SNP catalogues obtained using the same pipeline but different sequencing technologies was less extensive (~50–70%). PMID:27547936

  4. Amino acid sequence of the encephalitogenic basic protein from human myelin

    PubMed Central

    Carnegie, P. R.

    1971-01-01

    Myelin from the central nervous system contains an unusual basic protein, which can induce experimental autoimmune encephalomyelitis. The basic protein from human brain was digested with trypsin and other enzymes and the sequence of the 170 amino acids was determined. The localization of the encephalitogenic determinants was described. Possible roles for the protein in the structure and function of myelin are discussed. PMID:4108501

  5. Using evolutionary sequence variation to make inferences about protein structure and function

    NASA Astrophysics Data System (ADS)

    Colwell, Lucy

    2015-03-01

    The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. The explosive growth in the number of available protein sequences raises the possibility of using the natural variation present in homologous protein sequences to infer these constraints and thus identify residues that control different protein phenotypes. Because in many cases phenotypic changes are controlled by more than one amino acid, the mutations that separate one phenotype from another may not be independent, requiring us to understand the correlation structure of the data. To address this we build a maximum entropy probability model for the protein sequence. The parameters of the inferred model are constrained by the statistics of a large sequence alignment. Pairs of sequence positions with the strongest interactions accurately predict contacts in protein tertiary structure, enabling all atom structural models to be constructed. We describe development of a theoretical inference framework that enables the relationship between the amount of available input data and the reliability of structural predictions to be better understood.
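
    The alignment statistics that constrain a maximum entropy model of this kind are usually the single-site and pairwise residue frequencies. The sketch below computes only those empirical frequencies; it does not fit the model, and it omits the sequence reweighting and pseudocounts that practical pipelines commonly apply. The function name and return layout are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def alignment_statistics(alignment: List[str]):
    """Empirical single-site and pairwise frequencies of an alignment.

    Returns (f_single, f_pair) where f_single[i][a] is the fraction of
    sequences with residue a in column i, and f_pair[(i, j)][(a, b)] is
    the fraction with a in column i and b in column j (i < j).
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n_seqs = len(alignment)
    single = [Counter() for _ in range(length)]
    pair: Dict[Tuple[int, int], Counter] = {}
    for seq in alignment:
        for i, residue in enumerate(seq):
            single[i][residue] += 1
        for i, j in combinations(range(length), 2):
            pair.setdefault((i, j), Counter())[(seq[i], seq[j])] += 1

    f_single = [{a: c / n_seqs for a, c in col.items()} for col in single]
    f_pair = {
        cols: {ab: c / n_seqs for ab, c in counts.items()}
        for cols, counts in pair.items()
    }
    return f_single, f_pair
```

    A pair of columns whose joint frequencies differ strongly from the product of their single-site frequencies is a candidate for coupling, although separating direct from indirect correlations is exactly what the inferred model is needed for.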

  6. Circular Helix-Like Curve: An Effective Tool of Biological Sequence Analysis and Comparison

    PubMed Central

    Li, Yushuang

    2016-01-01

    This paper constructed a novel injection from a DNA sequence to a 3D graph, named circular helix-like curve (CHC). The presented graphical representation is available for visualizing characterizations of a single DNA sequence and identifying similarities and differences among several DNAs. A 12-dimensional vector extracted from CHC, as a numerical characterization of CHC, was applied to analyze phylogenetic relationships of 11 species, 74 ribosomal RNAs, 48 Hepatitis E viruses, and 18 eutherian mammals, respectively. Successful experiments illustrated that CHC is an effective tool of biological sequence analysis and comparison. PMID:27403205

  7. Circular Helix-Like Curve: An Effective Tool of Biological Sequence Analysis and Comparison.

    PubMed

    Li, Yushuang; Xiao, Wenli

    2016-01-01

    This paper constructed a novel injection from a DNA sequence to a 3D graph, named circular helix-like curve (CHC). The presented graphical representation is available for visualizing characterizations of a single DNA sequence and identifying similarities and differences among several DNAs. A 12-dimensional vector extracted from CHC, as a numerical characterization of CHC, was applied to analyze phylogenetic relationships of 11 species, 74 ribosomal RNAs, 48 Hepatitis E viruses, and 18 eutherian mammals, respectively. Successful experiments illustrated that CHC is an effective tool of biological sequence analysis and comparison. PMID:27403205

  8. Characterization of DNA-protein interactions using high-throughput sequencing data from pulldown experiments

    NASA Astrophysics Data System (ADS)

    Moreland, Blythe; Oman, Kenji; Curfman, John; Yan, Pearlly; Bundschuh, Ralf

    Methyl-binding domain (MBD) protein pulldown experiments have been a valuable tool in measuring the levels of methylated CpG dinucleotides. Due to the frequent use of this technique, high-throughput sequencing data sets are available that allow a detailed quantitative characterization of the underlying interaction between methylated DNA and MBD proteins. Analyzing such data sets, we first found that two such proteins cannot bind closer to each other than 2 bp, consistent with structural models of the DNA-protein interaction. Second, the large amount of sequencing data allowed us to find rather weak but nevertheless clearly statistically significant sequence preferences for several bases around the required CpG. These results demonstrate that pulldown sequencing is a high-precision tool in characterizing DNA-protein interactions. This material is based upon work supported by the National Science Foundation under Grant No. DMR-1410172.

  9. Genetic analysis of the DNA recognition sequence of the P2 Cox protein.

    PubMed Central

    Cores de Vries, G; Wu, X S; Haggård-Ljungquist, E

    1991-01-01

    The Cox protein of temperate Escherichia coli phage P2 is involved in three important biological processes: (i) excision of the integrated prophage genome (G. Lindahl and M. Sunshine, Virology 49:180-187, 1972), (ii) transcriptional repression of the P2 Pc promoter, which controls the expression of the immunity repressor C and the integrase (S. Saha, E. Haggård-Ljungquist, and K. Nordström, EMBO J. 6:3191-3199, 1987), and (iii) transcriptional activation of the late PII promoter of the unrelated satellite phage P4 (S. Saha, E. Haggård-Ljungquist, and K. Nordström, Proc. Natl. Acad. Sci. USA 86:3973-3977, 1989). A comparison of the DNA regions protected by Cox from DNaseI degradation has revealed a presumptive Cox recognition sequence (Saha et al., Proc. Natl. Acad. Sci. USA). The binding region of Cox in the P2 Pc promoter contains three presumptive recognition sequences, "Cox boxes," located in tandem. P2 vir3 and P2 vir24 are virulent deletion mutants unable to plate on Cox-producing strains, most likely because the deletions locate the new early promoters too close to the Cox-binding region (Saha et al., EMBO J.). In this report, spontaneous P2 vir3 and vir24 mutants, no longer sensitive to repression by the Cox protein, have been isolated. These mutants plate with equal efficiency on strains with or without a Cox-producing plasmid, and they have been named cor for cox resistance. Three types are recognized; the four P2 vir3 cor mutants have a 1-base deletion in the first Cox box, while the P2 vir24 cor mutants were of two types; four have a base substitution in the first Cox box, and one has a base substitution in the second Cox box. The effect of the Cox protein on the mutated P2 vir3 and vir24 promoters was analyzed in vivo by using fusions to a promoterless cat (chloramphenicol acetyltransferase) gene. The activities of the P2 vir3 and vir24 early promoters, as opposed to the wild-type early Pe promoter, are drastically reduced by the Cox protein, and

  10. A comparative study of Whi5 and retinoblastoma proteins: from sequence and structure analysis to intracellular networks

    PubMed Central

    Hasan, Md Mehedi; Brocca, Stefania; Sacco, Elena; Spinelli, Michela; Papaleo, Elena; Lambrughi, Matteo; Alberghina, Lilia; Vanoni, Marco

    2014-01-01

    Cell growth and proliferation require a complex series of tightly regulated and well-orchestrated events. Accordingly, proteins governing such events are evolutionarily conserved, even among distant organisms. By contrast, it is more unusual for “core functions” to be carried out by functionally analogous proteins that are not homologous and do not share any kind of structural similarity. This is the case for the proteins regulating the G1/S transition in higher eukaryotes (the retinoblastoma tumor suppressor, Rb) and in budding yeast (Whi5). The interaction landscape of Rb and Whi5 is quite large, with more than one hundred proteins interacting either genetically or physically with each protein. The Whi5 interactome has been used to construct a concept map of Whi5 function and regulation. Comparison of the physical and genetic interactors of Rb and Whi5 highlights a significant core of conserved, common functionalities associated with the interactors, indicating that the structure and function of the network, rather than of individual proteins, are conserved during evolution. A combined bioinformatics and biochemical approach has shown that the whole Whi5 protein is highly disordered, except for a small region containing the protein family signature. The comparison with Whi5 homologs from Saccharomycetales has prompted the hypothesis of a modular organization of structural disorder, with the most evolutionarily conserved regions alternating with highly variable ones. The finding of a consensus sequence points to the conservation of a specific phosphorylation rhythm along with two disordered sequence motifs, probably acting as phosphorylation-dependent seeds in Whi5 folding/unfolding. Thus, the widely disordered Whi5 appears to act as a hierarchical “date hub” that has evolutionarily tested an original form of modular organization before being supplanted by the globular, multi-domain Rb, which is better suited to the role of a “party hub”. PMID

  11. Fast computational methods for predicting protein structure from primary amino acid sequence

    DOEpatents

    Agarwal, Pratul Kumar

    2011-07-19

    The present invention provides a method utilizing primary amino acid sequence of a protein, energy minimization, molecular dynamics and protein vibrational modes to predict three-dimensional structure of a protein. The present invention also determines possible intermediates in the protein folding pathway. The present invention has important applications to the design of novel drugs as well as protein engineering. The present invention predicts the three-dimensional structure of a protein independent of size of the protein, overcoming a significant limitation in the prior art.

  12. Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases.

    PubMed

    Henzel, W J; Billeci, T M; Stults, J T; Wong, S C; Grimley, C; Watanabe, C

    1993-06-01

    A rapid method for the identification of known proteins separated by two-dimensional gel electrophoresis is described in which molecular masses of peptide fragments are used to search a protein sequence database. The peptides are generated by in situ reduction, alkylation, and tryptic digestion of proteins electroblotted from two-dimensional gels. Masses are determined at the subpicomole level by matrix-assisted laser desorption/ionization mass spectrometry of the unfractionated digest. A computer program has been developed that searches the protein sequence database for multiple peptides of individual proteins that match the measured masses. To ensure that the most recent database updates are included, a theoretical digest of the entire database is generated each time the program is executed. This method facilitates simultaneous processing of a large number of two-dimensional gel spots. The method was applied to a two-dimensional gel of a crude Escherichia coli extract that was electroblotted onto poly(vinylidene difluoride) membrane. Ten randomly chosen spots were analyzed. With as few as three peptide masses, each protein was uniquely identified from over 91,000 protein sequences. All identifications were verified by concurrent N-terminal sequencing of identical spots from a second blot. One of the spots contained an N-terminally blocked protein that required enzymatic cleavage, peptide separation, and Edman degradation for confirmation of its identity.
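
    The search described in this abstract (theoretical digest of every database entry, then matching of measured peptide masses) can be illustrated with a minimal Python sketch. It is not the authors' program: the function names, the toy database, the 0.5 Da tolerance, and the simple "count of matched masses" score are assumptions made for illustration, and the sketch ignores cysteine alkylation and missed cleavages. Measured values are assumed to be neutral monoisotopic peptide masses.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[res] for res in peptide) + WATER


def count_matches(measured, sequence, tol=0.5):
    """Number of measured masses within tol (Da) of any theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in measured)


def rank_database(measured, database, tol=0.5):
    """Digest every entry on the fly and rank entries by matched masses."""
    scores = {name: count_matches(measured, seq, tol)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-entry database and a simulated measurement of protA.
toy_db = {"protA": "MKTAYIAKQR", "protB": "GGSLKPWERAAK"}
measured = [peptide_mass(p) for p in tryptic_digest(toy_db["protA"])]
ranking = rank_database(measured, toy_db)
```

    Digesting the database at query time, as in `rank_database`, mirrors the abstract's point that a fresh theoretical digest picks up the latest database updates; a production tool would more likely cache or index the digest.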

  13. Applications for protein sequence-function evolution data: mRNA/protein expression analysis and coding SNP scoring tools.

    PubMed

    Thomas, Paul D; Kejariwal, Anish; Guo, Nan; Mi, Huaiyu; Campbell, Michael J; Muruganujan, Anushya; Lazareva-Ulitsky, Betty

    2006-07-01

    The vast amount of protein sequence data now available, together with accumulating experimental knowledge of protein function, enables modeling of protein sequence and function evolution. The PANTHER database was designed to model evolutionary sequence-function relationships on a large scale. There are a number of applications for these data, and we have implemented web services that address three of them. The first is a protein classification service. Proteins can be classified, using only their amino acid sequences, to evolutionary groups at both the family and subfamily levels. Specific subfamilies, and often families, are further classified when possible according to their functions, including molecular function and the biological processes and pathways they participate in. The second application, then, is an expression data analysis service, where functional classification information can help find biological patterns in the data obtained from genome-wide experiments. The third application is a coding single-nucleotide polymorphism scoring service. In this case, information about evolutionarily related proteins is used to assess the likelihood of a deleterious effect on protein function arising from a single substitution at a specific amino acid position in the protein. All three web services are available at http://www.pantherdb.org/tools.

  14. Structure- and Sequence-Based Function Prediction for Non-Homologous Proteins

    PubMed Central

    Sael, Lee; Chitale, Meghana; Kihara, Daisuke

    2012-01-01

    The structural genomics projects have been accumulating an increasing number of protein structures, many of which remain functionally unknown. In parallel with experimental methods, computational methods are expected to make a significant contribution to the functional elucidation of such proteins. However, conventional computational methods that transfer functions from homologous proteins do not help much for these uncharacterized protein structures because they do not have apparent structural or sequence similarity with known proteins. Here, we briefly review two avenues of computational function prediction, i.e. structure-based methods and sequence-based methods. The focus is on our recent development of local structure-based methods and sequence-based methods, which can effectively extract function information from distantly related proteins. Two structure-based methods, Pocket-Surfer and Patch-Surfer, identify similar known ligand binding sites for pocket regions in a query protein without using global protein fold similarity information. Two sequence-based methods, PFP and ESG, make use of weakly similar sequences that are conventionally discarded in homology-based function annotation. We hope that computational methods, combined with experimental methods, will make a leading contribution to the functional elucidation of protein structures. PMID:22270458

  15. mRNA sequence of three respiratory syncytial virus genes encoding two nonstructural proteins and a 22K structural protein.

    PubMed Central

    Elango, N; Satake, M; Venkatesan, S

    1985-01-01

    An mRNA sequence of two human respiratory syncytial viral nonstructural protein genes and of a gene for a 22,000-molecular-weight (22K) protein was obtained by cDNA cloning and DNA sequencing. Sequences corresponding to the 5' ends of the respective transcripts were deduced directly by primer extension and dideoxy nucleotide sequencing of the mRNAs. The availability of a bicistronic clone (pRSC6) confirmed the gene order for this portion of the genome. Contrary to other unsegmented negative-stranded RNA viruses, a 19-nucleotide intercistronic sequence was present between the NS1 and NS2 genes. The translation of cloned viral sequences in the bicistronic and monocistronic clones (pRSNS1 and pRSNS2) revealed two moderately hydrophobic proteins of 15,568 and 14,703 daltons. Their similarity in molecular size explained our earlier inability to resolve these proteins. A DNA sequence of an additional recombinant plasmid (pRSA2) revealed a long open reading frame encoding a 22,156-dalton protein containing 194 amino acids. It was relatively basic and moderately hydrophobic. A protein of this size was readily translated in vitro from a viral mRNA hybrid selected by this plasmid and corresponded to an unglycosylated 22K protein seen in purified extracellular virus but not associated with detergent- and salt-resistant cores. A second open reading frame of 90 amino acids partially overlapping with the C terminus of the 22K protein was also present within this sequence. This was reminiscent of the viral matrix protein gene which was previously shown by us to contain two overlapping reading frames. The finding of three additional viral transcripts encoding at least three identifiable proteins in human respiratory syncytial virus was a novel departure from the usual genetic organization of paramyxoviruses. The 5' ends of all three transcripts had a 5'NGGGCAAAU sequence that is common to all viral transcripts analyzed so far. Although there was no obvious homology immediately

  16. Protein meta-functional signatures from combining sequence, structure, evolution, and amino acid property information.

    PubMed

    Wang, Kai; Horst, Jeremy A; Cheng, Gong; Nickle, David C; Samudrala, Ram

    2008-09-26

    Protein function is mediated by different amino acid residues, both their positions and types, in a protein sequence. Some amino acids are responsible for the stability or overall shape of the protein, playing an indirect role in protein function. Others play a functionally important role as part of active or binding sites of the protein. For a given protein sequence, the residues and their degree of functional importance can be thought of as a signature representing the function of the protein. We have developed a combination of knowledge- and biophysics-based function prediction approaches to elucidate the relationships between the structural and the functional roles of individual residues and positions. Such a meta-functional signature (MFS), which is a collection of continuous values representing the functional significance of each residue in a protein, may be used to study proteins of known function in greater detail and to aid in experimental characterization of proteins of unknown function. We demonstrate the superior performance of MFS in predicting protein functional sites and also present four real-world examples to apply MFS in a wide range of settings to elucidate protein sequence-structure-function relationships. Our results indicate that the MFS approach, which can combine multiple sources of information and also give biological interpretation to each component, greatly facilitates the understanding and characterization of protein function.

  17. The cleavable pre-sequence of an imported chloroplast protein directs attached polypeptides into yeast mitochondria

    PubMed Central

    Hurt, Eduard C.; Soltanifar, Nouchine; Goldschmidt-Clermont, Michel; Rochaix, Jean-David; Schatz, Gottfried

    1986-01-01

    The cleavable pre-sequences of imported chloroplast and mitochondrial proteins have several features in common. This structural similarity prompted us to test whether a chloroplast pre-sequence ('transit peptide') can also be decoded by the mitochondrial import machinery. In the green alga, Chlamydomonas reinhardtii, the small subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) (a chloroplast protein) is nuclear-encoded and synthesized in the cytosol with a transient pre-sequence of 45 residues. The 31 amino-terminal residues of this chloroplast pre-sequence were fused to mouse dihydrofolate reductase (a cytosolic protein) and to yeast cytochrome oxidase subunit IV (an imported mitochondrial protein) from which the authentic pre-sequence had been removed. The chloroplast pre-sequence transported both attached proteins into the yeast mitochondrial matrix or inner membrane, although it functioned less efficiently than an authentic mitochondrial pre-sequence. We conclude that mitochondrial and chloroplast pre-sequences perform their function by a similar mechanism. PMID:16453686

  19. Contributions of the Prion Protein Sequence, Strain, and Environment to the Species Barrier.

    PubMed

    Sharma, Aditi; Bruce, Kathryn L; Chen, Buxin; Gyoneva, Stefka; Behrens, Sven H; Bommarius, Andreas S; Chernoff, Yury O

    2016-01-15

    Amyloid propagation requires high levels of sequence specificity so that only molecules with very high sequence identity can form cross-β-sheet structures of sufficient stringency for incorporation into the amyloid fibril. This sequence specificity presents a barrier to the transmission of prions between two species with divergent sequences, termed a species barrier. Here we study the relative effects of protein sequence, seed conformation, and environment on the species barrier strength and specificity for the yeast prion protein Sup35p from three closely related species of the Saccharomyces sensu stricto group; namely, Saccharomyces cerevisiae, Saccharomyces bayanus, and Saccharomyces paradoxus. Through in vivo plasmid shuffle experiments, we show that the major characteristics of the transmission barrier and conformational fidelity are determined by the protein sequence rather than by the cellular environment. In vitro data confirm that the kinetics and structural preferences of aggregation of the S. paradoxus and S. bayanus proteins are influenced by anions in accordance with their positions in the Hofmeister series, as observed previously for S. cerevisiae. However, the specificity of the species barrier is primarily affected by the sequence and the type of anion present during the formation of the initial seed, whereas anions present during the seeded aggregation process typically influence kinetics rather than the specificity of prion conversion. Therefore, our work shows that the protein sequence and the conformation variant (strain) of the prion seed are the primary determinants of cross-species prion specificity both in vivo and in vitro. PMID:26565023

  20. CRASP: a program for analysis of coordinated substitutions in multiple alignments of protein sequences.

    PubMed

    Afonnikov, Dmitry A; Kolchanov, Nikolay A

    2004-07-01

    Recent results suggest that during evolution certain substitutions at protein sites may occur in a coordinated manner due to interactions between amino acid residues. Information on these coordinated substitutions may be useful for analysis of protein structure and function. CRASP is an Internet-available software tool for the detection and analysis of coordinated substitutions in multiple alignments of protein sequences. The approach is based on estimation of the correlation coefficient between the values of a physicochemical parameter at a pair of positions of sequence alignment. The program enables the user to detect and analyze pairwise relationships between amino acid substitutions at protein sequence positions, estimate the contribution of the coordinated substitutions to the evolutionary invariance or variability in integral protein physicochemical characteristics such as the net charge of protein residues and hydrophobic core volume. The CRASP program is available at http://wwwmgs.bionet.nsc.ru/mgs/programs/crasp/.
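
    The core idea in this abstract, a correlation coefficient between the values of a physicochemical parameter at two alignment positions, can be sketched in a few lines of Python. This is an illustration only and not the CRASP implementation: the choice of the Kyte-Doolittle hydropathy scale, the 0.8 threshold, the function names, and the toy alignment are all assumptions, and the sketch expects a gap-free alignment and applies no statistical or phylogenetic correction.

```python
import math

# Kyte-Doolittle hydropathy values, used here as the example parameter.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def column_values(alignment, pos, scale=KD):
    """Property value of the residue at column pos in every sequence."""
    return [scale[seq[pos]] for seq in alignment]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either column is invariant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def coordinated_pairs(alignment, threshold=0.8):
    """Column pairs (i, j, r) whose |r| reaches the threshold."""
    length = len(alignment[0])
    hits = []
    for i in range(length):
        values_i = column_values(alignment, i)
        for j in range(i + 1, length):
            r = pearson(values_i, column_values(alignment, j))
            if abs(r) >= threshold:
                hits.append((i, j, r))
    return hits


# Toy alignment: column 0 is invariant; columns 1 and 2 swap
# hydrophobic and charged residues in a compensatory way.
toy_alignment = ["GIK", "GLK", "GKI", "GEL"]
hits = coordinated_pairs(toy_alignment)
```

    In the toy alignment only columns 1 and 2 are reported, with a strongly negative coefficient, which is the kind of compensatory pattern the abstract calls a coordinated substitution.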

  1. Using the Relevance Vector Machine Model Combined with Local Phase Quantization to Predict Protein-Protein Interactions from Protein Sequences

    PubMed Central

    An, Ji-Yong; Meng, Fan-Rong; You, Zhu-Hong; Fang, Yu-Hong; Zhao, Yu-Jun; Zhang, Ming

    2016-01-01

    We propose a novel computational method known as RVM-LPQ that combines the Relevance Vector Machine (RVM) model and Local Phase Quantization (LPQ) to predict PPIs from protein sequences. The main improvements are the results of representing protein sequences using the LPQ feature representation on a Position Specific Scoring Matrix (PSSM), reducing the influence of noise using a Principal Component Analysis (PCA), and using a Relevance Vector Machine (RVM) based classifier. We perform 5-fold cross-validation experiments on Yeast and Human datasets, and we achieve very high accuracies of 92.65% and 97.62%, respectively, which is significantly better than previous works. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the Yeast dataset. The experimental results demonstrate that our RVM-LPQ method is obviously better than the SVM-based method. The promising experimental results show the efficiency and simplicity of the proposed method, which can be an automatic decision support tool for future proteomics research. PMID:27314023

  2. Complete mitochondrial DNA sequence of the yellowfin seabream Acanthopagrus latus and a genomic comparison among closely related sparid species.

    PubMed

    Xia, Junhong; Xia, Kuaifei; Jiang, Shigui

    2008-08-01

    The complete mitochondrial genome of the yellowfin seabream Acanthopagrus latus was determined in the present study. The genome was 16,609 bp in length and contained 37 genes (2 ribosomal RNA, 22 transfer RNA and 13 protein-coding genes) and the control region (CR), with the content and order of genes being similar to those in typical teleosts. Comparisons of the 37 genes and the CR among species indicate that the CR was the most divergent (0.3341), whereas tRNA(Gly) had the lowest genetic variation (0.0542). Analysis of 212 Cyt b sequences showed that p-distances at the interspecies level [mean = 0.1559, standard deviation (SD) = 0.0235; n = 1653] were, with high frequency (99.4%), much greater than those at the intraspecies level (mean = 0.0098, SD = 0.0090; n = 20), suggesting that the Cyt b gene is conserved within Sparidae species and supporting the validity of Cyt b sequence data as a barcode for Sparidae species identification. Phylogenetic analysis using amino acid sequences of the 13 protein-coding genes indicated that the genus Pagrus was not monophyletic, showing the need to re-evaluate the morphological characteristics of Pagrus fishes.
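
    The p-distance reported in this abstract is the proportion of aligned sites at which two sequences differ. A minimal Python sketch of that calculation, and of splitting pairwise distances into intraspecies and interspecies groups, is given below. The function names, the rule of skipping gapped sites, and the toy sequences are assumptions for illustration and are not taken from the study.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring sites gapped in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)


def split_distances(labelled_seqs):
    """Pairwise p-distances grouped into (intraspecies, interspecies) lists.

    labelled_seqs is a list of (species_name, aligned_sequence) tuples.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(labelled_seqs, 2):
        target = intra if sp1 == sp2 else inter
        target.append(p_distance(s1, s2))
    return intra, inter


# Hypothetical aligned fragments from two species.
toy = [("A", "ACGTACGT"), ("A", "ACGTACGA"), ("B", "TCGAACTT")]
intra, inter = split_distances(toy)
```

    A gap between the largest intraspecies distance and the smallest interspecies distance, as in this toy example, is the pattern that the abstract interprets as support for barcoding.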

  3. Comparisons of the Distribution of Nucleotides and Common Sequences in Deoxyribonucleic Acid from Selected Bacteriophages

    PubMed Central

    Skalka, A.; Hanson, P.

    1972-01-01

    Results from comparisons of deoxyribonucleic acid (DNA) from several classes of bacteriophages suggest that most phage chromosomes contain either a homogeneous distribution of nucleotides or are made up of a few, rather large segments of different guanine plus cytosine (G + C) contents which are internally homogeneous. Among those temperate phages tested, most contained segmented DNA. Comparisons of sequence similarities among segments from lambdoid phage DNA species revealed the following order in relatedness to λ: 82 (and 434) > 21 > 424 > φ80. Most common sequences are found in the highest G + C segments, which in λ contain head and tail genes. Hybridization tests with λ and 186 or P2 DNA species verified that the lambdoids and 186 and P2 belong to two distinct groups. There are fewer homologous sequences between the DNA species of coliphages λ and P2 or 186 than there are between the DNA species of coliphage λ and salmonella phage P22. PMID:4553679

  4. Beyond Linear Sequence Comparisons: The use of genome-level characters for phylogenetic reconstruction

    SciTech Connect

    Boore, Jeffrey L.

    2004-11-27

    Although the phylogenetic relationships of many organisms have been convincingly resolved by the comparisons of nucleotide or amino acid sequences, others have remained equivocal despite great effort. Now that large-scale genome sequencing projects are sampling many lineages, it is becoming feasible to compare large data sets of genome-level features and to develop this as a tool for phylogenetic reconstruction that has advantages over conventional sequence comparisons. Although it is unlikely that these will address a large number of evolutionary branch points across the broad tree of life due to the infeasibility of such sampling, they have great potential for convincingly resolving many critical, contested relationships for which no other data seems promising. However, it is important that we recognize potential pitfalls, establish reasonable standards for acceptance, and employ rigorous methodology to guard against a return to earlier days of scenario-driven evolutionary reconstructions.

  5. Novel method for identifying sequence-specific DNA-binding proteins.

    PubMed Central

    Levens, D; Howley, P M

    1985-01-01

    We developed a general method for the enrichment and identification of sequence-specific DNA-binding proteins. A well-characterized protein-DNA interaction is used to isolate from crude cellular extracts or fractions thereof proteins which bind to specific DNA sequences; the method is based solely on this binding property of the proteins. The DNA sequence of interest, cloned adjacent to the lac operator DNA segment, is incubated with a lac repressor-beta-galactosidase fusion protein which retains full operator and inducer binding properties. The DNA fragment bound to the lac repressor-beta-galactosidase fusion protein is precipitated by the addition of affinity-purified anti-beta-galactosidase immobilized on beads. This forms an affinity matrix for any proteins which might interact specifically with the DNA sequence cloned adjacent to the lac operator. When incubated with cellular extracts in the presence of excess competitor DNA, any protein(s) which specifically binds to the cloned DNA sequence of interest can be cleanly precipitated. When isopropyl-beta-D-thiogalactopyranoside is added, the lac repressor releases the bound DNA, and thus the protein-DNA complex consisting of the specific restriction fragment and any specific binding protein(s) is released, permitting the identification of the protein by standard biochemical techniques. We demonstrate the utility of this method with the lambda repressor, another well-characterized DNA-binding protein, as a model. In addition, with crude preparations of the yeast mitochondrial RNA polymerase, we identified a 70,000-molecular-weight peptide which binds specifically to the promoter region of the yeast mitochondrial 14S rRNA gene. PMID:3016526

  6. eMatchSite: Sequence Order-Independent Structure Alignments of Ligand Binding Pockets in Protein Models

    PubMed Central

    Brylinski, Michal

    2014-01-01

    Detecting similarities between ligand binding sites in the absence of global homology between target proteins has been recognized as one of the critical components of modern drug discovery. Local binding site alignments can be constructed using sequence order-independent techniques; however, to achieve high accuracy, many current algorithms for binding site comparison require high-quality experimental protein structures, preferably in the bound conformational state. This, in turn, complicates proteome-scale applications, where only structure models of varying quality are available for the majority of gene products. To improve the state of the art, we developed eMatchSite, a new method for constructing sequence order-independent alignments of ligand binding sites in protein models. Large-scale benchmarking calculations using adenine-binding pockets in crystal structures demonstrate that eMatchSite generates accurate alignments for almost three times more protein pairs than SOIPPA. More importantly, eMatchSite offers a high tolerance to structural distortions in ligand binding regions in protein models. For example, the percentage of correctly aligned pairs of adenine-binding sites in weakly homologous protein models is only 4–9% lower than that of pairs aligned using crystal structures. This represents a significant improvement over other algorithms, e.g. the performance of eMatchSite in recognizing similar binding sites is 6% and 13% higher than that of SiteEngine using high- and moderate-quality protein models, respectively. Constructing biologically correct alignments using predicted ligand binding sites in protein models opens up the possibility to investigate drug-protein interaction networks for complete proteomes with prospective systems-level applications in polypharmacology and rational drug repositioning. eMatchSite is freely available to the academic community as a web-server and a stand-alone software distribution at http://www.brylinski.org/ematchsite. PMID

  7. Prediction of Spontaneous Protein Deamidation from Sequence-Derived Secondary Structure and Intrinsic Disorder

    PubMed Central

    Lorenzo, J. Ramiro; Alonso, Leonardo G.; Sánchez, Ignacio E.

    2015-01-01

    Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage “Protein and nucleic acid structure and sequence analysis”. PMID:26674530

  8. Prediction of Spontaneous Protein Deamidation from Sequence-Derived Secondary Structure and Intrinsic Disorder.

    PubMed

    Lorenzo, J Ramiro; Alonso, Leonardo G; Sánchez, Ignacio E

    2015-01-01

    Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage "Protein and nucleic acid structure and sequence analysis".

  9. Detection of Weakly Conserved Ancestral Mammalian Regulatory Sequences by Primate Comparisons

    SciTech Connect

    Wang, Qian-fei; Prabhakar, Shyam; Chanan, Sumita; Cheng, Jan-Fang; Rubin, Edward M.; Boffelli, Dario

    2006-06-01

    Genomic comparisons between human and distant, non-primate mammals are commonly used to identify cis-regulatory elements based on constrained sequence evolution. However, these methods fail to detect cryptic functional elements, which are too weakly conserved among mammals to distinguish from nonfunctional DNA. To address this problem, we explored the potential of deep intra-primate sequence comparisons. We sequenced the orthologs of 558 kb of human genomic sequence, covering multiple loci involved in cholesterol homeostasis, in 6 nonhuman primates. Our analysis identified 6 noncoding DNA elements displaying significant conservation among primates, but undetectable in more distant comparisons. In vitro and in vivo tests revealed that at least three of these 6 elements have regulatory function. Notably, the mouse orthologs of these three functional human sequences had regulatory activity despite their lack of significant sequence conservation, indicating that they are cryptic ancestral cis-regulatory elements. These regulatory elements could still be detected in a smaller set of three primate species including human, rhesus and marmoset. Since the human and rhesus genome sequences are already available, and the marmoset genome is actively being sequenced, the primate-specific conservation analysis described here can be applied in the near future on a whole-genome scale, to complement the annotation provided by more distant species comparisons.

  10. Sequence comparison alignment-free approach based on suffix tree and L-words frequency.

    PubMed

    Soares, Inês; Goios, Ana; Amorim, António

    2012-01-01

    The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions on evolutionary history and is sometimes very difficult or impossible to perform due to the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts by computing a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words of a preset length L (L-words) in each sequence is rapidly calculated. Based on the L-word frequency profile of each sequence, a pairwise standard Euclidean distance is then computed, producing a symmetric genetic distance matrix, which can be used to generate a neighbor-joining dendrogram or a multidimensional scaling graph. We present an improvement to word-counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures with the word-counting tasks. Our approach is, thus, a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python and is freely available on the web.
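
    The distance calculation described in this abstract can be sketched in Python as follows. Note the substitution: the sketch counts words with a plain sliding window rather than the generalized suffix tree the authors describe, so it illustrates the L-word profile and Euclidean distance steps only, not their linear-time construction or their choice of optimal word length. The function names, the DNA alphabet default, and the toy sequences are assumptions.

```python
import math
from collections import Counter
from itertools import product


def lword_profile(sequence, L, alphabet="ACGT"):
    """Relative frequency of every possible L-word, in a fixed word order.

    Windows containing characters outside the alphabet are not reported
    in the profile, but still count towards the number of windows.
    """
    n_windows = len(sequence) - L + 1
    counts = Counter(sequence[i:i + L] for i in range(n_windows))
    total = max(n_windows, 1)
    words = ("".join(w) for w in product(alphabet, repeat=L))
    return [counts[w] / total for w in words]


def euclidean(p, q):
    """Standard Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(sequences, L):
    """Symmetric matrix of pairwise distances between L-word profiles."""
    profiles = [lword_profile(s, L) for s in sequences]
    return [[euclidean(p, q) for q in profiles] for p in profiles]


# Hypothetical sequences: the first two differ at one site only.
toy_seqs = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
matrix = distance_matrix(toy_seqs, 2)
```

    The resulting matrix could then be passed to any neighbor-joining or multidimensional scaling routine; no alignment of the input sequences is needed at any point.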

  11. A second rhodopsin-like protein in Cyanophora paradoxa: gene sequence and protein expression in a cell-free system.

    PubMed

    Frassanito, Anna Maria; Barsanti, Laura; Passarelli, Vincenzo; Evangelista, Valtere; Gualtieri, Paolo

    2013-08-01

    Here we report the identification and expression of a second rhodopsin-like protein in the alga Cyanophora paradoxa (Glaucophyta), named Cyanophopsin_2. This new protein was identified serendipitously: the RACE reaction performed to complete the sequence of Cyanophopsin_1 (the first rhodopsin-like protein of C. paradoxa, identified in 2009 by our group) amplified a 619 bp sequence corresponding to a portion of a new gene of the same protein family. The full sequence is 1175 bp long, comprising an 849 bp coding DNA sequence and 4 introns totaling 326 bp. The protein is characterized by an N-terminal region of 47 amino acids, followed by a region of 213 amino acids with 7 α-helices and a C-terminal region of 22 amino acids. This protein showed high identity with Cyanophopsin_1 and other rhodopsin-like proteins of Archaea, Bacteria, Fungi and Algae. Cyanophopsin_2 (CpR2) was expressed in a cell-free expression system and characterized by means of absorption spectroscopy. PMID:23851421

  12. PROMOT: a FORTRAN program to scan protein sequences against a library of known motifs.

    PubMed

    Sternberg, M J

    1991-04-01

    Information about the three-dimensional structure or function of a newly determined protein sequence can be obtained if the protein is found to contain a characterized motif or pattern of residues. Recently a database (PROSITE) has been established that contains 337 known motifs encoded as a list of allowed residue types at specific positions along the sequence. PROMOT is a FORTRAN computer program that takes a protein sequence and examines if it contains any of the motifs in PROSITE. The program also extends the definitions of patterns beyond those used in PROSITE to provide a simple, yet flexible, method to scan either a PROSITE or a user-defined pattern against a protein sequence database.
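
    Scanning a sequence against a pattern given as allowed residue types at specific positions can be sketched as follows. This is not PROMOT (which is a FORTRAN program); it is a minimal Python illustration that translates PROSITE-style pattern syntax into a regular expression. It handles only the basic elements (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` anchors), and the function names and the test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        # "[ABC]" and single residues are already valid regex fragments.
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # An N-glycosylation-style pattern scanned against a made-up sequence.
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(scan("MNASAANPSQ", "N-{P}-[ST]-{P}"))
```

    Wrapping the expression in a lookahead is a design choice that reports overlapping matches, which a plain left-to-right regex search would skip.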

  13. Sequence-related human proteins cluster by degree of evolutionary conservation.

    PubMed

    Mrowka, Ralf; Patzak, Andreas; Herzel, Hanspeter; Holste, Dirk

    2004-11-01

    Gene duplication followed by adaptive evolution is thought to be a central mechanism for the emergence of novel genes. To illuminate the contribution of duplicated protein-coding sequences to the complexity of the human genome, we study the connectivity of pairwise sequence-related human proteins and construct a network (N) of linked protein sequences with shared similarities. We find that (i) the connectivity distribution P(k) for k sequence-related proteins decays as a power law P(k) ~ k^(-γ) with γ ≈ 1.2, (ii) the top rank of N consists of a single large cluster of proteins (≈70%), while bottom ranks consist of multiple isolated clusters, and (iii) structural characteristics of N show both a high degree of clustering and an intermediate connectivity ("small-world" features). We gain further insight into structural properties of N by studying the relationship between the connectivity distribution and the phylogenetic conservation of proteins in bacteria, plants, invertebrates, and vertebrates. We find that (iv) the proportion of sequence-related proteins increases with increasing extent of evolutionary conservation. Our results support that small-world network properties constitute a footprint of an evolutionary mechanism and extend the traditional interpretation of protein families.
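
    The two network quantities mentioned in this abstract, the connectivity distribution P(k) and the division of the network into clusters, can be computed from a list of pairwise links as sketched below. This is an illustration only, not the authors' pipeline: the edge list is a toy example, "cluster" is taken to mean a connected component, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def adjacency(edges):
    """Build an undirected neighbour map from (protein, protein) links."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return dict(neighbours)


def degree_distribution(edges):
    """Return {k: P(k)}, the fraction of proteins with k sequence-related partners."""
    neighbours = adjacency(edges)
    counts = Counter(len(v) for v in neighbours.values())
    return {k: c / len(neighbours) for k, c in sorted(counts.items())}


def clusters(edges):
    """Return connected components as sets of proteins, largest first."""
    neighbours = adjacency(edges)
    seen, result = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return sorted(result, key=len, reverse=True)


if __name__ == "__main__":
    links = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F")]
    print(degree_distribution(links))
    comps = clusters(links)
    print(len(comps[0]) / sum(len(c) for c in comps))  # share in largest cluster
```

    An exponent such as γ would then be estimated from P(k) on a log-log scale; that fitting step is omitted here.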

  14. Sequence-related human proteins cluster by degree of evolutionary conservation

    NASA Astrophysics Data System (ADS)

    Mrowka, Ralf; Patzak, Andreas; Herzel, Hanspeter; Holste, Dirk

    2004-11-01

    Gene duplication followed by adaptive evolution is thought to be a central mechanism for the emergence of novel genes. To illuminate the contribution of duplicated protein-coding sequences to the complexity of the human genome, we study the connectivity of pairwise sequence-related human proteins and construct a network (N) of linked protein sequences with shared similarities. We find that (i) the connectivity distribution P(k) for k sequence-related proteins decays as a power law P(k) ~ k^(-γ) with γ ≈ 1.2, (ii) the top rank of N consists of a single large cluster of proteins (≈70%), while bottom ranks consist of multiple isolated clusters, and (iii) structural characteristics of N show both a high degree of clustering and an intermediate connectivity (“small-world” features). We gain further insight into structural properties of N by studying the relationship between the connectivity distribution and the phylogenetic conservation of proteins in bacteria, plants, invertebrates, and vertebrates. We find that (iv) the proportion of sequence-related proteins increases with increasing extent of evolutionary conservation. Our results support that small-world network properties constitute a footprint of an evolutionary mechanism and extend the traditional interpretation of protein families.

  15. The value of short amino acid sequence matches for prediction of protein allergenicity.

    PubMed

    Silvanovich, Andre; Nemeth, Margaret A; Song, Ping; Herman, Rod; Tagliani, Laura; Bannon, Gary A

    2006-03-01

    Typically, genetically engineered crops contain traits encoded by one or a few newly expressed proteins. The allergenicity assessment of newly expressed proteins is an important component in the safety evaluation of genetically engineered plants. One aspect of this assessment involves sequence searches that compare the amino acid sequence of the protein to all known allergens. Analyses are performed to determine the potential for immunologically based cross-reactivity where IgE directed against a known allergen could bind to the protein and elicit a clinical reaction in sensitized individuals. Bioinformatic searches are designed to detect global sequence similarity and short contiguous amino acid sequence identity. It has been suggested that potential allergen cross-reactivity may be predicted by identifying matches as short as six to eight contiguous amino acids between the protein of interest and a known allergen. A series of analyses were performed, and match probabilities were calculated for different size peptides to determine if there was a scientifically justified search window size that identified allergen sequence characteristics. Four probability modeling methods were tested: (1) a mock protein and a mock allergen database, (2) a mock protein and genuine allergen database, (3) a genuine allergen and genuine protein database, and (4) a genuine allergen and genuine protein database combined with a correction for repeating peptides. These analyses indicated that matches found by searching for short amino acid sequences of eight amino acids or fewer to identify proteins as potential cross-reactive allergens are a product of chance, and that such searches add little value to allergy assessments for newly expressed proteins.
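
    Why short exact matches are expected by chance can be seen from a back-of-the-envelope calculation. The sketch below is not one of the four modeling methods in the abstract. It assumes independent, uniformly distributed residues over 20 amino acids and treats the database as one concatenated string, so it only gives an order-of-magnitude estimate; real amino acid frequencies are non-uniform, which makes chance matches more likely. The query and database sizes are hypothetical.

```python
def expected_chance_matches(query_len, database_len, k, alphabet_size=20):
    """Expected number of exact k-residue matches arising by chance alone.

    Each of the (query_len - k + 1) query windows is compared with each of
    the (database_len - k + 1) database windows; under the uniform model a
    given pair of windows is identical with probability alphabet_size ** -k.
    """
    if query_len < k or database_len < k:
        return 0.0
    windows = (query_len - k + 1) * (database_len - k + 1)
    return windows / alphabet_size ** k


if __name__ == "__main__":
    # Hypothetical sizes: a 300-residue protein against 500,000 database residues.
    for k in (6, 7, 8):
        print(k, expected_chance_matches(300, 500_000, k))
```

    Each extra residue in the window divides the expected count by roughly 20, which is why the choice of window size dominates the outcome of such searches.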

  16. CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction

    PubMed Central

    Cui, Xuefeng; Lu, Zhiwu; Wang, Sheng; Jing-Yan Wang, Jim; Gao, Xin

    2016-01-01

    Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information. Method: We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence–structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration. Results: We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM–HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods. Availability and implementation: Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx. Contact: xin.gao@kaust.edu.sa Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27307635

  17. Divergence of function in sequence-related groups of Escherichia coli proteins.

    PubMed

    Nahum, L A; Riley, M

    2001-08-01

    The most prominent mechanism of molecular evolution is believed to have been duplication and divergence of genes. Proteins that belong to sequence-related groups in any one organism are candidates to have emerged from such a process and to share a common ancestor. Groups of proteins in Escherichia coli having sequence similarity are mostly composed of proteins with closely related function, but some groups comprise proteins with unrelated functions. In order to understand how function can change while sequences remain similar, we have examined some of these groups in detail. The enzymes analyzed in this work include representatives of amidotransferases, phosphotransferases, decarboxylases, and others. Most sequence-related groups contain enzymes that are in the same classes of Enzyme Commission (EC) numbers. We have concentrated on groups that are heterogeneous in that respect, and also on groups containing more than one enzyme of any pathway. We find that although the EC number may differ, the reaction chemistry of these sequence-related proteins is the same or very similar. Some of these families illustrate how diversification has taken place in evolution, using common features of either reaction chemistry or ligand specificity, or both, to create catalysts for different kinds of biochemical reactions. This information has relevance to the area of functional genomics in which the activities of gene products of unknown reading frames are attributed by analogy to the functions of sequence-related proteins of known function.

  18. Use of synthetic signal sequences to explore the protein export machinery.

    PubMed

    Clérico, Eugenia M; Maki, Jenny L; Gierasch, Lila M

    2008-01-01

    The information for correct localization of newly synthesized proteins in both prokaryotes and eukaryotes resides in self-contained, often transportable targeting sequences. Of these, signal sequences specify that a protein should be secreted from a cell or incorporated into the cytoplasmic membrane. A central puzzle is presented by the lack of primary structural homology among signal sequences, although they share common features in their sequences. Synthetic signal peptides have enabled a wide range of studies of how these "zipcodes" for protein secretion are decoded and used to target proteins to the protein machinery that facilitates their translocation across and integration into membranes. We review research on how the information in signal sequences enables their passenger proteins to be correctly and efficiently localized. Synthetic signal peptides have made possible binding and crosslinking studies to explore how selectivity is achieved in recognition by the signal sequence-binding receptors, signal recognition particle, or SRP, which functions in all organisms, and SecA, which functions in prokaryotes and some organelles of prokaryotic origins. While progress has been made, the absence of atomic resolution structures for complexes of signal peptides and their receptors has definitely left many questions to be answered in the future. PMID:17918185

  19. Use of Synthetic Signal Sequences to Explore the Protein Export Machinery

    PubMed Central

    Clérico, Eugenia M.; Maki, Jenny L.; Gierasch, Lila M.

    2010-01-01

    The information for correct localization of newly synthesized proteins in both prokaryotes and eukaryotes resides in self-contained, often transportable targeting sequences. Of these, signal sequences specify that a protein should be secreted from a cell or incorporated into the cytoplasmic membrane. A central puzzle is presented by the lack of primary structural homology among signal sequences, although they share common features in their sequences. Synthetic signal peptides have enabled a wide range of studies of how these “zipcodes” for protein secretion are decoded and used to target proteins to the protein machinery that facilitates their translocation across and integration into membranes. We review research on how the information in signal sequences enables their passenger proteins to be correctly and efficiently localized. Synthetic signal peptides have made possible binding and crosslinking studies to explore how selectivity is achieved in recognition by the signal sequence-binding receptors, signal recognition particle, or SRP, which functions in all organisms, and SecA, which functions in prokaryotes and some organelles of prokaryotic origins. While progress has been made, the absence of atomic resolution structures for complexes of signal peptides and their receptors has definitely left many questions to be answered in the future. PMID:17918185

  20. JiffyNet: a web-based instant protein network modeler for newly sequenced species.

    PubMed

    Kim, Eiru; Kim, Hanhae; Lee, Insuk

    2013-07-01

    Revolutionary DNA sequencing technology has enabled affordable genome sequencing for numerous species. Thousands of species already have completely decoded genomes, and tens of thousands more are in progress. Naturally, parallel expansion of the functional parts list library is anticipated, yet genome-level understanding of function also requires maps of functional relationships, such as functional protein networks. Such networks have been constructed for many sequenced species including common model organisms. Nevertheless, the majority of species with sequenced genomes still have no protein network models available. Moreover, biologists might want to obtain protein networks for their species of interest on completion of the genome projects. Therefore, there is high demand for accessible means to automatically construct genome-scale protein networks based on sequence information from genome projects only. Here, we present a public web server, JiffyNet, specifically designed to instantly construct genome-scale protein networks based on associalogs (functional associations transferred from a template network by orthology) for a query species with only protein sequences provided. Assessment of the networks by JiffyNet demonstrated generally high predictive ability for pathway annotations. Furthermore, JiffyNet provides network visualization and analysis pages for a wide variety of molecular concepts to facilitate network-guided hypothesis generation. JiffyNet is freely accessible at http://www.jiffynet.org.

  1. Homology analyses of the protein sequences of fatty acid synthases from chicken liver, rat mammary gland, and yeast

    SciTech Connect

    Chang, Soo-Ik; Hammes, G.G.

    1989-11-01

    Homology analyses of the protein sequences of chicken liver and rat mammary gland fatty acid synthases were carried out. The amino acid sequences of the chicken and rat enzymes are 67% identical. If conservative substitutions are allowed, 78% of the amino acids are matched. A region of low homology exists between the functional domains, in particular around amino acid residues 1059-1264 of the chicken enzyme. Homologies between the active sites of chicken and rat and of chicken and yeast enzymes have been analyzed by an alignment method. A high degree of homology exists between the active sites of the chicken and rat enzymes. However, the chicken and yeast enzymes show a lower degree of homology. The NADPH-binding dinucleotide folds of the β-ketoacyl reductase and the enoyl reductase sites were identified by comparison with a known consensus sequence for the NADP- and FAD-binding dinucleotide folds. The active sites of all of the enzymes are primarily in hydrophobic regions of the protein. This study suggests that the genes for the functional domains of fatty acid synthase were originally separated, and these genes were connected to each other by using different connecting nucleotide sequences in different species. An alternative explanation for the differences in rat and chicken is a common ancestry and mutations in the joining regions during evolution.
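
    The two figures quoted in this abstract (percent identical, and percent matched when conservative substitutions are allowed) are of the kind computed in the sketch below from a pair of already-aligned sequences. The grouping of conservative substitutions and the choice of denominator (columns without gaps) are assumptions made for illustration; the abstract does not state which definitions were used, and the example sequences are made up.

```python
# Hypothetical grouping of "conservative" substitutions; the abstract does not
# say which substitutions were counted as conservative.
CONSERVATIVE_GROUPS = ["ILVM", "FWY", "KRH", "DE", "NQ", "ST", "AG"]


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (% identical, % identical or conservative) over ungapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    group_of = {res: i for i, grp in enumerate(CONSERVATIVE_GROUPS) for res in grp}
    compared = identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                      # skip columns containing a gap
        compared += 1
        if a == b:
            identical += 1
            similar += 1
        elif a in group_of and group_of[a] == group_of.get(b):
            similar += 1                  # different residues, same group
    if compared == 0:
        return 0.0, 0.0
    return 100 * identical / compared, 100 * similar / compared


if __name__ == "__main__":
    print(identity_and_similarity("MKV-LD", "MRVALE"))
```

    Other conventions divide by the alignment length or by the length of the shorter sequence, so percentages from different tools are not always directly comparable.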

  2. Sequence studies on post-ecdysial cuticular proteins from pupae of the yellow mealworm, Tenebrio molitor.

    PubMed

    Baernholdt, D; Anderson, S O

    1998-07-01

    Proteins were extracted from the cuticle of mid-instar pupae of Tenebrio and purified by column chromatography. The protein pattern obtained by two-dimensional gel electrophoresis was different from that obtained from pharate pupal cuticle, indicating that Tenebrio deposits cuticular proteins during the post-ecdysial pupal period that differ from those deposited during the pre-ecdysial period. The complete amino acid sequence was determined for four of the urea-extractable proteins from Tenebrio mid-instar pupal cuticle. They range from 5.8 to 16.7 kDa in molecular weight and from 5.2 to 7.9 in isoelectric point. Little similarity was observed between the sequenced post- and pre-ecdysial cuticular proteins from Tenebrio pupae. Only one of the sequenced post-ecdysial proteins contains the Ala-Ala-Pro-Ala/Val motif common in proteins from Tenebrio larval/pupal pharate cuticle and from locust pharate cuticle. None of the post-ecdysial proteins contains the conserved hydrophilic sequence regions described for Tenebrio pharate cuticular proteins.

  3. Draft versus finished sequence data for DNA and protein diagnostic signature development

    SciTech Connect

    Gardner, S N; Lam, M W; Smith, J R; Torres, C L; Slezak, T R

    2004-10-29

    Sequencing pathogen genomes is costly, demanding careful allocation of limited sequencing resources. We built a computational Sequencing Analysis Pipeline (SAP) to guide decisions regarding the amount of genomic sequencing necessary to develop high-quality diagnostic DNA and protein signatures. SAP uses simulations to estimate the number of target genomes and close phylogenetic relatives (near neighbors, or NNs) to sequence. We use SAP to assess whether draft data is sufficient or finished sequencing is required using Marburg and variola virus sequences. Simulations indicate that intermediate to high quality draft with error rates of 10^-3 to 10^-5 (≈8x coverage) of target organisms is suitable for DNA signature prediction. Low quality draft with error rates of ≈1% (3x to 6x coverage) of target isolates is inadequate for DNA signature prediction, although low quality draft of NNs is sufficient, as long as the target genomes are of high quality. For protein signature prediction, sequencing errors in target genomes substantially reduce the detection of amino acid sequence conservation, even if the draft is of high quality. In summary, high quality draft of target and low quality draft of NNs appears to be a cost-effective investment for DNA signature prediction, but may lead to underestimation of predicted protein signatures.

  4. Cloning, sequencing, and expression of the Zymomonas mobilis fructokinase gene and structural comparison of the enzyme with other hexose kinases.

    PubMed Central

    Zembrzuski, B; Chilco, P; Liu, X L; Liu, J; Conway, T; Scopes, R

    1992-01-01

    The frk gene encoding the enzyme fructokinase (fructose 6-phosphotransferase [EC 2.7.1.4]) from Zymomonas mobilis has been isolated on a partial TaqI digest fragment of the genome and sequenced. An open reading frame of 906 bp corresponding to 302 amino acids was identified on a 3-kbp TaqI fragment. The deduced amino acid sequence corresponds to the first 20 amino acids (including an N-terminal methionine) determined by amino acid sequencing of the purified protein. The 118 bp preceding the methionine codon on this fragment do not appear to contain a promoter sequence. There was weak expression of the active enzyme in the recombinant Escherichia coli clone under control of the lac promoter on the pUC plasmid. Comparison of the amino acid sequence with that of the glucokinase enzyme (EC 2.7.1.2) from Z. mobilis reveals relatively little homology, despite the fact that fructokinase also binds glucose and has kinetic and structural properties similar to those of glucokinase. Also, there is little homology with hexose kinases that have been sequenced from other organisms. Northern (RNA) blot analysis showed that the frk transcript is 1.2 kb long. Fructokinase activity was elevated up to twofold when Z. mobilis was grown on fructose instead of glucose, and there was a parallel increase in frk mRNA levels. Differential mRNA stability was not a factor, since the half-lives of the frk transcript were 6.2 min for glucose-grown cells and 6.6 min for fructose-grown cells. PMID:1317376

  5. Comparison of Dixon Sequences for Estimation of Percent Breast Fibroglandular Tissue

    PubMed Central

    Ledger, Araminta E. W.; Scurr, Erica D.; Hughes, Julie; Macdonald, Alison; Wallace, Toni; Thomas, Karen; Wilson, Robin; Leach, Martin O.; Schmidt, Maria A.

    2016-01-01

    Objectives: To evaluate sources of error in the Magnetic Resonance Imaging (MRI) measurement of percent fibroglandular tissue (%FGT) using two-point Dixon sequences for fat-water separation. Methods: Ten female volunteers (median age: 31 yrs, range: 23–50 yrs) gave informed consent following Research Ethics Committee approval. Each volunteer was scanned twice following repositioning to enable an estimation of measurement repeatability from high-resolution gradient-echo (GRE) proton-density (PD)-weighted Dixon sequences. Differences in measures of %FGT attributable to resolution, T1 weighting and sequence type were assessed by comparison of this Dixon sequence with low-resolution GRE PD-weighted Dixon data, and against gradient-echo (GRE) or spin-echo (SE) based T1-weighted Dixon datasets, respectively. Results: %FGT measurement from high-resolution PD-weighted Dixon sequences had a coefficient of repeatability of ±4.3%. There was no significant difference in %FGT between high-resolution and low-resolution PD-weighted data. Values of %FGT from GRE and SE T1-weighted data were strongly correlated with that derived from PD-weighted data (r = 0.995 and 0.96, respectively). However, both sequences exhibited higher mean %FGT by 2.9% (p < 0.0001) and 12.6% (p < 0.0001), respectively, in comparison with PD-weighted data; the increase in %FGT from the SE T1-weighted sequence was significantly larger at lower breast densities. Conclusion: Although measurement of %FGT at low resolution is feasible, T1 weighting and sequence type impact on the accuracy of Dixon-based %FGT measurements; Dixon MRI protocols for %FGT measurement should be carefully considered, particularly for longitudinal or multi-centre studies. PMID:27011312

  6. A Primary Sequence Analysis of the ARGONAUTE Protein Family in Plants

    PubMed Central

    Rodríguez-Leal, Daniel; Castillo-Cobián, Amanda; Rodríguez-Arévalo, Isaac; Vielle-Calzada, Jean-Philippe

    2016-01-01

    Small RNA (sRNA)-mediated gene silencing represents a conserved regulatory mechanism controlling a wide diversity of developmental processes through interactions of sRNAs with proteins of the ARGONAUTE (AGO) family. On the basis of a large phylogenetic analysis that includes 206 AGO genes belonging to 23 plant species, AGO genes group into four clades corresponding to the phylogenetic distribution proposed for the ten family members of Arabidopsis thaliana. A primary analysis of the corresponding protein sequences resulted in 50 sequences of amino acids (blocks) conserved across their linear length. Protein members of the AGO4/6/8/9 and AGO1/10 clades are more conserved than members of the AGO5 and AGO2/3/7 clades. In addition to blocks containing components of the PIWI, PAZ, and DUF1785 domains, members of the AGO2/3/7 and AGO4/6/8/9 clades possess other consensus block sequences that are exclusive of members within these clades, suggesting unforeseen functional specialization revealed by their primary sequence. We also show that AGO proteins of animal and plant kingdoms share linear sequences of blocks that include motifs involved in posttranslational modifications such as those regulating AGO2 in humans and the PIWI protein AUBERGINE in Drosophila. Our results open possibilities for exploring new structural and functional aspects related to the evolution of AGO proteins within the plant kingdom, and their convergence with analogous proteins in mammals and invertebrates.

  7. A Primary Sequence Analysis of the ARGONAUTE Protein Family in Plants.

    PubMed

    Rodríguez-Leal, Daniel; Castillo-Cobián, Amanda; Rodríguez-Arévalo, Isaac; Vielle-Calzada, Jean-Philippe

    2016-01-01

    Small RNA (sRNA)-mediated gene silencing represents a conserved regulatory mechanism controlling a wide diversity of developmental processes through interactions of sRNAs with proteins of the ARGONAUTE (AGO) family. On the basis of a large phylogenetic analysis that includes 206 AGO genes belonging to 23 plant species, AGO genes group into four clades corresponding to the phylogenetic distribution proposed for the ten family members of Arabidopsis thaliana. A primary analysis of the corresponding protein sequences resulted in 50 sequences of amino acids (blocks) conserved across their linear length. Protein members of the AGO4/6/8/9 and AGO1/10 clades are more conserved than members of the AGO5 and AGO2/3/7 clades. In addition to blocks containing components of the PIWI, PAZ, and DUF1785 domains, members of the AGO2/3/7 and AGO4/6/8/9 clades possess other consensus block sequences that are exclusive of members within these clades, suggesting unforeseen functional specialization revealed by their primary sequence. We also show that AGO proteins of animal and plant kingdoms share linear sequences of blocks that include motifs involved in posttranslational modifications such as those regulating AGO2 in humans and the PIWI protein AUBERGINE in Drosophila. Our results open possibilities for exploring new structural and functional aspects related to the evolution of AGO proteins within the plant kingdom, and their convergence with analogous proteins in mammals and invertebrates. PMID:27635128

  8. A Primary Sequence Analysis of the ARGONAUTE Protein Family in Plants.

    PubMed

    Rodríguez-Leal, Daniel; Castillo-Cobián, Amanda; Rodríguez-Arévalo, Isaac; Vielle-Calzada, Jean-Philippe

    2016-01-01

    Small RNA (sRNA)-mediated gene silencing represents a conserved regulatory mechanism controlling a wide diversity of developmental processes through interactions of sRNAs with proteins of the ARGONAUTE (AGO) family. On the basis of a large phylogenetic analysis that includes 206 AGO genes belonging to 23 plant species, AGO genes group into four clades corresponding to the phylogenetic distribution proposed for the ten family members of Arabidopsis thaliana. A primary analysis of the corresponding protein sequences resulted in 50 sequences of amino acids (blocks) conserved across their linear length. Protein members of the AGO4/6/8/9 and AGO1/10 clades are more conserved than members of the AGO5 and AGO2/3/7 clades. In addition to blocks containing components of the PIWI, PAZ, and DUF1785 domains, members of the AGO2/3/7 and AGO4/6/8/9 clades possess other consensus block sequences that are exclusive of members within these clades, suggesting unforeseen functional specialization revealed by their primary sequence. We also show that AGO proteins of animal and plant kingdoms share linear sequences of blocks that include motifs involved in posttranslational modifications such as those regulating AGO2 in humans and the PIWI protein AUBERGINE in Drosophila. Our results open possibilities for exploring new structural and functional aspects related to the evolution of AGO proteins within the plant kingdom, and their convergence with analogous proteins in mammals and invertebrates.

  9. A Primary Sequence Analysis of the ARGONAUTE Protein Family in Plants

    PubMed Central

    Rodríguez-Leal, Daniel; Castillo-Cobián, Amanda; Rodríguez-Arévalo, Isaac; Vielle-Calzada, Jean-Philippe

    2016-01-01

    Small RNA (sRNA)-mediated gene silencing represents a conserved regulatory mechanism controlling a wide diversity of developmental processes through interactions of sRNAs with proteins of the ARGONAUTE (AGO) family. On the basis of a large phylogenetic analysis that includes 206 AGO genes belonging to 23 plant species, AGO genes group into four clades corresponding to the phylogenetic distribution proposed for the ten family members of Arabidopsis thaliana. A primary analysis of the corresponding protein sequences resulted in 50 sequences of amino acids (blocks) conserved across their linear length. Protein members of the AGO4/6/8/9 and AGO1/10 clades are more conserved than members of the AGO5 and AGO2/3/7 clades. In addition to blocks containing components of the PIWI, PAZ, and DUF1785 domains, members of the AGO2/3/7 and AGO4/6/8/9 clades possess other consensus block sequences that are exclusive of members within these clades, suggesting unforeseen functional specialization revealed by their primary sequence. We also show that AGO proteins of animal and plant kingdoms share linear sequences of blocks that include motifs involved in posttranslational modifications such as those regulating AGO2 in humans and the PIWI protein AUBERGINE in Drosophila. Our results open possibilities for exploring new structural and functional aspects related to the evolution of AGO proteins within the plant kingdom, and their convergence with analogous proteins in mammals and invertebrates. PMID:27635128

  10. rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison

    PubMed Central

    Hahn, Lars; Leimeister, Chris-André; Morgenstern, Burkhard

    2016-01-01

    Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don’t-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/ PMID:27760124
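
    The idea of filtering words through a binary pattern of match and don't-care positions can be sketched as follows. This is not rasbhari itself (which optimizes pattern sets with a hill-climbing algorithm); it only illustrates how a given pattern is applied. Patterns are written as strings of `1` (match) and `0` (don't care), and the function names and sequences are hypothetical.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Yield the spaced word at each position: keep residues under '1', skip '0'."""
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[start + i] for i in match_positions)


def shared_spaced_words(seq_a, seq_b, patterns):
    """Count spaced-word matches between two sequences for each pattern."""
    result = {}
    for pattern in patterns:
        counts_a = Counter(spaced_words(seq_a, pattern))
        counts_b = Counter(spaced_words(seq_b, pattern))
        # Every occurrence in seq_a pairs with every occurrence in seq_b.
        result[pattern] = sum(counts_a[w] * counts_b[w] for w in counts_a)
    return result


if __name__ == "__main__":
    # The sequences differ at the third position, which "1101" ignores.
    print(list(spaced_words("ACGTAC", "1101")))
    print(shared_spaced_words("ACGTAC", "ACCTAC", ["1101", "1111"]))
```

    In the toy example the contiguous pattern `1111` finds no shared word, while `1101` tolerates the single mismatch, which is the effect that makes the choice of patterns matter.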

  11. Comparison between optimized GRE and RARE sequences for 19F MRI studies

    NASA Astrophysics Data System (ADS)

    Soffientini, Chiara D.; Mastropietro, Alfonso; Caffini, Matteo; Cocco, Sara; Zucca, Ileana; Scotti, Alessandro; Baselli, Giuseppe; Bruzzone, Maria Grazia

    2014-03-01

    In 19F-MRI studies, the limiting factors are the low signal, due to the low concentration of 19F nuclei necessary for biological applications, and the inherent low sensitivity of MRI. Hence, acquiring images using the pulse sequence with the best signal-to-noise ratio (SNR), by optimizing the acquisition parameters specifically for a 19F compound, is a core issue. In 19F-MRI, multiple-spin-echo (RARE) and gradient-echo (GRE) are the two most frequently used pulse sequence families; therefore we performed an optimization study of GRE pulse sequences based on numerical simulations and experimental acquisitions on fluorinated compounds. We compared GRE performance to an optimized RARE sequence. Images were acquired on a 7T MRI preclinical scanner on phantoms containing different fluorinated compounds. Actual relaxation times (T1, T2, T2*) were evaluated in order to predict SNR dependence on sequence parameters. Experimental comparisons between spoiled GRE and RARE, obtained at a fixed acquisition time and in steady-state condition, showed the RARE sequence outperforming the spoiled GRE (up to 406% higher). Conversely, the use of the unbalanced-SSFP showed a significant increase in SNR compared to RARE (up to 28% higher). Moreover, this sequence (as GRE in general) was confirmed to be virtually insensitive to T1 and T2 relaxation times, after proper optimization, thus improving marker independence from the biological environment. These results confirm the efficacy of the proposed optimization tool and foster further investigation addressing in-vivo applicability.
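
    To illustrate how relaxation times can predict the signal's dependence on sequence parameters, the sketch below uses the textbook steady-state equation for an ideal spoiled GRE sequence and the Ernst angle that maximizes it. This is not the authors' optimization tool; the TR, TE, T1 and T2* values are hypothetical and noise is not modelled.

```python
import math


def spoiled_gre_signal(flip_deg, tr, te, t1, t2_star, m0=1.0):
    """Ideal spoiled-GRE steady-state signal (all times in the same unit)."""
    e1 = math.exp(-tr / t1)
    alpha = math.radians(flip_deg)
    steady_state = math.sin(alpha) * (1 - e1) / (1 - e1 * math.cos(alpha))
    return m0 * steady_state * math.exp(-te / t2_star)


def ernst_angle_deg(tr, t1):
    """Flip angle that maximizes the spoiled-GRE signal for a given TR and T1."""
    return math.degrees(math.acos(math.exp(-tr / t1)))


# Hypothetical parameters in milliseconds, chosen only for illustration.
tr, te, t1, t2_star = 10.0, 3.0, 1000.0, 20.0

# Brute-force search over flip angles from 0.1 to 90 degrees in 0.1 degree steps.
angles = [i / 10 for i in range(1, 901)]
best = max(angles, key=lambda a: spoiled_gre_signal(a, tr, te, t1, t2_star))

print(best, ernst_angle_deg(tr, t1))
```

    The grid search and the closed-form Ernst angle should agree to within the grid spacing, which is a quick consistency check on the signal equation.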

  12. mtDNAprofiler: a Web application for the nomenclature and comparison of human mitochondrial DNA sequences.

    PubMed

    Yang, In Seok; Lee, Hwan Young; Yang, Woo Ick; Shin, Kyoung-Jin

    2013-07-01

    Mitochondrial DNA (mtDNA) is a valuable tool in the fields of forensic, population, and medical genetics. However, recording and comparing mtDNA control region or entire genome sequences can be difficult for researchers who are not familiar with mtDNA nomenclature conventions. Therefore, mtDNAprofiler, a Web application, was designed for the analysis and comparison of mtDNA sequences in a string format or as a list of mtDNA single-nucleotide polymorphisms (mtSNPs). mtDNAprofiler, which comprises four mtDNA sequence-analysis tools (mtDNA nomenclature, mtDNA assembly, mtSNP conversion, and mtSNP concordance-check), supports not only the accurate analysis of mtDNA sequences via an automated nomenclature function, but also consistent management of mtSNP data via direct comparison and validity-check functions. Since mtDNAprofiler consists of four tools that are associated with key steps of mtDNA sequence analysis, mtDNAprofiler will be helpful for researchers working with mtDNA. mtDNAprofiler is freely available at http://mtprofiler.yonsei.ac.kr. PMID:23682804
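
    The idea of converting a sequence string into a list of mtSNPs and then checking two lists for concordance can be sketched as follows. The sketch assumes, as a simplification, that the sample is already aligned to the reference without insertions or deletions, and it writes each substitution as the reference position followed by the sample base. Function names and toy sequences are hypothetical; this is not mtDNAprofiler's code, and the real reference sequence and indel notation are not modelled.

```python
def substitutions(reference, sample, first_position=1):
    """List substitutions as '<position><sample base>' for equal-length aligned strings."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{pos}{s}"
        for pos, (r, s) in enumerate(zip(reference, sample), start=first_position)
        if r != s
    ]


def concordance(snps_a, snps_b):
    """Compare two SNP lists and report shared and list-specific entries."""
    a, b = set(snps_a), set(snps_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Toy reference and samples (hypothetical, not real mtDNA).
reference = "ACGTAC"
snps_1 = substitutions(reference, "ACATAT")
snps_2 = substitutions(reference, "ACATAC")
print(snps_1, snps_2, concordance(snps_1, snps_2))
```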

  13. Effect of k-tuple length on sample-comparison with high-throughput sequencing data.

    PubMed

    Wang, Ying; Lei, Xiaoye; Wang, Shun; Wang, Zicheng; Song, Nianfeng; Zeng, Feng; Chen, Ting

    2016-01-22

    High-throughput metagenomic sequencing offers a powerful technique for comparing microbial communities. Without requiring extra reference sequences, alignment-free models with short k-tuples (k = 2-10 bp) have yielded promising results. Short k-tuples describe the overall statistical distribution, but they struggle to capture the specific characteristics within one microbial community. Longer k-tuples contain more abundant information. However, because the frequency vector of long k-tuples (k ≥ 30 bp) is sparse, the statistical measures designed for short k-tuples are not applicable. In our study, we considered each tuple as a meaningful word and each sequencing data set as a document composed of these words. The comparison between two sequencing data sets is therefore treated as "topic analysis of documents" in text mining. We designed a pipeline with long k-tuple features that compares metagenomic samples using a combination of algorithms from text mining and pattern recognition. The pipeline is available at http://culotuple.codeplex.com/. Experiments show that our pipeline with long k-tuple features: ① separates genomes with high similarity; ② outperforms short k-tuple models in all experiments (when k ≥ 12, the short k-tuple measures are no longer applicable; when k is between 20 and 40, the long k-tuple pipeline obtains much better grouping results); ③ is free from the effect of sequencing platforms/protocols. We also obtained meaningful and supported biological results on the 40-tuples selected for comparison.
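
    The "tuple as word, sample as document" view can be illustrated with a small sketch that counts k-tuples in a set of reads and compares two samples by the cosine similarity of their sparse count vectors. Cosine similarity is a stand-in chosen here for illustration; the abstract does not specify the measures used in the pipeline, and the function names and toy reads are assumptions.

```python
import math
from collections import Counter


def ktuple_counts(reads, k):
    """Count every k-tuple (word) over all reads of one sample (document)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(n * b[w] for w, n in a.items() if w in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy samples (hypothetical reads); real data would use much longer k.
sample_1 = ktuple_counts(["ACGTACGT", "CGTACG"], k=4)
sample_2 = ktuple_counts(["ACGTACGA"], k=4)
print(cosine_similarity(sample_1, sample_2))
```

    Storing only the observed tuples keeps the representation compact even when k is large and the full frequency vector would be sparse.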

  14. Adhesive proteins of stalked and acorn barnacles display homology with low sequence similarities.

    PubMed

    Jonker, Jaimie-Leigh; Abram, Florence; Pires, Elisabete; Varela Coelho, Ana; Grunwald, Ingo; Power, Anne Marie

    2014-01-01

    Barnacle adhesion underwater is an important phenomenon to understand for the prevention of biofouling and potential biotechnological innovations, yet so far, identifying what makes barnacle glue proteins 'sticky' has proved elusive. Examination of a broad range of species within the barnacles may be instructive to identify conserved adhesive domains. We add to extensive information from the acorn barnacles (order Sessilia) by providing the first protein analysis of a stalked barnacle adhesive, Lepas anatifera (order Lepadiformes). It was possible to separate the L. anatifera adhesive into at least 10 protein bands using SDS-PAGE. Intense bands were present at approximately 30, 70, 90 and 110 kilodaltons (kDa). Mass spectrometry for protein identification was followed by de novo sequencing which detected 52 peptides of 7-16 amino acids in length. None of the peptides matched published or unpublished transcriptome sequences, but some amino acid sequence similarity was apparent between L. anatifera and closely-related Dosima fascicularis. Antibodies against two acorn barnacle proteins (ab-cp-52k and ab-cp-68k) showed cross-reactivity in the adhesive glands of L. anatifera. We also analysed the similarity of adhesive proteins across several barnacle taxa, including Pollicipes pollicipes (a stalked barnacle in the order Scalpelliformes). Sequence alignment of published expressed sequence tags clearly indicated that P. pollicipes possesses homologues for the 19 kDa and 100 kDa proteins in acorn barnacles. Homology aside, sequence similarity in amino acid and gene sequences tended to decline as taxonomic distance increased, with minimum similarities of 18-26%, depending on the gene. The results indicate that some adhesive proteins (e.g. 100 kDa) are more conserved within barnacles than others (20 kDa).

  15. Adhesive Proteins of Stalked and Acorn Barnacles Display Homology with Low Sequence Similarities

    PubMed Central

    Jonker, Jaimie-Leigh; Abram, Florence; Pires, Elisabete; Varela Coelho, Ana; Grunwald, Ingo; Power, Anne Marie

    2014-01-01

    Barnacle adhesion underwater is an important phenomenon to understand for the prevention of biofouling and potential biotechnological innovations, yet so far, identifying what makes barnacle glue proteins ‘sticky’ has proved elusive. Examination of a broad range of species within the barnacles may be instructive to identify conserved adhesive domains. We add to extensive information from the acorn barnacles (order Sessilia) by providing the first protein analysis of a stalked barnacle adhesive, Lepas anatifera (order Lepadiformes). It was possible to separate the L. anatifera adhesive into at least 10 protein bands using SDS-PAGE. Intense bands were present at approximately 30, 70, 90 and 110 kilodaltons (kDa). Mass spectrometry for protein identification was followed by de novo sequencing which detected 52 peptides of 7–16 amino acids in length. None of the peptides matched published or unpublished transcriptome sequences, but some amino acid sequence similarity was apparent between L. anatifera and closely-related Dosima fascicularis. Antibodies against two acorn barnacle proteins (ab-cp-52k and ab-cp-68k) showed cross-reactivity in the adhesive glands of L. anatifera. We also analysed the similarity of adhesive proteins across several barnacle taxa, including Pollicipes pollicipes (a stalked barnacle in the order Scalpelliformes). Sequence alignment of published expressed sequence tags clearly indicated that P. pollicipes possesses homologues for the 19 kDa and 100 kDa proteins in acorn barnacles. Homology aside, sequence similarity in amino acid and gene sequences tended to decline as taxonomic distance increased, with minimum similarities of 18–26%, depending on the gene. The results indicate that some adhesive proteins (e.g. 100 kDa) are more conserved within barnacles than others (20 kDa). PMID:25295513

  16. How much of protein sequence space has been explored by life on Earth?

    PubMed

    Dryden, David T F; Thomson, Andrew R; White, John H

    2008-08-01

    We suggest that the vastness of protein sequence space is actually completely explorable during the populating of the Earth by life by considering upper and lower limits for the number of organisms, genome size, mutation rate and the number of functionally distinct classes of amino acids. We conclude that rather than life having explored only an infinitesimally small part of sequence space in the last 4 Gyr, it is instead quite plausible for all of functional protein sequence space to have been explored and that furthermore, at the molecular level, there is no role for contingency.

  17. Enzyme sequence similarity improves the reaction alignment method for cross-species pathway comparison

    SciTech Connect

    Ovacik, Meric A.; Androulakis, Ioannis P.

    2013-09-15

    Pathway-based information has become an important source of information for both establishing evolutionary relationships and understanding the mode of action of a chemical or pharmaceutical among species. Cross-species comparison of pathways can address two broad questions: comparison in order to inform evolutionary relationships and to extrapolate species differences used in a number of different applications including drug and toxicity testing. Cross-species comparison of metabolic pathways is complex as there are multiple features of a pathway that can be modeled and compared. Among the various methods that have been proposed, reaction alignment has emerged as the most successful at predicting phylogenetic relationships based on NCBI taxonomy. We propose an improvement of the reaction alignment method that accounts for enzyme sequence similarity in addition to reaction alignment. Using nine species, including human and some model organisms and test species, we evaluate the standard and improved comparison methods by analyzing the conservation of the glycolysis and citrate cycle pathways. In addition, we demonstrate how organism comparison can be conducted by accounting for the cumulative information retrieved from nine pathways in central metabolism as well as a more complete study involving 36 pathways common to all nine species. Our results indicate that reaction alignment with enzyme sequence similarity results in a more accurate representation of pathway-specific cross-species similarities and differences based on NCBI taxonomy.

  18. Protein-Protein Interactions Inferred from Domain-Domain Interactions in Genogroup II Genotype 4 Norovirus Sequences

    PubMed Central

    Huang, Chuan-Ching

    2013-01-01

    Severe gastroenteritis and foodborne illness caused by Noroviruses (NoVs) during the winter are a worldwide phenomenon. Vulnerable populations including young children and elderly and immunocompromised people often require hospitalization and may die. However, no efficient vaccine for NoVs exists because of their variable genome sequences. This study investigates the infection processes in protein-protein interactions between hosts and NoVs. Protein-protein interactions were collected from related Pfam NoV domains. The related Pfam domains were accumulated incrementally from the protein domain interaction database. To examine the influence of domain intimacy, the 7 NoV domains were grouped by depth. The number of domain-domain interactions increased exponentially as the depth increased. Many protein-protein interactions were relevant; therefore, cloud techniques were used to analyze data because of their computational capacity. The infection relationship between hosts and NoVs should be used in clinical applications and drug design. PMID:23738320

  19. Vertebrate DM domain proteins bind similar DNA sequences and can heterodimerize on DNA

    PubMed Central

    Murphy, Mark W; Zarkower, David; Bardwell, Vivian J

    2007-01-01

    Background: The DM domain is a zinc finger-like DNA binding motif first identified in the sexual regulatory proteins Doublesex (DSX) and MAB-3, and is widely conserved among metazoans. DM domain proteins regulate sexual differentiation in at least three phyla and also control other aspects of development, including vertebrate segmentation. Most DM domain proteins share little similarity outside the DM domain. DSX and MAB-3 bind partially overlapping DNA sequences, and DSX has been shown to interact with DNA via the minor groove without inducing DNA bending. DSX and MAB-3 exhibit unusually high DNA sequence specificity relative to other minor groove binding proteins. No detailed analysis of DNA binding by the seven vertebrate DM domain proteins, DMRT1-DMRT7 has been reported, and thus it is unknown whether they recognize similar or diverse DNA sequences. Results: We used a random oligonucleotide in vitro selection method to determine DNA binding sites for six of the seven proteins. These proteins selected sites resembling that of DSX despite differences in the sequence of the DM domain recognition helix, but they varied in binding efficiency and in preferences for particular nucleotides, and some behaved anomalously in gel mobility shift assays. DMRT1 protein from mouse testis extracts binds the sequence we determined, and the DMRT proteins can bind their in vitro-defined sites in transfected cells. We also find that some DMRT proteins can bind DNA as heterodimers. Conclusion: Our results suggest that target gene specificity of the DMRT proteins does not derive exclusively from major differences in DNA binding specificity. Instead target specificity may come from more subtle differences in DNA binding preference between different homodimers, together with differences in binding specificity between homodimers versus heterodimers. PMID:17605809

  20. Preparative Protein Production from Inclusion Bodies and Crystallization: A Seven-Week Biochemistry Sequence

    ERIC Educational Resources Information Center

    Peterson, Megan J.; Snyder, W. Kalani; Westerman, Shelley; McFarland, Benjamin J.

    2011-01-01

    We describe how to produce and purify proteins from "Escherichia coli" inclusion bodies by adapting versatile, preparative-scale techniques to the undergraduate laboratory schedule. This 7-week sequence of experiments fits into an annual cycle of research activity in biochemistry courses. Recombinant proteins are expressed as inclusion bodies,…

  1. Relating sequence encoded information to form and function of intrinsically disordered proteins

    PubMed Central

    Das, Rahul K.; Ruff, Kiersten M.; Pappu, Rohit V.

    2015-01-01

    Intrinsically disordered proteins (IDPs) showcase the importance of conformational plasticity and heterogeneity in protein function. We summarize recent advances that connect information encoded in IDP sequences to their conformational properties and functions. We focus on insights obtained through a combination of atomistic simulations and biophysical measurements that are synthesized into a coherent framework using polymer physics theories. PMID:25863585

  2. Protein evolution analysis of S-hydroxynitrile lyase by complete sequence design utilizing the INTMSAlign software.

    PubMed

    Nakano, Shogo; Asano, Yasuhisa

    2015-02-03

    Development of software and methods for design of complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software, and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that accumulation of neutral mutations occurs during the process of S-HNLs evolution from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs.
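
    As a rough illustration of assigning consensus residues from a multiple sequence alignment, the sketch below takes the most frequent non-gap residue in each column. It is a generic consensus calculation, not the INTMSAlign algorithm, and it does not attempt the correlation-residue analysis; the function name and toy alignment are assumptions.

```python
from collections import Counter


def consensus_sequence(alignment, gap="-"):
    """Most frequent non-gap residue per column of an equal-length alignment."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*alignment):
        residues = Counter(c for c in column if c != gap)
        # A column made only of gaps stays a gap in the consensus.
        consensus.append(residues.most_common(1)[0][0] if residues else gap)
    return "".join(consensus)


# Toy alignment (hypothetical sequences).
alignment = ["ACDE-", "ACDF-", "GCDE-", "AC-EK"]
print(consensus_sequence(alignment))
```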

  3. Protein evolution analysis of S-hydroxynitrile lyase by complete sequence design utilizing the INTMSAlign software

    PubMed Central

    Nakano, Shogo; Asano, Yasuhisa

    2015-01-01

    Development of software and methods for design of complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software, and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that accumulation of neutral mutations occurs during the process of S-HNLs evolution from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs. PMID:25645341

  4. Effect of single-point sequence alterations on the aggregation propensity of a model protein

    SciTech Connect

    Bratko, Dusan; Cellmer, Troy; Prausnitz, John M.; Blanch, Harvey W.

    2005-10-07

    Sequences of contemporary proteins are believed to have evolved through a process that optimized their overall fitness, including their resistance to deleterious aggregation. Biotechnological processing may expose therapeutic proteins to conditions that are much more conducive to aggregation than those encountered in a cellular environment. An important task of protein engineering is to identify alternative sequences that would protect proteins when processed at high concentrations without altering their native structure associated with specific biological function. Our computational studies exploit parallel tempering simulations of coarse-grained model proteins to demonstrate that isolated amino-acid residue substitutions can result in significant changes in the aggregation resistance of the protein in a crowded environment while retaining protein structure in isolation. A thermodynamic analysis of protein clusters subject to competing processes of folding and association shows that moderate mutations can produce effects similar to those caused by changes in system conditions, including temperature, concentration, and solvent composition, that affect the aggregation propensity. The range of conditions where a protein can resist aggregation can therefore be tuned by sequence alterations, although the protein generally may retain its generic ability for aggregation.

  5. RPI-Pred: predicting ncRNA-protein interaction using sequence and structural information

    PubMed Central

    Suresh, V.; Liu, Liang; Adjeroh, Donald; Zhou, Xiaobo

    2015-01-01

    RNA-protein complexes are essential in mediating important fundamental cellular processes, such as transport and localization. In particular, ncRNA-protein interactions play an important role in post-transcriptional gene regulation like mRNA localization, mRNA stabilization, poly-adenylation, splicing and translation. The experimental methods to solve RNA-protein interaction prediction problem remain expensive and time-consuming. Here, we present the RPI-Pred (RNA-protein interaction predictor), a new support-vector machine-based method, to predict protein-RNA interaction pairs, based on both the sequences and structures. The results show that RPI-Pred can correctly predict RNA-protein interaction pairs with ∼94% prediction accuracy when using sequence and experimentally determined protein and RNA structures, and with ∼83% when using sequences and predicted protein and RNA structures. Further, our proposed method RPI-Pred was superior to other existing ones by predicting more experimentally validated ncRNA-protein interaction pairs from different organisms. Motivated by the improved performance of RPI-Pred, we further applied our method for reliable construction of ncRNA-protein interaction networks. The RPI-Pred is publicly available at: http://ctsb.is.wfubmc.edu/projects/rpi-pred. PMID:25609700

  6. AMS 4.0: consensus prediction of post-translational modifications in protein sequences.

    PubMed

    Plewczynski, Dariusz; Basu, Subhadip; Saha, Indrajit

    2012-08-01

    We present here the 2011 update of the AutoMotif Service (AMS 4.0), which predicts a wide selection of 88 different types of single amino acid post-translational modifications (PTMs) in protein sequences. The selection of experimentally confirmed modifications is acquired from the latest UniProt and Phospho.ELM databases for training. The sequence vicinity of each modified residue is represented using amino acid physico-chemical features encoded using high quality indices (HQI) obtained by automatic clustering of known indices extracted from the AAindex database. For each type of numerical representation, the method builds an ensemble of Multi-Layer Perceptron (MLP) pattern classifiers, each optimising a different objective during training (for example the recall, precision or area under the ROC curve (AUC)). The consensus is built using brainstorming technology, which combines multi-objective instances of the machine learning algorithm and the data fusion of different training object representations in order to boost the overall prediction accuracy of conserved short sequence motifs. The performance of AMS 4.0 is compared with the accuracy of previous versions, which were constructed using single machine learning methods (artificial neural networks, support vector machine). Our software improves the average AUC score of the earlier version by close to 7 % as calculated on the test datasets of all 88 PTM types. Moreover, for the selected most-difficult sequence motif types it is able to improve the prediction performance by almost 32 %, when compared with previously used single machine learning methods. Summarising, the brainstorming consensus meta-learning methodology on average boosts the AUC score up to around 89 %, averaged over all 88 PTM types. Detailed results for single machine learning methods and the consensus methodology are also provided, together with the comparison to previously published methods and state-of-the-art software tools.

  7. Definition and Analysis of a System for the Automated Comparison of Curriculum Sequencing Algorithms in Adaptive Distance Learning

    ERIC Educational Resources Information Center

    Limongelli, Carla; Sciarrone, Filippo; Temperini, Marco; Vaste, Giulia

    2011-01-01

    LS-Lab provides automatic support to comparison/evaluation of the Learning Object Sequences produced by different Curriculum Sequencing Algorithms. Through this framework a teacher can verify the correspondence between the behaviour of different sequencing algorithms and her pedagogical preferences. In fact the teacher can compare algorithms…

  8. Cloning and sequence of the gene for heat shock protein 60 from Chlamydia trachomatis and immunological reactivity of the protein.

    PubMed Central

    Cerrone, M C; Ma, J J; Stephens, R S

    1991-01-01

    We isolated and sequenced the gene for the chlamydial heat shock protein 60 (HSP-60) from a Chlamydia trachomatis genomic library by molecular genetic methods. The DNA sequence derived revealed an operon-like gene structure with two open reading frames encoding an 11,122- and a 57,956-Da protein. The translated amino acid sequence of the larger open reading frame showed a high degree of homology with known sequences for HSP-60 from several bacterial species as well as with plant and human sequences. By using the determined nucleotide sequence, fragments of the gene were cloned into the plasmid vector pGEX for expression as fusion proteins consisting of glutathione S-transferase and peptide portions of the chlamydial HSP-60. HSP-60 antigenic identity was confirmed by an immunoblot with anti-HSP-60 rabbit serum. Sera from patients that exhibited both high antichlamydial titers and reactivity to chlamydial HSP-60 showed reactivity on immunoblots to two fusion proteins that represented portions of the carboxyl-terminal half of the molecule, whereas fusion proteins defining the amino-terminal half were nonreactive. No reactivity with the fusion proteins was seen with sera from patients that had been previously screened as nonreactive to native chlamydial HSP-60 but which had high antichlamydial titers. Sera from noninfected control subjects also exhibited no reactivity. Definition of recognized HSP-60 epitopes may provide a predictive screen for those patients with C. trachomatis infections who may develop damaging sequelae, as well as providing tools for the study of immunopathogenic mechanisms of Chlamydia-induced disease. PMID:1987066

  9. M2SG: mapping human disease-related genetic variants to protein sequences and genomic loci

    PubMed Central

    Ji, Renkai; Cong, Qian; Li, Wenlin; Grishin, Nick V.

    2013-01-01

    Summary: Online Mendelian Inheritance in Man (OMIM) is a manually curated compendium of human genetic variants and the corresponding phenotypes, mostly human diseases. Instead of directly documenting the native sequences for gene entries, OMIM links its entries to protein and DNA sequences in other databases. However, because of the existence of gene isoforms and errors in OMIM records, mapping a specific OMIM mutation to its corresponding protein sequence is not trivial. Combining computer programs and extensive manual curation of OMIM full-text descriptions and original literature, we mapped 98% of OMIM amino acid substitutions (AASs) and all SwissProt Variant (SwissVar) disease-related AASs to reference sequences and confidently mapped 99.96% of all AASs to the genomic loci. Based on the results, we developed an online database and interactive web server (M2SG) to (i) retrieve the mapped OMIM and SwissVar variants for a given protein sequence; and (ii) obtain related proteins and mutations for an input disease phenotype. This database will be useful for analyzing sequences, understanding the effect of mutations, identifying important genetic variations and designing experiments on a protein of interest. Availability and implementation: The database and web server are freely available at http://prodata.swmed.edu/M2S/mut2seq.cgi. Contact: grishin@chop.swmed.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24002112

  10. Sequence and structural implications of a bovine corneal keratan sulfate proteoglycan core protein. Protein 37B represents bovine lumican and proteins 37A and 25 are unique

    NASA Technical Reports Server (NTRS)

    Funderburgh, J. L.; Funderburgh, M. L.; Brown, S. J.; Vergnes, J. P.; Hassell, J. R.; Mann, M. M.; Conrad, G. W.; Spooner, B. S. (Principal Investigator)

    1993-01-01

    Amino acid sequence from tryptic peptides of three different bovine corneal keratan sulfate proteoglycan (KSPG) core proteins (designated 37A, 37B, and 25) showed similarities to the sequence of a chicken KSPG core protein lumican. Bovine lumican cDNA was isolated from a bovine corneal expression library by screening with chicken lumican cDNA. The bovine cDNA codes for a 342-amino acid protein, M(r) 38,712, containing amino acid sequences identified in the 37B KSPG core protein. The bovine lumican is 68% identical to chicken lumican, with an 83% identity excluding the N-terminal 40 amino acids. Location of 6 cysteine and 4 consensus N-glycosylation sites in the bovine sequence were identical to those in chicken lumican. Bovine lumican had about 50% identity to bovine fibromodulin and 20% identity to bovine decorin and biglycan. About two-thirds of the lumican protein consists of a series of 10 amino acid leucine-rich repeats that occur in regions of calculated high beta-hydrophobic moment, suggesting that the leucine-rich repeats contribute to beta-sheet formation in these proteins. Sequences obtained from 37A and 25 core proteins were absent in bovine lumican, thus predicting a unique primary structure and separate mRNA for each of the three bovine KSPG core proteins.
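
    Percent identity figures like those quoted above depend on which aligned columns are counted. The sketch below shows one common convention: identical residues divided by columns in which neither sequence has a gap, with an option to skip leading columns (analogous to excluding an N-terminal region). The convention, function name and toy sequences are assumptions, not the authors' procedure.

```python
def percent_identity(aligned_a, aligned_b, skip_columns=0, gap="-"):
    """Percent of identical residues over gap-free columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_a[skip_columns:], aligned_b[skip_columns:])
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


# Toy aligned sequences (hypothetical): the first two columns differ.
seq_a = "GGACDE"
seq_b = "TTACDE"
print(percent_identity(seq_a, seq_b))
print(percent_identity(seq_a, seq_b, skip_columns=2))
```

    Skipping a divergent leading region raises the reported identity, which mirrors the difference between the whole-protein figure and the figure that excludes the N-terminal residues.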

  11. Ancient conserved regions in new gene sequences and the protein databases

    SciTech Connect

    Green, P.; Hillier, L.; Waterston, R.; Lipman, D.; States, D.; Claverie, J.M.

    1993-03-19

    Sets of new gene sequences from human, nematode, and yeast were compared with each other and with a set of Escherichia coli genes in order to detect ancient evolutionarily conserved regions (ACRs) in the encoded proteins. Nearly all of the ACRs so identified were found to be homologous to sequences in the protein databases. This suggests that currently known proteins may already include representatives of most ACRs and that new sequences not similar to any database sequence are unlikely to contain ACRs. Preliminary analyses indicate that moderately expressed genes may be more likely to contain ACRs than rarely expressed genes. It is estimated that there are fewer than 900 ACRs in all. 20 refs., 2 figs., 4 tabs.

  12. Hydrogen Exchange Mass Spectrometry of Related Proteins with Divergent Sequences: A Comparative Study of HIV-1 Nef Allelic Variants

    NASA Astrophysics Data System (ADS)

    Wales, Thomas E.; Poe, Jerrod A.; Emert-Sedlak, Lori; Morgan, Christopher R.; Smithgall, Thomas E.; Engen, John R.

    2016-06-01

    Hydrogen exchange mass spectrometry can be used to compare the conformation and dynamics of proteins that are similar in tertiary structure. If relative deuterium levels are measured, differences in sequence, deuterium forward- and back-exchange, peptide retention time, and protease digestion patterns all complicate the data analysis. We illustrate what can be learned from such data sets by analyzing five variants (Consensus G2E, SF2, NL4-3, ELI, and LTNP4) of the HIV-1 Nef protein, both alone and when bound to the human Hck SH3 domain. Regions with similar sequence could be compared between variants. Although many of the hydrogen exchange features were preserved across the five proteins, the kinetics of Nef binding to Hck SH3 were not the same. These observations may be related to biological function, particularly for ELI Nef where we also observed an impaired ability to downregulate CD4 surface presentation. The data illustrate some of the caveats that must be considered for comparison experiments and provide a framework for investigations of other protein relatives, families, and superfamilies with HX MS.

  13. Thermodynamic features characterizing good and bad folding sequences obtained using a simplified off-lattice protein model

    NASA Astrophysics Data System (ADS)

    Amatori, A.; Ferkinghoff-Borg, J.; Tiana, G.; Broglia, R. A.

    2006-06-01

    The thermodynamics of the small SH3 protein domain is studied by means of a simplified model where each beadlike amino acid interacts with the others through a contact potential controlled by a 20×20 random matrix. Good folding sequences, characterized by a low native energy, display three main thermodynamical ensembles, namely, a coil-like ensemble, an unfolded globule, and a folded ensemble (plus two other states, frozen and random coils, populated only at extreme temperatures). Interestingly, the unfolded globule has some regions already structured. Poorly designed sequences, on the other hand, display a wide transition from the random coil to a frozen state. The comparison with the analytic theory of heteropolymers is discussed.

  14. Rapid comparison of protein binding site surfaces with Property Encoded Shape Distributions (PESD)

    PubMed Central

    Das, Sourav; Kokardekar, Arshad

    2009-01-01

    Patterns in shape and property distributions on the surface of binding sites are often conserved across functional proteins without significant conservation of the underlying amino-acid residues. To explore similarities of these sites from the viewpoint of a ligand, a sequence and fold-independent method was created to rapidly and accurately compare binding sites of proteins represented by property-mapped triangulated Gauss-Connolly surfaces. Within this paradigm, signatures for each binding site surface are produced by calculating their property-encoded shape distributions (PESD), a measure of the probability that a particular property will be at a specific distance to another on the molecular surface. Similarity between the signatures can then be treated as a measure of similarity between binding sites. As postulated, the PESD method rapidly detected high levels of similarity in binding site surface characteristics even in cases where there was very low similarity at the sequence level. In a screening experiment involving each member of the PDBBind 2005 dataset as a query against the rest of the set, PESD was able to retrieve a binding site with identical E.C. (Enzyme Commission) numbers as the top match in 79.5% of cases. The ability of the method to detect similarity in binding sites with low sequence conservation was compared with that of state-of-the-art binding site comparison methods. PMID:19919089
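
    The idea of a property-encoded shape distribution described in this abstract can be pictured with a small sketch. It bins the distances between pairs of labelled surface points and normalises each histogram per property pair. This is only an illustration under stated assumptions: straight-line (Euclidean) distance, the bin width, the number of bins, and the function name `shape_distribution` are hypothetical choices, not the published PESD implementation.

```python
import math
from collections import defaultdict
from itertools import combinations


def shape_distribution(points, bin_width=2.0, n_bins=5):
    """Histogram pairwise distances for each pair of property labels.

    points: list of ((x, y, z), property_label) tuples.
    Returns {(label_a, label_b): [fraction per distance bin]}.
    Euclidean distance and the binning scheme are assumptions.
    """
    hist = defaultdict(lambda: [0] * n_bins)
    for (p, a), (q, b) in combinations(points, 2):
        d = math.dist(p, q)
        # Distances beyond the last bin edge are clamped into the last bin.
        idx = min(int(d // bin_width), n_bins - 1)
        hist[tuple(sorted((a, b)))][idx] += 1
    # Normalise counts so each label pair yields a probability distribution.
    return {key: [c / sum(counts) for c in counts]
            for key, counts in hist.items()}


surface = [((0, 0, 0), "pos"), ((1, 0, 0), "neg"), ((5, 0, 0), "pos")]
signature = shape_distribution(surface)
```

    Two signatures built this way could then be compared bin by bin, which mirrors the abstract's statement that similarity between signatures serves as a measure of similarity between binding sites.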

  15. Sequence heuristics to encode phase behaviour in intrinsically disordered protein polymers

    PubMed Central

    Quiroz, Felipe García; Chilkoti, Ashutosh

    2015-01-01

    Proteins and synthetic polymers that undergo aqueous phase transitions mediate self-assembly in nature and in man-made material systems. Yet little is known about how the phase behaviour of a protein is encoded in its amino acid sequence. Here, by synthesizing intrinsically disordered, repeat proteins to test motifs that we hypothesized would encode phase behaviour, we show that the proteins can be designed to exhibit tunable lower or upper critical solution temperature (LCST and UCST, respectively) transitions in physiological solutions. We also show that mutation of key residues at the repeat level abolishes phase behaviour or encodes an orthogonal transition. Furthermore, we provide heuristics to identify, at the proteome level, proteins that might exhibit phase behaviour and to design novel protein polymers consisting of biologically active peptide repeats that exhibit LCST or UCST transitions. These findings set the foundation for the prediction and encoding of phase behaviour at the sequence level. PMID:26390327

  16. Sequence heuristics to encode phase behaviour in intrinsically disordered protein polymers.

    PubMed

    Quiroz, Felipe García; Chilkoti, Ashutosh

    2015-11-01

    Proteins and synthetic polymers that undergo aqueous phase transitions mediate self-assembly in nature and in man-made material systems. Yet little is known about how the phase behaviour of a protein is encoded in its amino acid sequence. Here, by synthesizing intrinsically disordered, repeat proteins to test motifs that we hypothesized would encode phase behaviour, we show that the proteins can be designed to exhibit tunable lower or upper critical solution temperature (LCST and UCST, respectively) transitions in physiological solutions. We also show that mutation of key residues at the repeat level abolishes phase behaviour or encodes an orthogonal transition. Furthermore, we provide heuristics to identify, at the proteome level, proteins that might exhibit phase behaviour and to design novel protein polymers consisting of biologically active peptide repeats that exhibit LCST or UCST transitions. These findings set the foundation for the prediction and encoding of phase behaviour at the sequence level.

  17. Sortase A as a novel molecular "stapler" for sequence-specific protein conjugation.

    PubMed

    Parthasarathy, Ranganath; Subramanian, Shyamsundar; Boder, Eric T

    2007-01-01

    The Sortase family of transpeptidase enzymes catalyzes sequence-specific ligation of proteins to the cell wall of Gram-positive bacteria. Here, we describe the application of recombinant Staphylococcus aureus Sortase A to attach a tagged model protein substrate (green fluorescent protein) to polystyrene beads chemically modified with either alkylamine or the in vivo Sortase A ligand, Gly-Gly-Gly, on their surfaces. Furthermore, we show that Sortase A can be used to sequence-specifically ligate eGFP to amino-terminated poly(ethylene glycol) and to generate protein oligomers and cyclized monomers using suitably tagged eGFP. We find that an alkylamine can substitute for the natural Gly3 substrate, which suggests the possibility of using the enzyme in materials applications. The highly specific and mild Sortase A-catalyzed reaction, based on small recognition tags unlikely to interfere with protein expression, thus represents a useful addition to the protein immobilization and modification tool kit.

  18. Nonrandomness in protein sequences: evidence for a physically driven stage of evolution?

    PubMed Central

    Pande, V S; Grosberg, A Y; Tanaka, T

    1994-01-01

    The sequences, or primary structures, of existing biopolymers--in particular, proteins--are believed to be a product of evolution. Are the sequences random? If not, what is the character of this nonrandomness? To explore the statistics of protein sequences, we use the idea of mapping the sequence onto the trajectory of a random walk, originally proposed by Peng et al. [Peng, C.-K., Buldyrev, S. V., Goldberger, A. L., Havlin, S., Sciortino, F., Simons, M. & Stanley, H. E. (1992) Nature (London) 356, 168-170] in their analysis of DNA sequences. Using three different mappings, corresponding to three basic physical interactions between amino acids, we found pronounced deviations from pure randomness, and these deviations seem directed toward minimization of the energy of the three-dimensional structure. We consider this result as evidence for a physically driven stage of evolution. PMID:7809157
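
    The mapping of a sequence onto a walk trajectory mentioned in this abstract can be sketched as follows. Each residue contributes a step of +1 if it belongs to a chosen class and -1 otherwise, and the steps are summed cumulatively. The hydrophobic residue set and the function name `walk_trajectory` are assumptions made for illustration; the abstract does not specify the three mappings the authors used.

```python
from itertools import accumulate

# Assumed residue class, for illustration only (not taken from the paper).
HYDROPHOBIC = set("AVLIMFWC")


def walk_trajectory(sequence, residue_class=HYDROPHOBIC):
    """Return the cumulative position of a +1/-1 walk along the sequence."""
    steps = [1 if aa in residue_class else -1 for aa in sequence.upper()]
    return list(accumulate(steps))


trajectory = walk_trajectory("MKVLAAGIDE")
print(trajectory)  # [1, 0, 1, 2, 3, 4, 3, 4, 3, 2]
```

    For a random sequence the trajectory would wander like an unbiased random walk, so systematic departures from that behaviour are the kind of nonrandomness the abstract refers to.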

  19. The iceLogo web server and SOAP service for determining protein consensus sequences.

    PubMed

    Maddelein, Davy; Colaert, Niklaas; Buchanan, Iain; Hulstaert, Niels; Gevaert, Kris; Martens, Lennart

    2015-07-01

    The iceLogo web server and SOAP service implement the previously published iceLogo algorithm. iceLogo builds on probability theory to visualize protein consensus sequences in a format resembling sequence logos. Peptide sequences are compared against a reference sequence set that can be tailored to the studied system and the used protocol. As such, not only over- but also underrepresented residues can be visualized in a statistically sound manner, which further allows the user to easily analyse and interpret conserved sequence patterns in proteins. The web application and SOAP service can be found free and open to all users without the need for a login on http://iomics.ugent.be/icelogoserver/main.html.

  20. Design of a zinc finger protein binding a sequence upstream of the A20 gene

    PubMed Central

    Wei, Yong; Ying, Dajun; Hou, Chunli; Cui, Xiaoping; Zhu, Chuhong

    2008-01-01

    Background: Artificial transcription factors (ATFs) are composed of DNA-binding and functional domains. These domains can be fused together to create proteins that can bind a chosen DNA sequence. To construct a valid ATF, it is necessary to design suitable DNA-binding and functional domains. The Cys2-His2 zinc finger motif is the ideal structural scaffold on which to construct a sequence-specific protein. A20 is a cytoplasmic zinc finger protein that inhibits nuclear factor kappa-B activity and tumor necrosis factor (TNF)-mediated programmed cell death. A20 has been shown to prevent TNF-induced cytotoxicity in a variety of cell types including fibroblasts, B lymphocytes, WEHI 164 cells, NIH 3T3 cells and endothelial cells. Results: In order to design a zinc finger protein (ZFP) structural domain that binds specific target sequences in the A20 gene promoter region, the structure and sequence composition of this promoter were analyzed by bioinformatics methods. The target sequences in the A20 promoter were submitted to the on-line ZF Tools server of the Barbas Laboratory, Scripps Research Institute (TSRI), to obtain a specific 18 bp target sequence and also the amino acid sequence of a ZFP that would bind to it. Sequence characterization and structural modeling of the predicted ZFP were performed by bioinformatics methods. The optimized DNA sequence of this artificial ZFP was recombined into the eukaryotic expression vector pIRES2-EGFP to construct pIRES2-EGFP/ZFP-flag recombinants, and the expression and biological activity of the ZFP were analyzed by RT-PCR, western blotting and EMSA, respectively. The ZFP was designed successfully and exhibited biological activity. Conclusion: It is feasible to design specific zinc finger proteins by bioinformatics methods. PMID:18366681

  1. The new sequencer on the block: comparison of Life Technology's Proton sequencer to an Illumina HiSeq for whole-exome sequencing.

    PubMed

    Boland, Joseph F; Chung, Charles C; Roberson, David; Mitchell, Jason; Zhang, Xijun; Im, Kate M; He, Ji; Chanock, Stephen J; Yeager, Meredith; Dean, Michael

    2013-10-01

    We assessed the performance of the new Life Technologies Proton sequencer by comparing whole-exome sequence data in a Centre d'Etude du Polymorphisme Humain trio (family 1463) to the Illumina HiSeq instrument. To simulate a typical user's results, we utilized the standard capture, alignment and variant calling methods specific to each platform. We restricted data analysis to include the capture region common to both methods. The Proton produced high quality data at a comparable average depth and read length, and the Ion Reporter variant caller identified 96% of single nucleotide polymorphisms (SNPs) detected by the HiSeq and GATK pipeline. However, only 40% of small insertion and deletion variants (indels) were identified by both methods. Usage of the trio structure and segregation of platform-specific alleles supported this result. Further comparison of the trio data with Complete Genomics sequence data and Illumina SNP microarray genotypes documented high concordance and accurate SNP genotyping of both Proton and Illumina platforms. However, our study underscored the problem of accurate detection of indels for both the Proton and HiSeq platforms.

  2. Maps, codes, and sequence elements: can we predict the protein output from an alternatively spliced locus?

    PubMed

    Sharma, Shalini; Black, Douglas L

    2006-11-22

    Alternative splicing choices are governed by splicing regulatory protein interactions with splicing silencer and enhancer elements present in the pre-mRNA. However, the prediction of these choices from genomic sequence is difficult, in part because the regulators can act as either enhancers or silencers. A recent study describes how for a particular neuronal splicing regulatory protein, Nova, the location of its binding sites is highly predictive of the protein's effect on an exon's splicing.

  3. Identification of Human MicroRNA-Like Sequences Embedded within the Protein-Encoding Genes of the Human Immunodeficiency Virus

    PubMed Central

    Holland, Bryan; Wong, Jonathan; Li, Meng; Rasheed, Suraiya

    2013-01-01

    Background: MicroRNAs (miRNAs) are highly conserved, short (18–22 nts), non-coding RNA molecules that regulate gene expression by binding to the 3′ untranslated regions (3′UTRs) of mRNAs. While numerous cellular microRNAs have been associated with the progression of various diseases including cancer, miRNAs associated with retroviruses have not been well characterized. Herein we report identification of microRNA-like sequences in coding regions of several HIV-1 genomes. Results: Based on our earlier proteomics and bioinformatics studies, we have identified 8 cellular miRNAs that are predicted to bind to the mRNAs of multiple proteins that are dysregulated during HIV-infection of CD4+ T-cells in vitro. In silico analysis of the full length and mature sequences of these 8 miRNAs and comparisons with all the genomic and subgenomic sequences of HIV-1 strains in global databases revealed that the first 18/18 sequences of the mature hsa-miR-195 sequence (including the short seed sequence), matched perfectly (100%), or with one nucleotide mismatch, within the envelope (env) genes of five HIV-1 genomes from Africa. In addition, we have identified 4 other miRNA-like sequences (hsa-miR-30d, hsa-miR-30e, hsa-miR-374a and hsa-miR-424) within the env and the gag-pol encoding regions of several HIV-1 strains, albeit with reduced homology. Mapping of the miRNA-homologues of env within HIV-1 genomes localized these sequences to the functionally significant variable regions of the env glycoprotein gp120 designated V1, V2, V4 and V5. Conclusions: We conclude that microRNA-like sequences are embedded within the protein-encoding regions of several HIV-1 genomes. Given that the V1 to V5 regions of HIV-1 envelopes contain specific, well-characterized domains that are critical for immune responses, virus neutralization and disease progression, we propose that the newly discovered miRNA-like sequences within the HIV-1 genomes may have evolved to self-regulate survival of the virus in

  4. Combining phage display with de novo protein sequencing for reverse engineering of monoclonal antibodies.

    PubMed

    Rickert, Keith W; Grinberg, Luba; Woods, Robert M; Wilson, Susan; Bowen, Michael A; Baca, Manuel

    2016-01-01

    The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identify one that can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially-obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3-5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited identical antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material.

  5. Sequence analysis and protein import studies of an outer chloroplast envelope polypeptide.

    PubMed Central

    Salomon, M; Fischer, K; Flügge, U I; Soll, J

    1990-01-01

    A chloroplast outer envelope membrane protein was cloned and sequenced, and from the sequence it was possible to deduce a polypeptide of 6.7 kDa. It has only one membrane-spanning region; the C terminus extends into the cytosol, whereas the N terminus is exposed to the space between the two envelope membranes. The protein was synthesized in an in vitro transcription-translation system to study its routing into isolated chloroplasts. The import studies revealed that the 6.7-kDa protein followed a different and heretofore undescribed translocation pathway in the respect that (i) it does not have a cleavable transit sequence, (ii) it does not require ATP hydrolysis for import, and (iii) protease-sensitive components that are responsible for recognition of precursor proteins destined for the inside of the chloroplasts are not involved in routing the 6.7-kDa polypeptide to the outer chloroplast envelope. PMID:2377616

  6. Prediction of high-risk types of human papillomaviruses using statistical model of protein "sequence space".

    PubMed

    Wang, Cong; Hai, Yabing; Liu, Xiaoqing; Liu, Nanfang; Yao, Yuhua; He, Pingan; Dai, Qi

    2015-01-01

    Discrimination of high-risk types of human papillomaviruses plays an important role in the diagnosis and remedy of cervical cancer. Recently, several computational methods have been proposed based on protein sequence-based and structure-based information, but the information of their related proteins has not been used until now. In this paper, we proposed using protein "sequence space" to explore this information and used it to predict high-risk types of HPVs. The proposed method was tested on 68 samples with known HPV types and 4 samples without HPV types and further compared with the available approaches. The results show that the proposed method achieved the best performance among all the evaluated methods with accuracy 95.59% and F1-score 90.91%, which indicates that protein "sequence space" could potentially be used to improve prediction of high-risk types of HPVs.

  7. Cloud computing for protein-ligand binding site comparison.

    PubMed

    Hung, Che-Lun; Hua, Guan-Jie

    2013-01-01

    The proteome-wide analysis of protein-ligand binding sites and their interactions with ligands is important in structure-based drug design and in understanding ligand cross reactivity and toxicity. The well-known and commonly used software, SMAP, has been designed for 3D ligand binding site comparison and similarity searching of a structural proteome. SMAP can also predict drug side effects and reassign existing drugs to new indications. However, the computing scale of SMAP is limited. We have developed a high availability, high performance system that expands the comparison scale of SMAP. This cloud computing service, called Cloud-PLBS, combines the SMAP and Hadoop frameworks and is deployed on a virtual cloud computing platform. To handle the vast amount of experimental data on protein-ligand binding site pairs, Cloud-PLBS exploits the MapReduce paradigm as a management and parallelizing tool. Cloud-PLBS provides a web portal and scalability through which biologists can address a wide range of computer-intensive questions in biology and drug discovery. PMID:23762824

  8. Cloud computing for protein-ligand binding site comparison.

    PubMed

    Hung, Che-Lun; Hua, Guan-Jie

    2013-01-01

    The proteome-wide analysis of protein-ligand binding sites and their interactions with ligands is important in structure-based drug design and in understanding ligand cross reactivity and toxicity. The well-known and commonly used software, SMAP, has been designed for 3D ligand binding site comparison and similarity searching of a structural proteome. SMAP can also predict drug side effects and reassign existing drugs to new indications. However, the computing scale of SMAP is limited. We have developed a high availability, high performance system that expands the comparison scale of SMAP. This cloud computing service, called Cloud-PLBS, combines the SMAP and Hadoop frameworks and is deployed on a virtual cloud computing platform. To handle the vast amount of experimental data on protein-ligand binding site pairs, Cloud-PLBS exploits the MapReduce paradigm as a management and parallelizing tool. Cloud-PLBS provides a web portal and scalability through which biologists can address a wide range of computer-intensive questions in biology and drug discovery.

  9. High-Resolution Sequence-Function Mapping of Full-Length Proteins

    PubMed Central

    Kowalsky, Caitlin A.; Klesmith, Justin R.; Stapleton, James A.; Kelly, Vince; Reichkitzer, Nolan; Whitehead, Timothy A.

    2015-01-01

    Comprehensive sequence-function mapping involves detailing the fitness contribution of every possible single mutation to a gene by comparing the abundance of each library variant before and after selection for the phenotype of interest. Deep sequencing of library DNA allows frequency reconstruction for tens of thousands of variants in a single experiment, yet the short read lengths of current sequencers make it challenging to probe genes encoding full-length proteins. Here we extend the scope of sequence-function maps to entire protein sequences with a modular, universal sequence tiling method. We demonstrate the approach with both growth-based selections and FACS screening, offer parameters and best practices that simplify design of experiments, and present analytical solutions to normalize data across independent selections. Using this protocol, sequence-function maps covering full sequences can be obtained in four to six weeks. Best practices introduced in this manuscript are fully compatible with, and complementary to, other recently published sequence-function mapping protocols. PMID:25790064
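
    The comparison of variant abundance before and after selection can be illustrated with a log2 enrichment ratio, a common way to score such data. The sketch below is a generic example, not the normalization derived in the paper: the function name `log2_enrichment`, the pseudocount, and the variant labels are all hypothetical.

```python
import math


def log2_enrichment(pre_counts, post_counts, pseudocount=0.5):
    """Log2 ratio of each variant's frequency after vs. before selection.

    pre_counts / post_counts: {variant_name: read_count}.
    The pseudocount (an assumed value) avoids division by zero for
    variants that drop out of the selected population.
    """
    pre_total = sum(pre_counts.values())
    post_total = sum(post_counts.values())
    scores = {}
    for variant, pre in pre_counts.items():
        post = post_counts.get(variant, 0)
        pre_freq = (pre + pseudocount) / pre_total
        post_freq = (post + pseudocount) / post_total
        scores[variant] = math.log2(post_freq / pre_freq)
    return scores


# Hypothetical read counts for two single-mutation variants.
before = {"A10G": 100, "L25P": 100}
after = {"A10G": 150, "L25P": 50}
scores = log2_enrichment(before, after)
```

    In this toy input the first variant scores above zero (enriched) and the second below zero (depleted); real data sets would also need the cross-selection normalization the abstract mentions.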

  10. PredPPCrys: Accurate Prediction of Sequence Cloning, Protein Production, Purification and Crystallization Propensity from Protein Sequences Using Multi-Step Heterogeneous Feature Fusion and Selection

    PubMed Central

    Wang, Huilin; Wang, Mingjun; Tan, Hao; Li, Yuan; Zhang, Ziding; Song, Jiangning

    2014-01-01

    X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed ‘PredPPCrys’ using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. In addition, the predicted crystallization targets of

  11. Computational Framework for Prediction of Peptide Sequences That May Mediate Multiple Protein Interactions in Cancer-Associated Hub Proteins.

    PubMed

    Sarkar, Debasree; Patra, Piya; Ghosh, Abhirupa; Saha, Sudipto

    2016-01-01

    A considerable proportion of protein-protein interactions (PPIs) in the cell are estimated to be mediated by very short peptide segments that approximately conform to specific sequence patterns known as linear motifs (LMs), often present in the disordered regions in the eukaryotic proteins. These peptides have been found to interact with low affinity and are able to bind to multiple interactors, thus playing an important role in the PPI networks involving date hubs. In this work, PPI data and a de novo motif identification method (MEME) were used to identify such peptides in three cancer-associated hub proteins: MYC, APC and MDM2. The peptides corresponding to the significant LMs identified for each hub protein were aligned, the overlapping regions across these peptides being termed overlapping linear peptides (OLPs). These OLPs were thus predicted to be responsible for multiple PPIs of the corresponding hub proteins and a scoring system was developed to rank them. We predicted six OLPs in MYC and five OLPs in MDM2 that scored higher than OLP predictions from randomly generated protein sets. Two OLP sequences from the C-terminal of MYC were predicted to bind with FBXW7, a component of an E3 ubiquitin-protein ligase complex involved in proteasomal degradation of MYC. Similarly, we identified peptides in the C-terminal of MDM2 interacting with FKBP3, which has a specific role in auto-ubiquitinylation of MDM2. The peptide sequences predicted in MYC and MDM2 look promising for designing orthosteric inhibitors against possible disease-associated PPIs. Since these OLPs can interact with other proteins as well, these inhibitors should be specific to the targeted interactor to prevent undesired side-effects. This computational framework has been designed to predict and rank the peptide regions that may mediate multiple PPIs and can be applied to other disease-associated date hub proteins for prediction of novel therapeutic targets of small molecule PPI modulators. PMID

  12. Exploring Sequence Characteristics Related to High-Level Production of Secreted Proteins in Aspergillus niger

    PubMed Central

    van den Berg, Bastiaan A.; Reinders, Marcel J. T.; Hulsman, Marc; Wu, Liang; Pel, Herman J.; Roubos, Johannes A.; de Ridder, Dick

    2012-01-01

    Protein sequence features are explored in relation to the production of over-expressed extracellular proteins by fungi. Knowledge on features influencing protein production and secretion could be employed to improve enzyme production levels in industrial bioprocesses via protein engineering. A large set, over 600 homologous and nearly 2,000 heterologous fungal genes, were overexpressed in Aspergillus niger using a standardized expression cassette and scored for high versus no production. Subsequently, sequence-based machine learning techniques were applied for identifying relevant DNA and protein sequence features. The amino-acid composition of the protein sequence was found to be most predictive and interpretation revealed that, for both homologous and heterologous gene expression, the same features are important: tyrosine and asparagine composition was found to have a positive correlation with high-level production, whereas for unsuccessful production, contributions were found for methionine and lysine composition. The predictor is available online at http://bioinformatics.tudelft.nl/hipsec. Subsequent work aims at validating these findings by protein engineering as a method for increasing expression levels per gene copy. PMID:23049690
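
    Amino-acid composition, the feature type this abstract reports as most predictive, is simply the fraction of each residue in a sequence. A minimal sketch of how such a feature vector might be computed is given below; the function name `aa_composition` and the handling of non-standard characters are assumptions, and the classifier used by the authors is not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the fraction of each of the 20 standard amino acids.

    Characters outside the standard alphabet are ignored (an assumed
    convention for this illustration).
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


features = aa_composition("YYNNMK")
```

    The resulting 20-value dictionary could serve as input to a sequence-based classifier, with the tyrosine, asparagine, methionine and lysine entries being the ones the abstract singles out.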

  13. 100% protein sequence coverage: a modern form of surrealism in proteomics.

    PubMed

    Meyer, Bjoern; Papasotiriou, Dimitrios G; Karas, Michael

    2011-07-01

    This review intends not only to discuss the current possibilities to gain 100% sequence coverage for proteins, but also to point out the critical limits to such an attempt. The aim of 100% sequence coverage, as the review title already implies, seems to be rather surreal if the complexity and dynamic range of a proteome is taken into consideration. Nevertheless, established bottom-up shotgun approaches are able to roughly identify a complete proteome, as has been shown for yeast. However, this procedure largely ignores the fact that a protein is present as various protein species. The unambiguous identification of protein species requires 100% sequence coverage. Furthermore, the separation of the proteome must be performed on the protein species and not on the peptide level. Therefore, top-down is a good strategy for protein species analysis. Classical 2D-electrophoresis followed by an enzymatic or chemical cleavage, which is a combination of top-down and bottom-up, is another interesting approach. Moreover, the review summarizes further top-down and bottom-up combinations and to which extent middle-down improves the identification of protein species. The attention is also focused on cleavage strategies other than trypsin, as 100% sequence coverage in bottom-up experiments is only obtainable with a combination of cleavage reagents. PMID:20625782
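
    Sequence coverage, the quantity this review is concerned with, is the percentage of a protein's residues that fall inside at least one identified peptide. The sketch below shows one simple way to compute it by exact string matching; the function name `sequence_coverage` and the example peptides are hypothetical, and real search engines handle modifications and ambiguous matches that this illustration ignores.

```python
def sequence_coverage(protein, peptides):
    """Percentage of protein residues covered by at least one peptide.

    Peptides are located by exact substring matching (an assumption);
    every occurrence of a peptide in the protein is counted.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


coverage = sequence_coverage("MKVLAAGIDE", ["MKV", "GIDE"])
print(coverage)  # 70.0
```

    In this toy case the residues LAA are never observed, so the two peptides together cannot distinguish protein species that differ only in that stretch, which is the limitation the review highlights.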

  14. Interspecific sequence comparison of the muscle-myosin heavy-chain genes from Drosophila hydei and Drosophila melanogaster.

    PubMed

    Miedema, K; Harhangi, H; Mentzel, S; Wilbrink, M; Akhmanova, A; Hooiveld, M; Bindels, P; Hennig, W

    1994-10-01

    The muscle-myosin heavy-chain (mMHC) gene of Drosophila hydei has been sequenced completely (size 23.3 kb). The sequence comparison with the D. melanogaster mMHC gene revealed that the exon-intron pattern is identical. The protein coding regions show a high degree of conservation (97%). The alternatively spliced exons (3a-b, 7a-d, 9a-c, 11a-e, and 15a-b) display more variations in the number of nonsynonymous and synonymous substitutions than the common exons (2, 4, 5, 6, 8, 10, 12, 13, 14, 16, 17, and 19). The base composition at synonymous sites of fourfold degenerate codons (third position) is not biased in the alternative exons. In the common exons there exists a bias for C and against A. These findings imply that the alternative exons of the Drosophila mMHC gene evolve at a different, in several cases higher, rate than the common ones. The 5' splice junctions and 5' and 3' untranslated regions show a high level of similarity, indicating a functional constraint on these sequences. The intron regions vary considerably in length within one species, but the corresponding introns are very similar in length between the two species and all contain stretches of sequence similarity. A particular example is the first intron, which contains multiple regions of similarity. In the conserved regions of intron 12 (head-tail border) sequences were found which have the potential to direct another smaller mMHC transcript.

  15. Nucleotide sequence of the DNA polymerase gene of herpes simplex virus type 2 and comparison with the type 1 counterpart.

    PubMed

    Tsurumi, T; Maeno, K; Nishiyama, Y

    1987-01-01

    The complete nucleotide sequence of the DNA polymerase gene of herpes simplex virus (HSV) type 2 strain 186 has been determined. The gene included a 3720-bp major open reading frame capable of encoding 1240 amino acids. The predicted primary translation product had an Mr of 137,354, which was slightly larger than its HSV-1 counterpart. A comparison of the predicted functional amino acid sequences of the HSV-1 and HSV-2 DNA polymerases revealed 95.5% overall amino acid homology, the highest value among the known polypeptides encoded by HSV-1 and HSV-2. The functional amino acid changes were spread over the N-terminal one-third of the protein, whereas the C-terminal two-thirds were almost identical between the two types except for a particular hydrophilic region. A highly conserved sequence of 6 aa, YGDTDS, which has been observed in DNA polymerases of HSV-1, Epstein-Barr virus, adenovirus, and vaccinia virus, was also present at positions 889 to 894 in the C-terminal region of HSV-2 DNA polymerase.
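
    The arithmetic in this abstract (a 3720-bp reading frame corresponding to 1240 codons) and the idea of reporting a conserved motif by its residue positions can be illustrated with a minimal Python sketch. The function names and the toy peptide are hypothetical and are not taken from the paper; the sketch does not address whether a stop codon is counted among the 1240 positions.

```python
def orf_codons(orf_length_bp: int) -> int:
    """Number of codons in an open reading frame of the given length."""
    if orf_length_bp % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    return orf_length_bp // 3


def find_motif(protein: str, motif: str) -> list[tuple[int, int]]:
    """Return 1-based (start, end) positions of every occurrence of motif."""
    hits = []
    start = protein.find(motif)
    while start != -1:
        hits.append((start + 1, start + len(motif)))
        start = protein.find(motif, start + 1)
    return hits


# Toy peptide (hypothetical, NOT the HSV-2 polymerase sequence).
toy_protein = "MAAYGDTDSLK"

print(orf_codons(3720))                   # 3720 / 3 codons
print(find_motif(toy_protein, "YGDTDS"))  # positions of the motif in the toy peptide
```

    Applied to the real polymerase sequence, the same search would be expected to report the span 889 to 894 given in the abstract, but that sequence is not reproduced here.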

  16. A Comparison of the First Two Sequenced Chloroplast Genomes in Asteraceae: Lettuce and Sunflower

    SciTech Connect

    Timme, Ruth E.; Kuehl, Jennifer V.; Boore, Jeffrey L.; Jansen, Robert K.

    2006-01-20

    Asteraceae is the second largest family of plants, with over 20,000 species. For the past few decades, numerous phylogenetic studies have contributed to our understanding of the evolutionary relationships within this family, including comparisons of the fast evolving chloroplast gene, ndhF, rbcL, as well as non-coding DNA from the trnL intron plus the trnL-trnF intergenic spacer, matK, and, with lesser resolution, psbA-trnH. This culminated in a study by Panero and Funk in 2002 that used over 13,000 bp per taxon for the largest taxonomic revision of Asteraceae in over a hundred years. Still, some uncertainties remain, and it would be very useful to have more information on the relative rates of sequence evolution among various genes and on genome structure as a potential set of phylogenetic characters to help guide future phylogenetic structures. By way of contributing to this, we report the first two complete chloroplast genome sequences from members of the Asteraceae, those of Helianthus annuus and Lactuca sativa. These plants belong to two distantly related subfamilies, Asteroideae and Cichorioideae, respectively. In addition to these, there is only one other published chloroplast genome sequence for any plant within the larger group called Euasterids II, that of Panax ginseng (Araliaceae, 156,318 bps, AY582139). Early chloroplast genome mapping studies demonstrated that H. annuus and L. sativa share a 22 kb inversion relative to members of the subfamily Barnadesioideae. By comparison to outgroups, this inversion was shown to be derived, indicating that the Asteroideae and Cichorioideae are more closely related than either is to the Barnadesioideae. A later sequencing study found that taxa that share this 22 kb inversion also contain within this region a second, smaller, 3.3 kb inversion. These sequences also enable an analysis of patterns of shared repeats in the genomes at a fine level and of RNA editing by comparison to available EST sequences.
In addition, since

  17. Amino acid sequence and structural properties of protein p12, an African swine fever virus attachment protein.

    PubMed Central

    Alcamí, A; Angulo, A; López-Otín, C; Muñoz, M; Freije, J M; Carrascosa, A L; Viñuela, E

    1992-01-01

    The gene encoding the African swine fever virus protein p12, which is involved in virus attachment to the host cell, has been mapped and sequenced in the genome of the Vero-adapted virus strain BA71V. The determination of the N-terminal amino acid sequence and the hybridization of oligonucleotide probes derived from this sequence to cloned restriction fragments allowed the mapping of the gene in fragment EcoRI-O, located in the central region of the viral genome. The DNA sequence of an EcoRI-XbaI fragment showed an open reading frame which is predicted to encode a polypeptide of 61 amino acids. The expression of this open reading frame in rabbit reticulocyte lysates and in Escherichia coli gave rise to a 12-kDa polypeptide that was immunoprecipitated with a monoclonal antibody specific for protein p12. The hydrophilicity profile indicated the existence of a stretch of 22 hydrophobic residues in the central part that may anchor the protein in the virus envelope. Three forms of the protein with apparent molecular masses of 17, 12, and 10 kDa in sodium dodecyl sulfate-polyacrylamide gel electrophoresis have been observed, depending on the presence of 2-mercaptoethanol and alkylation with 4-vinylpyridine, indicating that disulfide bonds are responsible for the multimerization of the protein. This result was in agreement with the existence of a cysteine-rich domain in the C-terminal region of the predicted amino acid sequence. The protein was synthesized at late times of infection, and no posttranslational modifications such as glycosylation, phosphorylation, or fatty acid acylation were detected. Images PMID:1583732

  18. Orpinomyces cellulase CelE protein and coding sequences

    DOEpatents

    Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong

    2000-08-29

    A cDNA designated celE cloned from Orpinomyces PC-2 encodes a polypeptide (CelE) of 477 amino acids. CelE is highly homologous to CelB of Orpinomyces (72.3% identity) and Neocallimastix (67.9% identity), and like them, it has a non-catalytic repeated peptide domain (NCRPD) at the C-terminal end. The catalytic domain of CelE is homologous to glycosyl hydrolases of Family 5, found in several anaerobic bacteria. The celE gene is devoid of introns. The recombinant proteins CelE and CelB of Orpinomyces PC-2 randomly hydrolyze carboxymethylcellulose and cello-oligosaccharides in the pattern of endoglucanases.

  19. The MPI Bioinformatics Toolkit for protein sequence analysis

    PubMed Central

    Biegert, Andreas; Mayer, Christian; Remmert, Michael; Söding, Johannes; Lupas, Andrei N.

    2006-01-01

    The MPI Bioinformatics Toolkit is an interactive web service which offers access to a great variety of public and in-house bioinformatics tools. They are grouped into different sections that support sequence searches, multiple alignment, secondary and tertiary structure prediction and classification. Several public tools are offered in customized versions that extend their functionality. For example, PSI-BLAST can be run against regularly updated standard databases, customized user databases or selectable sets of genomes. Another tool, Quick2D, integrates the results of various secondary structure, transmembrane and disorder prediction programs into one view. The Toolkit provides a friendly and intuitive user interface with an online help facility. As a key feature, various tools are interconnected so that the results of one tool can be forwarded to other tools. One could run PSI-BLAST, parse out a multiple alignment of selected hits and send the results to a cluster analysis tool. The Toolkit framework and the tools developed in-house will be packaged and freely available under the GNU Lesser General Public Licence (LGPL). The Toolkit can be accessed at . PMID:16845021

  20. The MPI Bioinformatics Toolkit for protein sequence analysis.

    PubMed

    Biegert, Andreas; Mayer, Christian; Remmert, Michael; Söding, Johannes; Lupas, Andrei N

    2006-07-01

    The MPI Bioinformatics Toolkit is an interactive web service which offers access to a great variety of public and in-house bioinformatics tools. They are grouped into different sections that support sequence searches, multiple alignment, secondary and tertiary structure prediction and classification. Several public tools are offered in customized versions that extend their functionality. For example, PSI-BLAST can be run against regularly updated standard databases, customized user databases or selectable sets of genomes. Another tool, Quick2D, integrates the results of various secondary structure, transmembrane and disorder prediction programs into one view. The Toolkit provides a friendly and intuitive user interface with an online help facility. As a key feature, various tools are interconnected so that the results of one tool can be forwarded to other tools. One could run PSI-BLAST, parse out a multiple alignment of selected hits and send the results to a cluster analysis tool. The Toolkit framework and the tools developed in-house will be packaged and freely available under the GNU Lesser General Public Licence (LGPL). The Toolkit can be accessed at http://toolkit.tuebingen.mpg.de.

  1. A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins

    NASA Astrophysics Data System (ADS)

    Sawle, Lucas; Ghosh, Kingshuk

    2015-08-01

    A general formalism to compute configurational properties of proteins and other heteropolymers with an arbitrary sequence of charges and non-uniform excluded volume interaction is presented. A variational approach is utilized to predict average distance between any two monomers in the chain. The presented analytical model, for the first time, explicitly incorporates the role of sequence charge distribution to determine relative sizes between two sequences that vary not only in total charge composition but also in charge decoration (even when charge composition is fixed). Furthermore, the formalism is general enough to allow variation in excluded volume interactions between two monomers. Model predictions are benchmarked against the all-atom Monte Carlo studies of Das and Pappu [Proc. Natl. Acad. Sci. U. S. A. 110, 13392 (2013)] for 30 different synthetic sequences of polyampholytes. These sequences possess an equal number of glutamic acid (E) and lysine (K) residues but differ in the patterning within the sequence. Without any fit parameter, the model captures the strong sequence dependence of the simulated values of the radius of gyration with a correlation coefficient of R² = 0.9. The model is then applied to real proteins to compare the unfolded state dimensions of 540 orthologous pairs of thermophilic and mesophilic proteins. The excluded volume parameters are assumed similar under denatured conditions, and only electrostatic effects encoded in the sequence are accounted for. With these assumptions, thermophilic proteins are found, with high statistical significance, to have more compact disordered ensembles compared to their mesophilic counterparts. The method presented here, due to its analytical nature, is capable of such high-throughput analysis of multiple proteins and will have broad applications in proteomic studies as well as in other heteropolymeric systems.

  2. A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins

    SciTech Connect

    Sawle, Lucas; Ghosh, Kingshuk

    2015-08-28

    A general formalism to compute configurational properties of proteins and other heteropolymers with an arbitrary sequence of charges and non-uniform excluded volume interaction is presented. A variational approach is utilized to predict average distance between any two monomers in the chain. The presented analytical model, for the first time, explicitly incorporates the role of sequence charge distribution to determine relative sizes between two sequences that vary not only in total charge composition but also in charge decoration (even when charge composition is fixed). Furthermore, the formalism is general enough to allow variation in excluded volume interactions between two monomers. Model predictions are benchmarked against the all-atom Monte Carlo studies of Das and Pappu [Proc. Natl. Acad. Sci. U. S. A. 110, 13392 (2013)] for 30 different synthetic sequences of polyampholytes. These sequences possess an equal number of glutamic acid (E) and lysine (K) residues but differ in the patterning within the sequence. Without any fit parameter, the model captures the strong sequence dependence of the simulated values of the radius of gyration with a correlation coefficient of R² = 0.9. The model is then applied to real proteins to compare the unfolded state dimensions of 540 orthologous pairs of thermophilic and mesophilic proteins. The excluded volume parameters are assumed similar under denatured conditions, and only electrostatic effects encoded in the sequence are accounted for. With these assumptions, thermophilic proteins are found, with high statistical significance, to have more compact disordered ensembles compared to their mesophilic counterparts. The method presented here, due to its analytical nature, is capable of such high-throughput analysis of multiple proteins and will have broad applications in proteomic studies as well as in other heteropolymeric systems.

  3. EST-PAC a web package for EST annotation and protein sequence prediction

    PubMed Central

    Strahm, Yvan; Powell, David; Lefèvre, Christophe

    2006-01-01

    With the decreasing cost of DNA sequencing technology and the vast diversity of biological resources, researchers increasingly face the basic challenge of annotating a larger number of expressed sequence tags (ESTs) from a variety of species. This typically consists of a series of repetitive tasks, which should be automated and easy to use. The results of these annotation tasks need to be stored and organized in a consistent way. All these operations should be self-installing, platform independent, easy to customize and amenable to using distributed bioinformatics resources available on the Internet. In order to address these issues, we present EST-PAC, a web-oriented multi-platform software package for expressed sequence tag (EST) annotation. EST-PAC provides a solution for the administration of EST and protein sequence annotations accessible through a web interface. Three aspects of EST annotation are automated: 1) searching local or remote biological databases for sequence similarities using Blast services, 2) predicting protein coding sequence from EST data and 3) annotating predicted protein sequences with functional domain predictions. In practice, EST-PAC integrates the BLASTALL suite, EST-Scan2 and HMMER in a relational database system accessible through a simple web interface. EST-PAC also takes advantage of the relational database to allow consistent storage, powerful queries of results and management of the annotation process. The system allows users to customize annotation strategies and provides an open-source data-management environment for research and education in bioinformatics. PMID:17147782

  4. Prediction of glutathionylation sites in proteins using minimal sequence information and their experimental validation.

    PubMed

    Pal, Debojyoti; Sharma, Deepak; Kumar, Mukesh; Sandur, Santosh K

    2016-09-01

    S-glutathionylation of proteins plays an important role in various biological processes and is known to be a protective modification during oxidative stress. Since experimental detection of S-glutathionylation is labor intensive and time consuming, a bioinformatics-based approach is a viable alternative. Available methods require relatively longer sequence information, which may prevent prediction if sequence information is incomplete. Here, we present a model to predict glutathionylation sites from pentapeptide sequences. It is based upon differential association of amino acids with glutathionylated and non-glutathionylated cysteines from a database of experimentally verified sequences. These data were used to calculate position-dependent F-scores, which measure how a particular amino acid at a particular position may affect the likelihood of a glutathionylation event. A glutathionylation score (G-score), indicating the propensity of a sequence to undergo glutathionylation, was calculated using position-dependent F-scores for each amino acid. Cut-off values were used for prediction. Our model returned an accuracy of 58% with a Matthews correlation coefficient (MCC) value of 0.165. On an independent dataset, our model outperformed the currently available model, in spite of needing much less sequence information. Pentapeptide motifs having high abundance among glutathionylated proteins were identified. A list of potential glutathionylation hotspot sequences was obtained by assigning G-scores, and subsequent Protein-BLAST analysis revealed a total of 254 putative glutathionable proteins, a number of which were already known to be glutathionylated. Our model predicted glutathionylation sites in 93.93% of experimentally verified glutathionylated proteins. The outcome of this study may assist in discovering novel glutathionylation sites and finding candidate proteins for glutathionylation. PMID:27454891
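
    The two performance figures quoted in this abstract, accuracy and the Matthews correlation coefficient, are both computed from a binary confusion matrix. A minimal Python sketch of the standard definitions follows. The counts in the example are hypothetical and were chosen only for illustration; they are not the confusion matrix of the study, which the abstract does not report.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts (assumption, not data from the paper).
tp, tn, fp, fn = 60, 56, 44, 40
print(f"accuracy = {accuracy(tp, tn, fp, fn):.2f}")
print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```

    Because MCC uses all four cells of the confusion matrix, it stays close to zero for a near-random classifier even when accuracy looks moderate, which is why the two values are usually reported together.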

  5. DINAMO: a coupled sequence alignment editor/molecular graphics tool for interactive homology modeling of proteins.

    PubMed

    Hansen, M; Bentz, J; Baucom, A; Gregoret, L

    1998-01-01

    Gaining functional information about a novel protein is a universal problem in biomedical research. With the explosive growth of the protein sequence and structural databases, it is becoming increasingly common for researchers to attempt to build a three-dimensional model of their protein of interest in order to gain information about its structure and interactions with other molecules. The two most reliable methods for predicting the structure of a protein are homology modeling, in which the novel sequence is modeled on the known three-dimensional structure of a related protein, and fold recognition (threading), where the sequence is scored against a library of fold models, and the highest scoring model is selected. The sequence alignment to a known structure can be ambiguous, and human intervention is often required to optimize the model. We describe an interactive model building and assessment tool in which a sequence alignment editor is dynamically coupled to a molecular graphics display. By means of a set of assessment tools, the user may optimize his or her alignment to satisfy the known heuristics of protein structure. Adjustments to the sequence alignment made by the user are reflected in the displayed model by color and other visual cues. For instance, residues are colored by hydrophobicity in both the three-dimensional model and in the sequence alignment. This aids the user in identifying undesirable buried polar residues. Several different evaluation metrics may be selected including residue conservation, residue properties, and visualization of predicted secondary structure. These characteristics may be mapped to the model both singly and in combination. DINAMO is a Java-based tool that may be run either over the web or installed locally. Its modular architecture also allows Java-literate users to add plug-ins of their own design.

  6. Ab initio protein folding simulations using atomic burials as informational intermediates between sequence and structure.

    PubMed

    van der Linden, Marx Gomes; Ferreira, Diogo César; de Oliveira, Leandro Cristante; Onuchic, José N; de Araújo, Antônio F Pereira

    2014-07-01

    The three-dimensional structure of proteins is determined by their linear amino acid sequences but decipherment of the underlying protein folding code has remained elusive. Recent studies have suggested that burials, as expressed by atomic distances to the molecular center, are sufficiently informative for structural determination while potentially obtainable from sequences. Here we provide direct evidence for this distinctive role of burials in the folding code, demonstrating that burial propensities estimated from local sequence can indeed be used to fold globular proteins in ab initio simulations. We have used a statistical scheme based on a Hidden Markov Model (HMM) to classify all heavy atoms of a protein into a small number of burial atomic types depending on sequence context. Molecular dynamics simulations were then performed with a potential that forces all atoms of each type towards their predicted burial level, while simple geometric constraints were imposed on covalent structure and hydrogen bond formation. The correct folded conformation was obtained and distinguished in simulations that started from extended chains for a selection of structures comprising all three folding classes and high burial prediction quality. These results demonstrate that atomic burials can act as informational intermediates between sequence and structure, providing a new conceptual framework for improving structural prediction and understanding the fundamentals of protein folding.

  7. Silkmoth chorion proteins: sequence analysis of the products of a multigene family.

    PubMed Central

    Regier, J C; Kafatos, F C; Goodfliesh, R; Hood, L

    1978-01-01

    Five polypeptide components have been isolated from the eggshell (chorions) of a silkmoth. Two are homogeneous on sodium dodecyl sulfate and isoelectric focusing gels, and three contain predominantly two proteins each. Amino acid analyses show that all five components are similar to each other. These proteins have been sequenced from the amino terminus. Homogeneous components yielded single sequences; heterogeneous components yielded two residues at some positions, consistent with their containing two major electrophoretic components. Striking similarities are apparent among all these sequences. These similarities can be increased dramatically by separating each of the three protein mixtures into two sequences and introducing a small number of gaps or insertions. This is due in part to bringing into register a portion that contains short repeating subunits found in all sequences. All proteins are also characterized by a region of high cysteine content near the amino terminus followed by a longer low-cysteine region. The data suggest that these proteins share a common evolutionary origin and are encoded by a multigene family. Images PMID:272655

  8. Computational Framework for Prediction of Peptide Sequences That May Mediate Multiple Protein Interactions in Cancer-Associated Hub Proteins

    PubMed Central

    Sarkar, Debasree; Patra, Piya; Ghosh, Abhirupa; Saha, Sudipto

    2016-01-01

    A considerable proportion of protein-protein interactions (PPIs) in the cell are estimated to be mediated by very short peptide segments that approximately conform to specific sequence patterns known as linear motifs (LMs), often present in the disordered regions of eukaryotic proteins. These peptides have been found to interact with low affinity and are able to bind to multiple interactors, thus playing an important role in the PPI networks involving date hubs. In this work, PPI data and a de novo motif identification method (MEME) were used to identify such peptides in three cancer-associated hub proteins: MYC, APC and MDM2. The peptides corresponding to the significant LMs identified for each hub protein were aligned, the overlapping regions across these peptides being termed overlapping linear peptides (OLPs). These OLPs were thus predicted to be responsible for multiple PPIs of the corresponding hub proteins, and a scoring system was developed to rank them. We predicted six OLPs in MYC and five OLPs in MDM2 that scored higher than OLP predictions from randomly generated protein sets. Two OLP sequences from the C-terminal of MYC were predicted to bind with FBXW7, a component of an E3 ubiquitin-protein ligase complex involved in proteasomal degradation of MYC. Similarly, we identified peptides in the C-terminal of MDM2 interacting with FKBP3, which has a specific role in auto-ubiquitinylation of MDM2. The peptide sequences predicted in MYC and MDM2 look promising for designing orthosteric inhibitors against possible disease-associated PPIs. Since these OLPs can interact with other proteins as well, these inhibitors should be specific to the targeted interactor to prevent undesired side-effects. This computational framework has been designed to predict and rank the peptide regions that may mediate multiple PPIs and can be applied to other disease-associated date hub proteins for prediction of novel therapeutic targets of small molecule PPI modulators. PMID

  9. Understanding sequence similarity and framework analysis between centromere proteins using computational biology.

    PubMed

    Doss, C George Priya; Chakrabarty, Chiranjib; Debajyoti, C; Debottam, S

    2014-11-01

    Certain open questions concerning the recruitment pathways, cell cycle regulation mechanisms, spindle checkpoint assembly, and chromosome segregation process of centromere proteins are a focus of attention in cancer research. In modern times, established databases and a range of computational platforms have made it possible to examine almost all the physiological and biochemical evidence in disease-associated phenotypes. Using existing computational methods, we have utilized the amino acid residues to understand the similarity within the evolutionary variance of different associated centromere proteins. This study of sequence similarity, protein-protein networking, co-expression analysis, and the evolutionary trajectory of centromere proteins will speed up the understanding of centromere biology and will create a road map for upcoming researchers who are initiating their work on clinical sequencing using centromere proteins.

  10. Molecular design of performance proteins with repetitive sequences: recombinant flagelliform spider silk as basis for biomaterials.

    PubMed

    Vendrely, Charlotte; Ackerschott, Christian; Römer, Lin; Scheibel, Thomas

    2008-01-01

    Most performance proteins responsible for the mechanical stability of cells and organisms reveal highly repetitive sequences. Mimicking such performance proteins is of high interest for the design of nanostructured biomaterials. In this article, flagelliform silk is introduced as an example to describe a general principle for designing genes of repetitive performance proteins for recombinant expression in Escherichia coli. In the first step, repeating amino acid sequence motifs are back-translated into DNA cassettes, which can in a second step be seamlessly ligated, yielding a designed gene. Recombinant expression thereof leads to proteins mimicking the natural ones. The recombinant proteins can be assembled into nanostructured materials in a controlled manner, allowing their use in several applications. PMID:19031057
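
    The two-step design principle described in this abstract (convert each repeating amino acid motif into a DNA cassette, then join cassettes into one gene) can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the motif GPGGA, the one-codon-per-residue table, the added start and stop codons, and the modelling of seamless ligation as plain string concatenation. None of these details come from the article, and the codon table is not a claim about optimal codon usage in E. coli.

```python
# Illustrative one-codon-per-residue table (assumption, not from the article).
CODON_CHOICE = {"G": "GGT", "P": "CCG", "A": "GCG", "S": "AGC", "Y": "TAT"}


def back_translate(motif: str) -> str:
    """Convert an amino acid motif into a DNA cassette, one codon per residue."""
    return "".join(CODON_CHOICE[aa] for aa in motif)


def build_gene(motifs: list[str], repeats: int) -> str:
    """Join the cassettes for all motifs, repeat the module, add start and stop."""
    module = "".join(back_translate(m) for m in motifs)
    return "ATG" + module * repeats + "TAA"


def translate(dna: str) -> str:
    """Translate the designed gene back to protein to check the design."""
    codon_to_aa = {codon: aa for aa, codon in CODON_CHOICE.items()}
    codon_to_aa["ATG"] = "M"
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == "TAA":  # stop codon ends translation
            break
        protein.append(codon_to_aa[codon])
    return "".join(protein)


gene = build_gene(["GPGGA"], repeats=2)  # hypothetical toy motif
print(gene)
print(translate(gene))
```

    Translating the designed gene back to protein is a simple consistency check that the cassettes were joined in frame.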

  11. Using CATH-Gene3D to Analyze the Sequence, Structure, and Function of Proteins.

    PubMed

    Sillitoe, Ian; Lewis, Tony; Orengo, Christine

    2015-01-01

    The CATH database is a classification of protein structures found in the Protein Data Bank (PDB). Protein structures are chopped into individual units of structural domains, and these domains are grouped together into superfamilies if there is sufficient evidence that they have diverged from a common ancestor during the process of evolution. A sister resource, Gene3D, extends this information by scanning sequence profiles of these CATH domain superfamilies against many millions of known proteins to identify related sequences. Thus the combined CATH-Gene3D resource provides confident predictions of the likely structural fold, domain organisation, and evolutionary relatives of these proteins. In addition, this resource incorporates annotations from a large number of external databases such as known enzyme active sites, GO molecular functions, physical interactions, and mutations. This unit details how to access and understand the information contained within the CATH-Gene3D Web pages, the downloadable data files, and the remotely accessible Web services.

  12. Sequence co-evolution gives 3D contacts and structures of protein complexes

    PubMed Central

    Hopf, Thomas A; Schärfe, Charlotta P I; Rodrigues, João P G L M; Green, Anna G; Kohlbacher, Oliver; Sander, Chris; Bonvin, Alexandre M J J; Marks, Debora S

    2014-01-01

    Protein–protein interactions are fundamental to many biological processes. Experimental screens have identified tens of thousands of interactions, and structural biology has provided detailed functional insight for select 3D protein complexes. An alternative rich source of information about protein interactions is the evolutionary sequence record. Building on earlier work, we show that analysis of correlated evolutionary sequence changes across proteins identifies residues that are close in space with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We evaluate prediction performance in blinded tests on 76 complexes of known 3D structure, predict protein–protein contacts in 32 complexes of unknown structure, and demonstrate how evolutionary couplings can be used to distinguish between interacting and non-interacting protein pairs in a large complex. With the current growth of sequences, we expect that the method can be generalized to genome-wide elucidation of protein–protein interaction networks and used for interaction predictions at residue resolution. DOI: http://dx.doi.org/10.7554/eLife.03430.001 PMID:25255213

  13. Isolation and N-terminal sequencing of a novel cadmium-binding protein from Boletus edulis

    NASA Astrophysics Data System (ADS)

    Collin-Hansen, C.; Andersen, R. A.; Steinnes, E.

    2003-05-01

    A Cd-binding protein was isolated from the popular edible mushroom Boletus edulis, which is a hyperaccumulator of both Cd and Hg. Wild-growing samples of B. edulis were collected from soils rich in Cd. Cd radiotracer was added to the crude protein preparation obtained from ethanol precipitation of heat-treated cytosol. Proteins were then further separated in two consecutive steps: gel filtration and anion exchange chromatography. In both steps the Cd radiotracer profile showed only one distinct peak, which corresponded well with the profiles of endogenous Cd obtained by atomic absorption spectrophotometry (AAS). Concentrations of the essential elements Cu and Zn were low in the protein fractions high in Cd. N-terminal sequencing performed on the Cd-binding protein fractions revealed a protein with a novel amino acid sequence, which contained aromatic amino acids as well as proline. Both the N-terminal sequencing and spectrofluorimetric analysis with EDTA and ABD-F (4-aminosulfonyl-7-fluoro-2,1,3-benzoxadiazole) failed to detect cysteine in the Cd-binding fractions. These findings indicate that the novel protein does not belong to the metallothionein family. The results suggest a role for the protein in Cd transport and storage, and they are of importance for toxicology and food chemistry, as well as for environmental protection.

  14. Ancient origin for Hawaiian Drosophilinae inferred from protein comparisons.

    PubMed Central

    Beverley, S M; Wilson, A C

    1985-01-01

    Immunological comparisons of a larval hemolymph protein enabled us to build a tree relating major groups of drosophiline flies in Hawaii to one another and to continental flies. The tree agrees in topology with that based on internal anatomy. Relative rate tests suggest that evolution of hemolymph proteins has been about as fast in Hawaii as on continents. Since the absolute rate of evolution of hemolymph proteins in continental flies is known, one can erect an approximate time scale for Hawaiian fly evolution. According to this scale, the Hawaiian fly fauna stems from a colonist that landed on the archipelago about 42 million years ago-i.e., before any of the present islands harboring drosophilines formed. This date fits with the geological history of the archipelago, which has witnessed the sequential rise and erosion of many islands during the past 70 million years. We discuss the bearing of the molecular time scale on views about rates of organismal evolution in the Hawaiian flies. PMID:3860822

  15. Myelin protein zero gene sequencing diagnoses Charcot-Marie-Tooth Type 1B disease

    SciTech Connect

    Su, Y.; Zhang, H.; Madrid, R.

    1994-09-01

    Charcot-Marie-Tooth disease (CMT), the most common genetic neuropathy, affects about 1 in 2600 people in Norway and is found worldwide. CMT Type 1 (CMT1) has slow nerve conduction with demyelinated Schwann cells. Autosomal dominant CMT Type 1B (CMT1B) results from mutations in the myelin protein zero gene which directs the synthesis of more than half of all Schwann cell protein. This gene was mapped to the chromosome 1q22-1q23.1 borderline by fluorescence in situ hybridization. The first 7 of 7 reported CMT1B mutations are unique. Thus the most effective means to identify CMT1B mutations in at-risk family members and fetuses is to sequence the entire coding sequence in dominant or sporadic CMT patients without the CMT1A duplication. Of the 19 primers used in 16 pairs to uniquely amplify the entire MPZ coding sequence, 6 primer pairs were used to amplify and sequence the 6 exons. The DyeDeoxy Terminator cycle sequencing method used with four different color fluorescent labels was superior to manual sequencing because it sequences more bases unambiguously from extracted genomic DNA samples within 24 hours. This protocol was used to test 28 CMT and Dejerine-Sottas patients without CMT1A gene duplication. Sequencing MPZ gene-specific amplified fragments identified 9 polymorphic sites within the 6 exons that encode the 248 amino acid MPZ protein. The large number of major CMT1B mutations identified by single strand sequencing are being verified by reverse strand sequencing and, when possible, by restriction enzyme analysis. This protocol can be used to distinguish CMT1B patients from other CMT phenotypes and to determine the CMT1B status of relatives both presymptomatically and prenatally.

  16. Integrating mRNA and protein sequencing enables the detection and quantitative profiling of natural protein sequence variants of Populus trichocarpa

    DOE PAGESBeta

    Abraham, Paul E.; Wang, Xiaojing; Ranjan, Priya; Zhang, Bing; Tuskan, Gerald A.; Robert L. Hettich; Nookaew, Intawat

    2015-10-20

    The availability of next-generation sequencing technologies has rapidly transformed our ability to link genotypes to phenotypes, and as such, promises to facilitate the dissection of genetic contribution to complex traits. Although discoveries of genetic associations will further our understanding of biology, once candidate variants have been identified, investigators are faced with the challenge of characterizing the functional effects on proteins encoded by such genes. Here we show how next-generation RNA sequencing data can be exploited to construct genotype-specific protein sequence databases, which provide a clearer picture of the molecular toolbox underlying cellular and organismal processes and their variation in a natural population. For this study, we used two individual genotypes (DENA-17-3 and VNDL-27-4) from a recent genome wide association (GWA) study of Populus trichocarpa, an obligate outcrosser that exhibits tremendous phenotypic variation across the natural population. This strategy allowed us to comprehensively catalogue proteins containing single amino acid polymorphisms (SAAPs) and insertions and deletions (INDELS). Based on large-scale identification of SAAPs, we profiled the frequency of 128 types of naturally occurring amino acid substitutions, with a subset of SAAPs occurring in regions of the genome having strong polymorphism patterns consistent with recent positive and/or divergent selection. In addition, we were able to explore the diploid landscape of Populus at the proteome-level, allowing the characterization of heterozygous variants.

  17. Integrating mRNA and protein sequencing enables the detection and quantitative profiling of natural protein sequence variants of Populus trichocarpa

    SciTech Connect

    Abraham, Paul E.; Wang, Xiaojing; Ranjan, Priya; Zhang, Bing; Tuskan, Gerald A.; Robert L. Hettich; Nookaew, Intawat

    2015-10-20

    The availability of next-generation sequencing technologies has rapidly transformed our ability to link genotypes to phenotypes, and as such, promises to facilitate the dissection of genetic contribution to complex traits. Although discoveries of genetic associations will further our understanding of biology, once candidate variants have been identified, investigators are faced with the challenge of characterizing the functional effects on proteins encoded by such genes. Here we show how next-generation RNA sequencing data can be exploited to construct genotype-specific protein sequence databases, which provide a clearer picture of the molecular toolbox underlying cellular and organismal processes and their variation in a natural population. For this study, we used two individual genotypes (DENA-17-3 and VNDL-27-4) from a recent genome wide association (GWA) study of Populus trichocarpa, an obligate outcrosser that exhibits tremendous phenotypic variation across the natural population. This strategy allowed us to comprehensively catalogue proteins containing single amino acid polymorphisms (SAAPs) and insertions and deletions (INDELS). Based on large-scale identification of SAAPs, we profiled the frequency of 128 types of naturally occurring amino acid substitutions, with a subset of SAAPs occurring in regions of the genome having strong polymorphism patterns consistent with recent positive and/or divergent selection. In addition, we were able to explore the diploid landscape of Populus at the proteome-level, allowing the characterization of heterozygous variants.

  18. Nucleotide binding database NBDB – a collection of sequence motifs with specific protein-ligand interactions

    PubMed Central

    Zheng, Zejun; Goncearenco, Alexander; Berezovsky, Igor N.

    2016-01-01

    NBDB database describes protein motifs, elementary functional loops (EFLs) that are involved in binding of nucleotide-containing ligands and other biologically relevant cofactors/coenzymes, including ATP, AMP, ATP, GMP, GDP, GTP, CTP, PAP, PPS, FMN, FAD(H), NAD(H), NADP, cAMP, cGMP, c-di-AMP and c-di-GMP, ThPP, THD, F-420, ACO, CoA, PLP and SAM. The database is freely available online at http://nbdb.bii.a-star.edu.sg. In total, NBDB contains data on 249 motifs that work in interactions with 24 ligands. Sequence profiles of EFL motifs were derived de novo from nonredundant Uniprot proteome sequences. Conserved amino acid residues in the profiles interact specifically with distinct chemical parts of nucleotide-containing ligands, such as nitrogenous bases, phosphate groups, ribose, nicotinamide, and flavin moieties. Each EFL profile in the database is characterized by a pattern of corresponding ligand–protein interactions found in crystallized ligand–protein complexes. NBDB database helps to explore the determinants of nucleotide and cofactor binding in different protein folds and families. NBDB can also detect fragments that match to profiles of particular EFLs in the protein sequence provided by user. Comprehensive information on sequence, structures, and interactions of EFLs with ligands provides a foundation for experimental and computational efforts on design of required protein functions. PMID:26507856

  19. Nucleotide binding database NBDB--a collection of sequence motifs with specific protein-ligand interactions.

    PubMed

    Zheng, Zejun; Goncearenco, Alexander; Berezovsky, Igor N

    2016-01-01

    NBDB database describes protein motifs, elementary functional loops (EFLs) that are involved in binding of nucleotide-containing ligands and other biologically relevant cofactors/coenzymes, including ATP, AMP, ATP, GMP, GDP, GTP, CTP, PAP, PPS, FMN, FAD(H), NAD(H), NADP, cAMP, cGMP, c-di-AMP and c-di-GMP, ThPP, THD, F-420, ACO, CoA, PLP and SAM. The database is freely available online at http://nbdb.bii.a-star.edu.sg. In total, NBDB contains data on 249 motifs that work in interactions with 24 ligands. Sequence profiles of EFL motifs were derived de novo from nonredundant Uniprot proteome sequences. Conserved amino acid residues in the profiles interact specifically with distinct chemical parts of nucleotide-containing ligands, such as nitrogenous bases, phosphate groups, ribose, nicotinamide, and flavin moieties. Each EFL profile in the database is characterized by a pattern of corresponding ligand-protein interactions found in crystallized ligand-protein complexes. NBDB database helps to explore the determinants of nucleotide and cofactor binding in different protein folds and families. NBDB can also detect fragments that match to profiles of particular EFLs in the protein sequence provided by user. Comprehensive information on sequence, structures, and interactions of EFLs with ligands provides a foundation for experimental and computational efforts on design of required protein functions.

  20. Monoclonal antibodies against an identical short peptide sequence shared by two unrelated proteins.

    PubMed

    Schulze-Gahmen, U; Wilson, I A

    1989-01-01

    Antipeptide antibodies provide the opportunity to explore the molecular basis for antigen-antibody recognition and to test theories of immune recognition. We investigated the possibility of raising monoclonal antipeptide antibodies against a specific epitope consisting of six amino acid residues, which is common to two unrelated proteins. The goal of this investigation was to analyze the reactivity of these epitope specific antibodies towards the same sequence in these two different proteins. A correlation between antibody reactivity and secondary structures of the same peptide sequence in different proteins could help to understand the ability of antipeptide antibodies to react with their cognate sequence in intact folded proteins. Monoclonal antibodies were raised against one hexamer sequence, PGTAPK, that is present in both thioredoxin and Fab New lambda-light chain. The antipeptide antibodies reacted with thioredoxin but not with Fab New in ELISAs, immune precipitation and Western blots. Determination of the antibody specificity through binding tests with peptide analogs revealed the influence of the residue N-terminal from the hexamer epitope on antibody binding. Because of the observed influence of the N-1 adjacent residue in peptide analogs, the discrimination between the protein antigens could not be interpreted clearly as the result of the different hexamer conformations present in the native structures of the two proteins. However, analysis of the antibody reactivity with peptide analogs with varying "frame residues" surrounding the hexamer epitope indicates the possible discrimination of different peptide conformations by the antibody.

  1. Nucleotide binding database NBDB--a collection of sequence motifs with specific protein-ligand interactions.

    PubMed

    Zheng, Zejun; Goncearenco, Alexander; Berezovsky, Igor N

    2016-01-01

    NBDB database describes protein motifs, elementary functional loops (EFLs) that are involved in binding of nucleotide-containing ligands and other biologically relevant cofactors/coenzymes, including ATP, AMP, ATP, GMP, GDP, GTP, CTP, PAP, PPS, FMN, FAD(H), NAD(H), NADP, cAMP, cGMP, c-di-AMP and c-di-GMP, ThPP, THD, F-420, ACO, CoA, PLP and SAM. The database is freely available online at http://nbdb.bii.a-star.edu.sg. In total, NBDB contains data on 249 motifs that work in interactions with 24 ligands. Sequence profiles of EFL motifs were derived de novo from nonredundant Uniprot proteome sequences. Conserved amino acid residues in the profiles interact specifically with distinct chemical parts of nucleotide-containing ligands, such as nitrogenous bases, phosphate groups, ribose, nicotinamide, and flavin moieties. Each EFL profile in the database is characterized by a pattern of corresponding ligand-protein interactions found in crystallized ligand-protein complexes. NBDB database helps to explore the determinants of nucleotide and cofactor binding in different protein folds and families. NBDB can also detect fragments that match to profiles of particular EFLs in the protein sequence provided by user. Comprehensive information on sequence, structures, and interactions of EFLs with ligands provides a foundation for experimental and computational efforts on design of required protein functions. PMID:26507856

  2. The CATH extended protein-family database: providing structural annotations for genome sequences.

    PubMed

    Pearl, Frances M G; Lee, David; Bray, James E; Buchan, Daniel W A; Shepherd, Adrian J; Orengo, Christine A

    2002-02-01

    An automatic sequence search and analysis protocol (DomainFinder) based on PSI-BLAST and IMPALA, and using conservative thresholds, has been developed for reliably integrating gene sequences from GenBank into their respective structural families within the CATH domain database (http://www.biochem.ucl.ac.uk/bsm/cath_new). DomainFinder assigns a new gene sequence to a CATH homologous superfamily provided that PSI-BLAST identifies a clear relationship to at least one other Protein Data Bank sequence within that superfamily. This has resulted in an expansion of the CATH protein family database (CATH-PFDB v1.6) from 19,563 domain structures to 176,597 domain sequences. A further 50,000 putative homologous relationships can be identified using less stringent cut-offs and these relationships are maintained within neighbour tables in the CATH Oracle database, pending further evidence of their suggested evolutionary relationship. Analysis of the CATH-PFDB has shown that only 15% of the sequence families are close enough to a known structure for reliable homology modeling. IMPALA/PSI-BLAST profiles have been generated for each of the sequence families in the expanded CATH-PFDB and a web server has been provided so that new sequences may be scanned against the profile library and be assigned to a structure and homologous superfamily.

  3. Sequence composition and environment effects on residue fluctuations in protein structures

    NASA Astrophysics Data System (ADS)

    Ruvinsky, Anatoly M.; Vakser, Ilya A.

    2010-10-01

    Structure fluctuations in proteins affect a broad range of cell phenomena, including stability of proteins and their fragments, allosteric transitions, and energy transfer. This study presents a statistical-thermodynamic analysis of relationship between the sequence composition and the distribution of residue fluctuations in protein-protein complexes. A one-node-per-residue elastic network model accounting for the nonhomogeneous protein mass distribution and the interatomic interactions through the renormalized inter-residue potential is developed. Two factors, a protein mass distribution and a residue environment, were found to determine the scale of residue fluctuations. Surface residues undergo larger fluctuations than core residues in agreement with experimental observations. Ranking residues over the normalized scale of fluctuations yields a distinct classification of amino acids into three groups: (i) highly fluctuating-Gly, Ala, Ser, Pro, and Asp, (ii) moderately fluctuating-Thr, Asn, Gln, Lys, Glu, Arg, Val, and Cys, and (iii) weakly fluctuating-Ile, Leu, Met, Phe, Tyr, Trp, and His. The structural instability in proteins possibly relates to the high content of the highly fluctuating residues and a deficiency of the weakly fluctuating residues in irregular secondary structure elements (loops), chameleon sequences, and disordered proteins. Strong correlation between residue fluctuations and the sequence composition of protein loops supports this hypothesis. Comparing fluctuations of binding site residues (interface residues) with other surface residues shows that, on average, the interface is more rigid than the rest of the protein surface and Gly, Ala, Ser, Cys, Leu, and Trp have a propensity to form more stable docking patches on the interface. The findings have broad implications for understanding mechanisms of protein association and stability of protein structures.

  4. Rapid Evolution of the Sequences and Gene Repertoires of Secreted Proteins in Bacteria

    PubMed Central

    Rocha, Eduardo P. C.

    2012-01-01

    Proteins secreted to the extracellular environment or to the periphery of the cell envelope, the secretome, play essential roles in foraging, antagonistic and mutualistic interactions. We hypothesize that arms races, genetic conflicts and varying selective pressures should lead to the rapid change of sequences and gene repertoires of the secretome. The analysis of 42 bacterial pan-genomes shows that secreted, and especially extracellular proteins, are predominantly encoded in the accessory genome, i.e. among genes not ubiquitous within the clade. Genes encoding outer membrane proteins might engage more frequently in intra-chromosomal gene conversion because they are more often in multi-genic families. The gene sequences encoding the secretome evolve faster than the rest of the genome and in particular at non-synonymous positions. Cell wall proteins in Firmicutes evolve particularly fast when compared with outer membrane proteins of Proteobacteria. Virulence factors are over-represented in the secretome, notably in outer membrane proteins, but cell localization explains more of the variance in substitution rates and gene repertoires than sequence homology to known virulence factors. Accordingly, the repertoires and sequences of the genes encoding the secretome change fast in the clades of obligatory and facultative pathogens and also in the clades of mutualists and free-living bacteria. Our study shows that cell localization shapes genome evolution. In agreement with our hypothesis, the repertoires and the sequences of genes encoding secreted proteins evolve fast. The particularly rapid change of extracellular proteins suggests that these public goods are key players in bacterial adaptation. PMID:23189144

  5. Sequence analysis and comparison of cDNAs of the zein multigene family.

    PubMed Central

    Geraghty, D E; Messing, J; Rubenstein, I

    1982-01-01

    The nucleotide sequences of two zein cDNAs in hybrid plasmids A20 and B49 have been determined. The insert in A20 is 921 bp long including a 5' non-coding region of 60 nucleotides, preceded by what is believed to be an artifactual sequence of 41 nucleotides, and a 3' non-coding region of 87 nucleotides. The B49 insert is 467 bp long and includes approximately one-half the protein coding sequence as well as a 3' non-coding region of 97 nucleotides. These sequences have been compared with the previously published sequence of another zein clone, A30. A20 and A30, both encoding 19 000 mol. wt. zeins, have approximately 85% homology at the nucleotide level. The B49 sequence, corresponding to a 22 000 mol. wt. zein, has approximately 65% homology to either A20 or A30. All three zeins share common features including nearly identical amino acid compositions. In addition, the tandem repeats of 20 amino acids first seen in A30 are also present in A20 and B49. PMID:6897917

  6. Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics

    PubMed Central

    Koehorst, Jasper J.; Saccenti, Edoardo; Schaap, Peter J.; Martins dos Santos, Vitor A. P.; Suarez-Diez, Maria

    2016-01-01

    A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need for defining arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition and the high computational cost for finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity to identify groups of functionally equivalent proteins within and across taxonomic boundaries. As the computational cost scales linearly, and not quadratically, with the number of genomes, it is suitable for large scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.
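
The linear scaling mentioned in this abstract can be illustrated with a minimal sketch. It assumes, for illustration only, that an architecture is the ordered tuple of domain identifiers found on a protein; the function name, record layout, and the Pfam-style identifiers below are hypothetical and are not taken from the paper. Each protein is visited once and filed under its architecture, so no genome-against-genome comparison is needed.

```python
from collections import defaultdict


def group_by_architecture(records):
    """Group proteins that share the same ordered domain architecture.

    records: iterable of (genome_id, protein_id, domains), where domains is
    an ordered sequence of domain identifiers (hypothetical layout).
    Returns a dict mapping architecture tuple -> list of (genome_id, protein_id).
    """
    groups = defaultdict(list)
    for genome_id, protein_id, domains in records:
        # One dictionary insertion per protein: cost grows linearly with input size.
        groups[tuple(domains)].append((genome_id, protein_id))
    return dict(groups)


# Hypothetical example records; identifiers are placeholders.
records = [
    ("genomeA", "p1", ["PF00005", "PF00664"]),
    ("genomeA", "p2", ["PF00072"]),
    ("genomeB", "p7", ["PF00005", "PF00664"]),
    ("genomeC", "p3", ["PF00664", "PF00005"]),  # same domains, different order
]

groups = group_by_architecture(records)
```

Treating domain order as significant is a design choice of this sketch; an order-insensitive variant would use a sorted tuple or a frozenset as the key.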

  7. Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics

    PubMed Central

    Koehorst, Jasper J.; Saccenti, Edoardo; Schaap, Peter J.; Martins dos Santos, Vitor A. P.; Suarez-Diez, Maria

    2016-01-01

    A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need for defining arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition and the high computational cost for finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity to identify groups of functionally equivalent proteins within and across taxonomic boundaries. As the computational cost scales linearly, and not quadratically, with the number of genomes, it is suitable for large scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness. PMID:27703668

  8. Molecular cloning and sequence determination of the genomic regions encoding protease and genome-linked protein of three picornaviruses.

    PubMed Central

    Werner, G; Rosenwirth, B; Bauer, E; Seifert, J M; Werner, F J; Besemer, J

    1986-01-01

    To investigate the degree of similarity between picornavirus proteases, we cloned the genomic cDNAs of an enterovirus, echovirus 9 (strain Barty), and two rhinoviruses, serotypes 1A and 14LP, and determined the nucleotide sequence of the region which, by analogy to poliovirus, encodes the protease. The nucleotide sequence of the region encoding the genome-linked protein VPg, immediately adjacent to the protease, was also determined. Comparison of nucleotide and deduced amino acid sequences with other available picornavirus sequences showed remarkable homology in proteases and among VPgs. Three highly conserved peptide regions were identified in the protease; one of these is specific for human picornaviruses and has no obvious counterpart in encephalomyocarditis virus, foot-and-mouth disease virus, or cowpea mosaic virus proteases. Within the other two peptide regions two conserved amino acids, Cys 147 and His 161, could be the reactive residues of the active site. We used a statistical method to predict certain features of the secondary structures, such as alpha helices, beta sheets, and turns, and found many of these conformations to be conserved. The hydropathy profiles of the compared proteases were also strikingly similar. Thus, the proteases of human picornaviruses very probably have a similar three-dimensional structure. Images PMID:3512851

  9. α-Amylase of Clostridium thermosulfurogenes EM1: Nucleotide sequence of the gene, processing of the enzyme, and comparison to other α-amylases

    SciTech Connect

    Bahl, H.; Burchhardt, G.; Spreinat, A.; Haeckel, K.; Wienecke, A.; Antranikian, G.; Schmidt, B.

    1991-05-01

    The nucleotide sequence of the α-amylase gene (amyA) from Clostridium thermosulfurogenes EM1 cloned in Escherichia coli was determined. The reading frame of the gene consisted of 2,121 bp. Comparison of the DNA sequence data with the amino acid sequence of the N terminus of the purified secreted protein of C. thermosulfurogenes EM1 suggested that the α-amylase is translated from mRNA as a secretory precursor with a signal peptide of 27 amino acid residues. The deduced amino acid sequence of the mature α-amylase contained 679 residues, resulting in a protein with a molecular mass of 75,112 Da. In E. coli the enzyme was transported to the periplasmic space and the signal peptide was cleaved at exactly the same site between two alanine residues. Comparison of the amino acid sequence of the C. thermosulfurogenes EM1 α-amylase with those from other bacterial and eukaryotic α-amylases showed several homologous regions, probably in the enzymatically functioning regions. The tentative Ca2+-binding site (consensus region I) of this Ca2+-independent enzyme showed only limited homology. The deduced amino acid sequence of a second obviously truncated open reading frame showed significant homology to the malG gene product of E. coli. Comparison of the α-amylase gene region of C. thermosulfurogenes EM1 (DSM3896) with the β-amylase gene region of C. thermosulfurogenes (ATCC 33743) indicated that both genes have been exchanged with each other at identical sites in the chromosomes of these strains.
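
The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes (this is not stated in the abstract) that the 2,121 bp reading frame includes the stop codon; the variable names are illustrative only.

```python
CODON_LENGTH = 3  # nucleotides per codon

# Values quoted in the abstract.
reading_frame_bp = 2121
signal_peptide_residues = 27
mature_residues = 679
mature_mass_da = 75112

# Number of codons in the reading frame.
total_codons = reading_frame_bp // CODON_LENGTH

# Precursor length = signal peptide + mature enzyme.
precursor_residues = signal_peptide_residues + mature_residues

# Codons not accounted for by the precursor; under the stated assumption
# this should be the single stop codon.
remaining_codons = total_codons - precursor_residues

# Average mass per residue of the mature enzyme, in daltons.
mean_residue_mass_da = mature_mass_da / mature_residues

print(total_codons, precursor_residues, remaining_codons)
print(round(mean_residue_mass_da, 1))
```

Under that assumption the numbers are mutually consistent: the reading frame holds exactly one codon more than the 706-residue precursor.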

  10. ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids.

    PubMed

    Ashkenazy, Haim; Erez, Elana; Martz, Eric; Pupko, Tal; Ben-Tal, Nir

    2010-07-01

    It is informative to detect highly conserved positions in proteins and nucleic acid sequence/structure since they are often indicative of structural and/or functional importance. ConSurf (http://consurf.tau.ac.il) and ConSeq (http://conseq.tau.ac.il) are two well-established web servers for calculating the evolutionary conservation of amino acid positions in proteins using an empirical Bayesian inference, starting from protein structure and sequence, respectively. Here, we present the new version of the ConSurf web server that combines the two independent servers, providing an easier and more intuitive step-by-step interface, while offering the user more flexibility during the process. In addition, the new version of ConSurf calculates the evolutionary rates for nucleic acid sequences. The new version is freely available at: http://consurf.tau.ac.il/.

  11. Tough Coating Proteins: Subtle Sequence Variation Modulates Cohesion

    PubMed Central

    Das, Saurabh; Miller, Dusty R.; Kaufman, Yair; Martinez Rodriguez, Nadine R.; Pallaoro, Alessia; Harrington, Matthew J.; Gylys, Maryte; Israelachvili, Jacob N.; Waite, J. Herbert

    2015-01-01

    Mussel foot protein-1 (mfp-1) is an essential constituent of the protective cuticle covering all exposed portions of the byssus (plaque and the thread) that marine mussels use to attach to intertidal rocks. The reversible complexation of Fe3+ by the 3,4-dihydroxyphenylalanine (Dopa) side chains in mfp-1 in Mytilus californianus cuticle is responsible for its high extensibility (120%) as well as its stiffness (2 GPa) due to the formation of sacrificial bonds that help to dissipate energy and avoid accumulation of stresses in the material. We have investigated the interactions between Fe3+ and mfp-1 from two mussel species, M. californianus (Mc) and M. edulis (Me), using both surface sensitive and solution phase techniques. Our results show that although mfp-1 homologues from both species bind Fe3+, mfp-1 (Mc) contains Dopa with two distinct Fe3+-binding tendencies and prefers to form intramolecular complexes with Fe3+. In contrast, mfp-1 (Me) is better adapted to intermolecular Fe3+ binding by Dopa. Addition of Fe3+ did not significantly increase the cohesion energy between the mfp-1 (Mc) films at pH 5.5. However, iron appears to stabilize the cohesive bridging of mfp-1 (Mc) films at the physiologically relevant pH of 7.5, where most other mfps lose their ability to adhere reversibly. Understanding the molecular mechanisms underpinning the capacity of M. californianus cuticle to withstand twice the strain of M. edulis cuticle is important for engineering of tunable strain tolerant composite coatings for biomedical applications. PMID:25692318

  12. Variation in the prion protein sequence in Dutch goat breeds.

    PubMed

    Windig, J J; Hoving, R A H; Priem, J; Bossers, A; van Keulen, L J M; Langeveld, J P M

    2016-10-01

    Scrapie is a neurodegenerative disease occurring in goats and sheep. Several haplotypes of the prion protein increase resistance to scrapie infection and may be used in selective breeding to help eradicate scrapie. In this study, frequencies of the allelic variants of the PrP gene are determined for six goat breeds in the Netherlands. Overall frequencies in Dutch goats were determined from 768 brain tissue samples in 2005, 766 in 2008 and 300 in 2012, derived from random sampling for the national scrapie surveillance without knowledge of the breed. Breed specific frequencies were determined in the winter 2013/2014 by sampling 300 breeding animals from the main breeders of the different breeds. Detailed analysis of the scrapie-resistant K222 haplotype was carried out in 2014 for 220 Dutch Toggenburger goats and in 2015 for 942 goats from the Saanen derived White Goat breed. Nine haplotypes were identified in the Dutch breeds. Frequencies for non-wild type haplotypes were generally low. The exceptions were the K222 haplotype in the Dutch Toggenburger (29%) and the S146 haplotype in the Nubian and Boer breeds (respectively 7 and 31%). The frequency of the K222 haplotype in the Toggenburger was higher than for any other breed reported in literature, while for the White Goat breed it was, at 3.1%, similar to frequencies of other Saanen or Saanen derived breeds. Further evidence was found for the existence of two M142 haplotypes, M142/S240 and M142/P240. Breeds vary in haplotype frequencies but frequencies of resistant genotypes are generally low and consequently selective breeding for scrapie resistance can only be slow but will benefit from animals identified in this study. The unexpectedly high frequency of the K222 haplotype in the Dutch Toggenburger underlines the need for conservation of rare breeds in order to conserve genetic diversity rare or absent in other breeds. PMID:26991480

  13. The utility of artificially evolved sequences in protein threading and fold recognition.

    PubMed

    Brylinski, Michal

    2013-07-01

    Template-based protein structure prediction plays an important role in Functional Genomics by providing structural models of gene products, which can be utilized by structure-based approaches to function inference. From a systems level perspective, the high structural coverage of gene products in a given organism is critical. Despite continuous efforts towards the development of more sensitive threading approaches, confident structural models cannot be constructed for a considerable fraction of proteins due to difficulties in recognizing low-sequence identity templates with a similar fold to the target. Here we introduce a new modeling stratagem, which employs a library of synthetic sequences to improve template ranking in fold recognition by sequence profile-based methods. We developed a new method for the optimization of generic protein-like amino acid sequences to stabilize the respective structures using a combined empirical scoring function, which is compatible with those commonly used in protein threading and fold recognition. We show that the artificially evolved sequences, whose average sequence identity to the wild-type sequences is as low as 13.8%, have significant capabilities to recognize the correct structures. Importantly, the quality of the corresponding threading alignments is comparable to those constructed using conventional wild-type approaches (the average TM-score is 0.48 and 0.54, respectively). Fold recognition that uses data fusion to combine ranks calculated for both wild-type and synthetic template libraries systematically improves the detection of structural analogs. Depending on the threading algorithm used, it yields on average 4-16% higher recognition rates than using the wild-type template library alone. Synthetic sequences artificially evolved for the template structures provide an orthogonal source of signal that could be exploited to detect those templates unrecognized by standard modeling techniques.
It opens up new directions in

  14. Conversion of amino-acid sequence in proteins to classical music: search for auditory patterns

    PubMed Central

    2007-01-01

    We have converted genome-encoded protein sequences into musical notes to reveal auditory patterns without compromising musicality. We derived a reduced range of 13 base notes by pairing similar amino acids and distinguishing them using variations of three-note chords and codon distribution to dictate rhythm. The conversion will help make genomic coding sequences more approachable for the general public, young children, and vision-impaired scientists. PMID:17477882

  15. Cloning and sequencing of a cDNA encoding a taste-modifying protein, miraculin.

    PubMed

    Masuda, Y; Nirasawa, S; Nakaya, K; Kurihara, Y

    1995-08-19

    A cDNA clone encoding a taste-modifying protein, miraculin (MIR), was isolated and sequenced. The encoded precursor to MIR was composed of 220 amino acid (aa) residues, including a possible signal sequence of 29 aa. Northern blot analysis showed that the mRNA encoding MIR was already expressed in fruits of Richadella dulcifica at 3 weeks after pollination and was present specifically in the pulp. PMID:7665074

  16. Naked but not Hairless: the pitfalls of analyses of molecular adaptation based on few genome sequence comparisons.

    PubMed

    Delsuc, Frédéric; Tilak, Marie-Ka

    2015-02-20

    The naked mole-rat (Heterocephalus glaber) is the only rodent species that naturally lacks fur. Genome sequencing of this atypical rodent species recently shed light on a number of its morphological and physiological adaptations. More specifically, its hairless phenotype has been traced back to a single amino acid change (C397W) in the hair growth associated (HR) protein (or Hairless). By considering the available species diversity, we show that this specific position is in fact variable across mammals, including in the horse that was misleadingly reported to have the ancestral Cysteine. Moreover, by sequencing the corresponding HR exon in additional rodent species, we demonstrate that the C397W substitution is actually not a peculiarity of the naked mole-rat. Instead, this specific amino acid substitution is present in all hystricognath rodents investigated, which are all fully furred, including the naked mole-rat's closest relative, the Damaraland mole-rat (Fukomys damarensis). Overall, we found no statistical correlation between amino acid changes at position 397 of the HR protein and reduced pilosity across the mammalian phylogeny. This demonstrates that this single amino acid change does not explain the naked mole-rat hairless phenotype. Our case study calls for caution before making strong claims regarding the molecular basis of phenotypic adaptation based on the screening of specific amino acid substitutions using only a few model species in genome sequence comparisons. It also exposes the more general problem of the dilution of essential information in the supplementary material of genome papers, thereby increasing the probability that misleading results will escape the scrutiny of editors, reviewers, and ultimately readers.

  17. Hfqs in Bacillus anthracis: Role of protein sequence variation in the structure and function of proteins in the Hfq family.

    PubMed

    Vrentas, Catherine; Ghirlando, Rodolfo; Keefer, Andrea; Hu, Zonglin; Tomczak, Aurelie; Gittis, Apostolos G; Murthi, Athulaprabha; Garboczi, David N; Gottesman, Susan; Leppla, Stephen H

    2015-11-01

    Hfq proteins in Gram-negative bacteria play important roles in bacterial physiology and virulence, mediated by binding of the Hfq hexamer to small RNAs and/or mRNAs to post-transcriptionally regulate gene expression. However, the physiological role of Hfqs in Gram-positive bacteria is less clear. Bacillus anthracis, the causative agent of anthrax, uniquely expresses three distinct Hfq proteins, two from the chromosome (Hfq1, Hfq2) and one from its pXO1 virulence plasmid (Hfq3). The protein sequences of Hfq1 and 3 are evolutionarily distinct from those of Hfq2 and of Hfqs found in other Bacilli. Here, the quaternary structure of each B. anthracis Hfq protein, as produced heterologously in Escherichia coli, was characterized. While Hfq2 adopts the expected hexamer structure, Hfq1 does not form similarly stable hexamers in vitro. The impact on the monomer-hexamer equilibrium of varying Hfq C-terminal tail length and other sequence differences among the Hfqs was examined, and a sequence region of the Hfq proteins that was involved in hexamer formation was identified. It was found that, in addition to the distinct higher-order structures of the Hfq homologs, they give rise to different phenotypes. Hfq1 has a disruptive effect on the function of E. coli Hfq in vivo, while Hfq3 expression at high levels is toxic to E. coli but also partially complements Hfq function in E. coli. These results set the stage for future studies of the roles of these proteins in B. anthracis physiology and for the identification of sequence determinants of phenotypic complementation.

  18. A plant viral coat protein RNA binding consensus sequence contains a crucial arginine.

    PubMed Central

    Ansel-McKinney, P; Scott, S W; Swanson, M; Ge, X; Gehrke, L

    1996-01-01

    A defining feature of alfalfa mosaic virus (AMV) and ilarviruses [type virus: tobacco streak virus (TSV)] is that, in addition to genomic RNAs, viral coat protein is required to establish infection in plants. AMV and TSV coat proteins, which share little primary amino acid sequence identity, are functionally interchangeable in RNA binding and initiation of infection. The lysine-rich amino-terminal RNA binding domain of the AMV coat protein lacks previously identified RNA binding motifs. Here, the AMV coat protein RNA binding domain is shown to contain a single arginine whose specific side chain and position are crucial for RNA binding. In addition, the putative RNA binding domain of two ilarvirus coat proteins, TSV and citrus variegation virus, is identified and also shown to contain a crucial arginine. AMV and ilarvirus coat protein sequence alignment centering on the key arginine revealed a new RNA binding consensus sequence. This consensus may explain in part why heterologous viral RNA-coat protein mixtures are infectious. PMID:8890181

  19. PDP-CON: prediction of domain/linker residues in protein sequences using a consensus approach.

    PubMed

    Chatterjee, Piyali; Basu, Subhadip; Zubek, Julian; Kundu, Mahantapas; Nasipuri, Mita; Plewczynski, Dariusz

    2016-04-01

    The prediction of domain/linker residues in protein sequences is a crucial task in the functional classification of proteins, homology-based protein structure prediction, and high-throughput structural genomics. In this work, a novel consensus-based machine-learning technique was applied for residue-level prediction of the domain/linker annotations in protein sequences using ordered/disordered regions along protein chains and a set of physicochemical properties. Six different classifiers (decision tree, Gaussian naïve Bayes, linear discriminant analysis, support vector machine, random forest, and multilayer perceptron) were exhaustively explored for the residue-level prediction of domain/linker regions. The protein sequences from the curated CATH database were used for training and cross-validation experiments. Test results obtained by applying the developed PDP-CON tool to the mutually exclusive, independent proteins of the CASP-8, CASP-9, and CASP-10 databases are reported. An n-star quality consensus approach was used to combine the results yielded by different classifiers. The average PDP-CON accuracy and F-measure values for the CASP targets were found to be 0.86 and 0.91, respectively. The dataset, source code, and all supplementary materials for this work are available at https://cmaterju.org/cmaterbioinfo/ for noncommercial use.

  20. Effect of k-tuple length on sample-comparison with high-throughput sequencing data.

    PubMed

    Wang, Ying; Lei, Xiaoye; Wang, Shun; Wang, Zicheng; Song, Nianfeng; Zeng, Feng; Chen, Ting

    2016-01-22

    High-throughput metagenomic sequencing offers a powerful technique for comparing microbial communities. Without requiring extra reference sequences, alignment-free models with short k-tuples (k = 2-10 bp) have yielded promising results. Short k-tuples describe the overall statistical distribution but struggle to capture the specific characteristics within a single microbial community. Longer k-tuples contain more information. However, because the frequency vector of long k-tuples (k ≥ 30 bp) is sparse, the statistical measures designed for short k-tuples are not applicable. In our study, we considered each tuple as a meaningful word and each sequencing dataset as a document composed of those words. The comparison between two sequencing datasets is therefore processed as "topic analysis of documents" in text mining. We designed a pipeline with long k-tuple features to compare metagenomic samples using a combination of algorithms from text mining and pattern recognition. The pipeline is available at http://culotuple.codeplex.com/. Experiments show that our pipeline with long k-tuple features: ① separates genomes with high similarity; ② outperforms short k-tuple models in all experiments. When k ≥ 12, the short k-tuple measures are no longer applicable. When k is between 20 and 40, the long k-tuple pipeline obtains much better grouping results; ③ is free from the effect of sequencing platforms/protocols. ④ We obtained meaningful and supported biological results on the 40-tuples selected for comparison. PMID:26721429
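
    The "tuples as words, samples as documents" idea in the abstract above can be illustrated with a minimal sketch. It only builds the k-tuple frequency vectors and compares them with cosine similarity, which is a simple stand-in chosen here for illustration and not the topic-analysis pipeline the authors describe. The function names and the toy reads are hypothetical.

```python
from collections import Counter
import math


def ktuple_counts(reads, k):
    """Count every overlapping k-tuple ("word") across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (stand-in measure)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "samples" (made-up reads, for illustration only).
sample_a = ["ACGTACGTAC", "GTACGTACGT"]
sample_b = ["ACGTACGTAC", "TTTTGGGGCC"]

for k in (2, 8):
    va, vb = ktuple_counts(sample_a, k), ktuple_counts(sample_b, k)
    # Longer tuples give more distinct words per sample, so vectors get sparser.
    print(k, len(va), len(vb), cosine_similarity(va, vb))
```

    With longer k the number of possible words grows as 4^k while the number of observed words is bounded by the read data, which is the sparsity the abstract refers to.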

  1. Evidence of mineralization activity and supramolecular assembly by the N-terminal sequence of ACCBP, a biomineralization protein that is homologous to the acetylcholine binding protein family.

    PubMed

    Amos, Fairland F; Ndao, Moise; Evans, John Spencer

    2009-12-14

    Several biomineralization proteins that exhibit intrinsic disorder also possess sequence regions that are homologous to nonmineral associated folded proteins. One such protein is the amorphous calcium carbonate binding protein (ACCBP), one of several proteins that regulate the formation of the oyster shell and exhibit 30% conserved sequence identity to the acetylcholine binding protein sequences. To gain a better understanding of the ACCBP protein, we utilized bioinformatic approaches to identify the location of disordered and folded regions within this protein. In addition, we synthesized a 50 AA polypeptide, ACCN, representing the N-terminal domain of the mature processed ACCBP protein. We then utilized this polypeptide to determine the mineralization activity and qualitative structure of the N-terminal region of ACCBP. Our bioinformatic studies indicate that ACCBP consists of a ten-stranded beta-sandwich structure that includes short disordered sequence blocks, two of which reside within the primarily helical and surface-accessible ACCN sequence. Circular dichroism studies reveal that ACCN is partially disordered in solution; however, ACCN can be induced to fold into an alpha helix in the presence of TFE. Furthermore, we confirm that the ACCN sequence is multifunctional; this sequence promotes radial calcite polycrystal growth on Kevlar threads and forms supramolecular assemblies in solution that contain amorphous-appearing deposits. We conclude that the partially disordered ACCN sequence is a putative site for mineralization activity within the ACCBP protein and that the presence of short disordered sequence regions within the ACCBP fold are essential for function.

  2. Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties.

    PubMed

    Neuwald, Andrew F; Altschul, Stephen F

    2016-05-01

    We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/. PMID:27192614

  3. Sequence preservation of osteocalcin protein and mitochondrial DNA in bison bones older than 55 ka

    NASA Astrophysics Data System (ADS)

    Nielsen-Marsh, Christina M.; Ostrom, Peggy H.; Gandhi, Hasand; Shapiro, Beth; Cooper, Alan; Hauschka, Peter V.; Collins, Matthew J.

    2002-12-01

    We report the first complete sequences of the protein osteocalcin from small amounts (20 mg) of two bison bones (Bison priscus) dated to older than 55.6 ka and older than 58.9 ka. Osteocalcin was purified using new gravity columns (never exposed to protein) followed by microbore reversed-phase high-performance liquid chromatography. Sequencing of osteocalcin employed two methods of matrix-assisted laser desorption ionization mass spectrometry (MALDI-MS): peptide mass mapping (PMM) and post-source decay (PSD). The PMM shows that ancient and modern bison osteocalcin have the same mass to charge (m/z) distribution, indicating an identical protein sequence and absence of diagenetic products. This was confirmed by PSD of the m/z 2066 tryptic peptide (residues 1-19); the mass spectra from ancient and modern peptides were identical. The 129 mass unit difference in the molecular ion between cow (Bos taurus) and bison is caused by a single amino-acid substitution between the taxa (Trp in cow is replaced by Gly in bison at residue 5). Bison mitochondrial control region DNA sequences were obtained from the older than 55.6 ka fossil. These results suggest that DNA and protein sequences can be used to directly investigate molecular phylogenies over a considerable time period, the absolute limit of which is yet to be determined.
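
    The 129 mass unit difference quoted above follows directly from the residue masses of the two amino acids. The short check below uses standard average residue masses (186.21 Da for Trp, 57.05 Da for Gly); these values are supplied here as an assumption and are not taken from the abstract.

```python
# Standard average residue masses in daltons (assumed values, not from the abstract).
RESIDUE_MASS = {"Trp": 186.21, "Gly": 57.05}


def substitution_mass_shift(old, new):
    """Mass change of a peptide when residue `old` is replaced by `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]


# Cow (Trp at residue 5) versus bison (Gly at residue 5).
shift = substitution_mass_shift("Trp", "Gly")
print(f"Trp -> Gly shift: {shift:.2f} Da")
```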

  4. Efficient use of unlabeled data for protein sequence classification: a comparative study

    PubMed Central

    Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir

    2009-01-01

    Background: Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags–the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Results: Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. Conclusion: The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably. PMID:19426450

  5. Mapping Protein-DNA Interactions Using ChIP-exo and Illumina-Based Sequencing.

    PubMed

    Barfeld, Stefan J; Mills, Ian G

    2016-01-01

    Chromatin immunoprecipitation (ChIP) provides a means of enriching DNA associated with transcription factors, histone modifications, and indeed any other proteins for which suitably characterized antibodies are available. Over the years, sequence detection has progressed from quantitative real-time PCR and Southern blotting to microarrays (ChIP-chip) and now high-throughput sequencing (ChIP-seq). This progression has vastly increased the sequence coverage and data volumes generated. This in turn has enabled informaticians to predict the identity of multi-protein complexes on DNA based on the overrepresentation of sequence motifs in DNA enriched by ChIP with a single antibody against a single protein. In the course of the development of high-throughput sequencing, little has changed in the ChIP methodology until recently. In the last three years, a number of modifications have been made to the ChIP protocol with the goal of enhancing the sensitivity of the method and further reducing the levels of nonspecific background sequences in ChIPped samples. In this chapter, we provide a brief commentary on these methodological changes and describe a detailed ChIP-exo method able to generate narrower peaks and greater peak coverage from ChIPped material.

  6. GeneSV - an Approach to Help Characterize Possible Variations in Genomic and Protein Sequences.

    PubMed

    Zemla, Adam; Kostova, Tanya; Gorchakov, Rodion; Volkova, Evgeniya; Beasley, David W C; Cardosa, Jane; Weaver, Scott C; Vasilakis, Nikos; Naraghi-Arani, Pejman

    2014-01-01

    A computational approach for identification and assessment of genomic sequence variability (GeneSV) is described. For a given nucleotide sequence, GeneSV collects information about the permissible nucleotide variability (changes that potentially preserve function) observed in corresponding regions in genomic sequences, and combines it with conservation/variability results from protein sequence and structure-based analyses of evaluated protein coding regions. GeneSV was used to predict effects (functional vs. non-functional) of 37 amino acid substitutions on the NS5 polymerase (RdRp) of dengue virus type 2 (DENV-2), 36 of which are not observed in any publicly available DENV-2 sequence. 32 novel mutants with single amino acid substitutions in the RdRp were generated using a DENV-2 reverse genetics system. In 81% (26 of 32) of predictions tested, GeneSV correctly predicted viability of introduced mutations. In 4 of 5 (80%) mutants with double amino acid substitutions proximal in structure to one another, GeneSV was also correct in its predictions. Predictive capabilities of the developed system were illustrated on dengue RNA virus, but the general approach described in the manuscript for characterizing real or theoretically possible variations in genomic and protein sequences can be applied to any organism. PMID:24453480

  7. A Comparison Study for DNA Motif Modeling on Protein Binding Microarray.

    PubMed

    Wong, Ka-Chun; Li, Yue; Peng, Chengbin; Wong, Hau-San

    2016-01-01

    Transcription factor binding sites (TFBSs) are relatively short (5-15 bp) and degenerate. Identifying them is a computationally challenging task. In particular, protein binding microarray (PBM) is a high-throughput platform that can measure the DNA binding preference of a protein in a comprehensive and unbiased manner; for instance, a typical PBM experiment can measure binding signal intensities of a protein to all possible DNA k-mers (k = 8∼10). Since proteins can often bind to DNA with different binding intensities, one of the major challenges is to build TFBS (also known as DNA motif) models which can fully capture the quantitative binding affinity data. To learn DNA motif models from the non-convex objective function landscape, several optimization methods are compared and applied to the PBM motif model building problem. In particular, representative methods from different optimization paradigms have been chosen for modeling performance comparison on hundreds of PBM datasets. The results suggest that the multimodal optimization methods are very effective for capturing the binding preference information from PBM data. In particular, we observe a general performance improvement if choosing di-nucleotide modeling over mono-nucleotide modeling. In addition, the models learned by the best-performing method are applied to two independent applications: PBM probe rotation testing and ChIP-Seq peak sequence prediction, demonstrating its biological applicability.

  8. Octopus S-crystallins with endogenous glutathione S-transferase (GST) activity: sequence comparison and evolutionary relationships with authentic GST enzymes.

    PubMed Central

    Chiou, S H; Yu, C W; Lin, C W; Pan, F M; Lu, S F; Lee, H J; Chang, G G

    1995-01-01

    S-Crystallin is a major protein present in the lenses of cephalopods (octopus and squid). To facilitate the cloning of this crystallin gene, cDNA was constructed from the poly(A)+ mRNA of octopus lenses, and amplified by PCR for nucleotide sequencing. Sequencing of 10 of 15 positive clones coding for this crystallin revealed three distinct S-crystallin isoforms with 61-64% identity in nucleotide sequences and 42-58% similarity in amino acid sequences when compared with homologous crystallins in squid lenses. These charge-isomeric crystallins also show between 26 and 33% amino acid sequence identity to four major classes of glutathione S-transferase (GST), a major detoxification enzyme present in most mammalian tissues. For further analysis, expression of one of the S-crystallin cDNAs was carried out in the bacterial expression system pQE-30, and the S-crystallin protein produced in Escherichia coli was purified to homogeneity to determine the enzymic properties. We found that the expressed octopus S-crystallin possessed much lower GST activity than the authentic GSTs from other tissues. Sequence comparison and construction of phylogenetic trees for S-crystallins from squid and octopus lenses and various classes of GSTs revealed that S-crystallins represent a multigene family which is structurally related to Alpha-class GSTs and probably derived from the ancestral GST by gene duplication and subsequent multiple mutational substitutions. PMID:7639695

  9. Characterization and amino acid sequence of a fatty acid-binding protein from human heart.

    PubMed Central

    Offner, G D; Brecher, P; Sawlivich, W B; Costello, C E; Troxler, R F

    1988-01-01

    The complete amino acid sequence of a fatty acid-binding protein from human heart was determined by automated Edman degradation of CNBr, BNPS-skatole [3'-bromo-3-methyl-2-(2-nitrobenzenesulphenyl)indolenine], hydroxylamine, Staphylococcus aureus V8 proteinase, tryptic and chymotryptic peptides, and by digestion of the protein with carboxypeptidase A. The sequence of the blocked N-terminal tryptic peptide from citraconylated protein was determined by collisionally induced decomposition mass spectrometry. The protein contains 132 amino acid residues, is enriched with respect to threonine and lysine, lacks cysteine, has an acetylated valine residue at the N-terminus, and has an Mr of 14768 and an isoelectric point of 5.25. This protein contains two short internal repeated sequences from residues 48-54 and from residues 114-119 located within regions of predicted beta-structure and decreasing hydrophobicity. These short repeats are contained within two longer repeated regions from residues 48-60 and residues 114-125, which display 62% sequence similarity. These regions could accommodate the charged and uncharged moieties of long-chain fatty acids and may represent fatty acid-binding domains consistent with the finding that human heart fatty acid-binding protein binds 2 mol of oleate or palmitate/mol of protein. Detailed evidence for the amino acid sequences of the peptides has been deposited as Supplementary Publication SUP 50143 (23 pages) at the British Library Lending Division, Boston Spa, Yorkshire LS23 7BQ, U.K., from whom copies may be obtained as indicated in Biochem. J. (1988) 249, 5. PMID:3421901

  10. Unraveling the sequence and structure of the protein osteocalcin from a 42 ka fossil horse

    NASA Astrophysics Data System (ADS)

    Ostrom, Peggy H.; Gandhi, Hasand; Strahler, John R.; Walker, Angela K.; Andrews, Philip C.; Leykam, Joseph; Stafford, Thomas W.; Kelly, Robert L.; Walker, Danny N.; Buckley, Mike; Humpula, James

    2006-04-01

    We report the first complete amino acid sequence and evidence of secondary structure for osteocalcin from a temperate fossil. The osteocalcin derives from a 42 ka equid bone excavated from Juniper Cave, Wyoming. Results were determined by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-MS) and Edman sequencing with independent confirmation of the sequence in two laboratories. The ancient sequence was compared to that of three modern taxa: horse (Equus caballus), zebra (Equus grevyi), and donkey (Equus asinus). Although there was no difference in sequence among modern taxa, MALDI-MS and Edman sequencing show that residues 48 and 49 of our modern horse are Thr, Ala rather than Pro, Val as previously reported (Carstanjen B., Wattiez, R., Armory, H., Lepage, O.M., Remy, B., 2002. Isolation and characterization of equine osteocalcin. Ann. Med. Vet. 146(1), 31-38). MALDI-MS and Edman sequencing data indicate that the osteocalcin sequence of the 42 ka fossil is similar to that of modern horse. Previously inaccessible structural attributes for ancient osteocalcin were observed. Glu 39 rather than Gln 39 is consistent with deamidation, a process known to occur during fossilization and aging. Two post-translational modifications were documented: Hyp 9 and a disulfide bridge. The latter suggests at least partial retention of secondary structure. As has been done for ancient DNA research, we recommend standards for preparation and criteria for authenticating results of ancient protein sequencing.

  11. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins.

    PubMed

    Garrow, Andrew G; Agnew, Alison; Westhead, David R

    2005-07-01

    TMB-Hunt is a program that uses a modified k-nearest neighbour (k-NN) algorithm to classify protein sequences as transmembrane beta-barrel (TMB) or non-TMB on the basis of whole sequence amino acid composition. By including differentially weighted amino acids, evolutionary information and by calibrating the scoring, a discrimination accuracy of 92.5% was achieved, as tested using a rigorous cross-validation procedure. The TMB-Hunt web server, available at www.bioinformatics.leeds.ac.uk/betaBarrel, allows screening of up to 10,000 sequences in a single query and provides results and key statistics in a simple colour coded format.
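
    Classification by whole-sequence amino acid composition, as described above, can be sketched with a plain k-nearest-neighbour vote. This is an unweighted illustration only: the differential amino acid weighting, evolutionary information, and score calibration mentioned in the abstract are not reproduced, and the training sequences and labels below are made up.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k nearest training sequences.

    Distance is plain Euclidean distance between composition vectors.
    """
    q = composition(query)
    scored = sorted((math.dist(q, composition(seq)), label) for seq, label in training)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set (not real TMB or non-TMB proteins).
training = [
    ("GYGYGYTTNN", "TMB"),
    ("YGTNYGTNSA", "TMB"),
    ("LLLLAAAAKK", "non-TMB"),
    ("EEKKLLAAEE", "non-TMB"),
]

print(knn_classify("GYTNGYTN", training))
print(knn_classify("LAKELAKE", training))
```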

  12. Sequence variability is correlated with weak immunogenicity in Streptococcus pyogenes M protein

    PubMed Central

    Lannergård, Jonas; Kristensen, Bodil M; Gustafsson, Mattias C U; Persson, Jenny J; Norrby-Teglund, Anna; Stålhammar-Carlemalm, Margaretha; Lindahl, Gunnar

    2015-01-01

    The M protein of Streptococcus pyogenes, a major bacterial virulence factor, has an amino-terminal hypervariable region (HVR) that is a target for type-specific protective antibodies. Intriguingly, the HVR elicits a weak antibody response, indicating that it escapes host immunity by two mechanisms, sequence variability and weak immunogenicity. However, the properties influencing the immunogenicity of regions in an M protein remain poorly understood. Here, we studied the antibody response to different regions of the classical M1 and M5 proteins, in which not only the HVR but also the adjacent fibrinogen-binding B repeat region exhibits extensive sequence divergence. Analysis of antisera from S. pyogenes-infected patients, infected mice, and immunized mice showed that both the HVR and the B repeat region elicited weak antibody responses, while the conserved carboxy-terminal part was immunodominant. Thus, we identified a correlation between sequence variability and weak immunogenicity for M protein regions. A potential explanation for the weak immunogenicity was provided by the demonstration that protease digestion selectively eliminated the HVR-B part from whole M protein-expressing bacteria. These data support a coherent model, in which the entire variable HVR-B part evades antibody attack, not only by sequence variability but also by weak immunogenicity resulting from protease attack. PMID:26175306

  13. Protein backbone angle restraints from searching a database for chemical shift and sequence homology.

    PubMed

    Cornilescu, G; Delaglio, F; Bax, A

    1999-03-01

    Chemical shifts of backbone atoms in proteins are exquisitely sensitive to local conformation, and homologous proteins show quite similar patterns of secondary chemical shifts. The inverse of this relation is used to search a database for triplets of adjacent residues with secondary chemical shifts and sequence similarity which provide the best match to the query triplet of interest. The database contains 13C alpha, 13C beta, 13C', 1H alpha and 15N chemical shifts for 20 proteins for which a high resolution X-ray structure is available. The computer program TALOS was developed to search this database for strings of residues with chemical shift and residue type homology. The relative importance of the weighting factors attached to the secondary chemical shifts of the five types of resonances relative to that of sequence similarity was optimized empirically. TALOS yields the 10 triplets which have the closest similarity in secondary chemical shift and amino acid sequence to those of the query sequence. If the central residues in these 10 triplets exhibit similar phi and psi backbone angles, their averages can reliably be used as angular restraints for the protein whose structure is being studied. Tests carried out for proteins of known structure indicate that the root-mean-square difference (rmsd) between the output of TALOS and the X-ray derived backbone angles is about 15 degrees. Approximately 3% of the predictions made by TALOS are found to be in error.
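
    The final step described above, averaging the phi and psi angles of the best-matching triplets when they agree, can be sketched as follows. Because backbone angles wrap around at ±180 degrees, the sketch uses a circular mean. The agreement test and its 30-degree tolerance are hypothetical choices made for illustration; the abstract does not state the criterion TALOS applies.

```python
import math


def circular_mean_deg(angles):
    """Mean of angles in degrees that respects wrap-around at +/-180."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))


def angular_diff_deg(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def restraint_from_matches(matches, tolerance_deg=30.0):
    """Average (phi, psi) pairs if every match lies near the circular mean.

    `matches` is a list of (phi, psi) tuples for the central residues of the
    best-matching triplets. Returns None when the matches disagree.
    The tolerance is an assumed value, not one taken from TALOS.
    """
    phis = [phi for phi, _ in matches]
    psis = [psi for _, psi in matches]
    mean_phi, mean_psi = circular_mean_deg(phis), circular_mean_deg(psis)
    consistent = all(
        angular_diff_deg(phi, mean_phi) <= tolerance_deg
        and angular_diff_deg(psi, mean_psi) <= tolerance_deg
        for phi, psi in matches
    )
    return (mean_phi, mean_psi) if consistent else None


helical = [(-60.0, -45.0)] * 10                      # ten agreeing matches
mixed = [(-60.0, -45.0)] * 5 + [(-120.0, 130.0)] * 5  # helix and strand mixed
print(restraint_from_matches(helical))
print(restraint_from_matches(mixed))
```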

  14. C-Terminal DxD-Containing Sequences within Paramyxovirus Nucleocapsid Proteins Determine Matrix Protein Compatibility and Can Direct Foreign Proteins into Budding Particles

    PubMed Central

    Ray, Greeshma; Schmitt, Phuong Tieu

    2016-01-01

    ABSTRACT Paramyxovirus particles are formed by a budding process coordinated by viral matrix (M) proteins. M proteins coalesce at sites underlying infected cell membranes and induce other viral components, including viral glycoproteins and viral ribonucleoprotein complexes (vRNPs), to assemble at these locations from which particles bud. M proteins interact with the nucleocapsid (NP or N) components of vRNPs, and these interactions enable production of infectious, genome-containing virions. For the paramyxoviruses parainfluenza virus 5 (PIV5) and mumps virus, M-NP interaction also contributes to efficient production of virus-like particles (VLPs) in transfected cells. A DLD sequence near the C-terminal end of PIV5 NP protein was previously found to be necessary for M-NP interaction and efficient VLP production. Here, we demonstrate that 15-residue-long, DLD-containing sequences derived from either the PIV5 or Nipah virus nucleocapsid protein C-terminal ends are sufficient to direct packaging of a foreign protein, Renilla luciferase, into budding VLPs. Mumps virus NP protein harbors DWD in place of the DLD sequence found in PIV5 NP protein, and consequently, PIV5 NP protein is incompatible with mumps virus M protein. A single amino acid change converting DLD to DWD within PIV5 NP protein induced compatibility between these proteins and allowed efficient production of mumps VLPs. Our data suggest a model in which paramyxoviruses share an overall common strategy for directing M-NP interactions but with important variations contained within DLD-like sequences that play key roles in defining M/NP protein compatibilities. IMPORTANCE Paramyxoviruses are responsible for a wide range of diseases that affect both humans and animals. Paramyxovirus pathogens include measles virus, mumps virus, human respiratory syncytial virus, and the zoonotic paramyxoviruses Nipah virus and Hendra virus. Infectivity of paramyxovirus particles depends on matrix-nucleocapsid protein

  15. Prediction of protein structural features from sequence data based on Shannon entropy and Kolmogorov complexity.

    PubMed

    Bywater, Robert Paul

    2015-01-01

    While the genome for a given organism stores the information necessary for the organism to function and flourish, it is the proteins encoded by the genome that, perhaps more than anything else, characterize the phenotype of that organism. It is therefore not surprising that one of the many approaches to understanding and predicting protein folding and properties has come from genomics and, more specifically, from multiple sequence alignments. In this work I explore ways in which data derived from sequence alignments can be used to investigate, in a predictive way, three different aspects of protein structure: secondary structures, inter-residue contacts and the dynamics of switching between different states of the protein. In particular, the use of Kolmogorov complexity has identified a novel pathway towards achieving these goals.
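    As a rough illustration of the Shannon-entropy side of this kind of analysis, the sketch below computes a per-column entropy for a small multiple sequence alignment. The abstract does not describe the author's actual procedure, so the function names, the gap handling (gaps are ignored) and the toy alignment are all assumptions made for the example.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column.

    Assumption: gap characters ('-') are ignored rather than
    treated as an extra symbol.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(aligned_seqs):
    """Per-column entropies for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_seqs)]


# Hypothetical toy alignment, not taken from the paper.
toy_alignment = ["ACDE", "ACDF", "AGDG", "A-DH"]
entropies = alignment_entropies(toy_alignment)
```

    Fully conserved columns give an entropy of zero, while a column with four equally frequent residues gives two bits. Kolmogorov complexity is not computable in general and is usually approximated (for example by compression), so it is left out of this sketch.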

  16. Parameters of the proteome evolution from the distribution of sequence identities of paralogous proteins

    NASA Astrophysics Data System (ADS)

    Yan, Koon-Kiu; Axelsen, Jacob; Maslov, Sergei

    2006-03-01

    The evolution of the full repertoire of proteins encoded in a given genome is driven by gene duplications, deletions and modifications of amino-acid sequences of already existing proteins. The information about relative rates and other intrinsic parameters of these three basic processes is contained in the distribution of sequence identities of pairs of paralogous proteins. We introduced a simple mathematical framework that allows one to extract some of this hidden information. It was then applied to the proteome-wide set of paralogous proteins in H. pylori, E. coli, S. cerevisiae, C. elegans, D. melanogaster and H. sapiens. We estimated the stationary per-gene deletion and duplication rates and the distribution of amino-acid substitution rates for these organisms. The validity of our mathematical framework was further confirmed by numerical simulations of a simple evolutionary model of a fixed-size proteome.
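    The quantity at the centre of this record, the sequence identity of a pair of proteins, can be sketched as follows. This is only a minimal illustration: it assumes the two sequences are already aligned, it counts identity over columns where neither sequence has a gap (other denominators are also in common use), and the function name and example strings are hypothetical rather than taken from the paper.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two pre-aligned sequences.

    Assumption: columns containing a gap ('-') in either sequence
    are excluded from both the numerator and the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned pair: 4 matches over 5 gap-free columns.
identity = percent_identity("ACDE-G", "ACDFHG")
```

    Collecting this value over all paralogous pairs in a proteome would give the kind of identity distribution the abstract refers to; the rate-estimation framework itself is not described in enough detail here to reproduce.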

  17. CDvist: A webserver for identification and visualization of conserved domains in protein sequences

    DOE PAGESBeta

    Adebali, Ogun; Ortega, Davi R.; Zhulin, Igor B.

    2014-12-18

    Identification of domains in protein sequences allows their assigning to biological functions. Several webservers exist for identification of protein domains using similarity searches against various databases of protein domain models. However, none of them provides comprehensive domain coverage while allowing bulk querying and their visualization schemes can be improved. To address these issues, we developed CDvist (a comprehensive domain visualization tool), which combines the best available search algorithms and databases into a user-friendly framework. First, a given protein sequence is matched to domain models using high-specificity tools and only then unmatched segments are subjected to more sensitive algorithms resulting in a best possible comprehensive coverage. In conclusion, bulk querying and rich visualization and download options provide improved functionality to domain architecture analysis.

  18. CDvist: A webserver for identification and visualization of conserved domains in protein sequences

    SciTech Connect

    Adebali, Ogun; Ortega, Davi R.; Zhulin, Igor B.

    2014-12-18

    Identification of domains in protein sequences allows their assigning to biological functions. Several webservers exist for identification of protein domains using similarity searches against various databases of protein domain models. However, none of them provides comprehensive domain coverage while allowing bulk querying and their visualization schemes can be improved. To address these issues, we developed CDvist (a comprehensive domain visualization tool), which combines the best available search algorithms and databases into a user-friendly framework. First, a given protein sequence is matched to domain models using high-specificity tools and only then unmatched segments are subjected to more sensitive algorithms resulting in a best possible comprehensive coverage. In conclusion, bulk querying and rich visualization and download options provide improved functionality to domain architecture analysis.

  19. Properties and sequence of a female-specific, juvenile hormone-induced protein from locust hemolymph.

    PubMed

    Zhang, J; McCracken, A; Wyatt, G R

    1993-02-15

    In the fat body of Locusta migratoria, an RNA transcript of about 800 nucleotides has been detected that is specific to the adult female and dependent on induction by juvenile hormone (JH) or an analog. The corresponding cDNA has been cloned (lambda 21) and a 718-base pair sequence determined. It encodes a 196-amino acid polypeptide, including a signal peptide. An NH2-terminal sequence has 24 out of 28 amino acids identical with those of a previously described 19K locust hemolymph protein, but the remainder of the sequence shows no similarity. From adult female hemolymph, a 21-kDa protein, designated 21K protein, has been purified, with an NH2-terminal sequence exactly matching that deduced from clone lambda 21. This 21K protein is found only in the adult female, is dependent on induction by JH, and is assumed to represent the product of the lambda 21 gene. It shows no immunochemical cross-reaction with locust 19K protein, apolipophorin III, nor with vitellogenin (Vg). Its isoelectric point is pH 5.4; it contains some carbohydrate. 21K protein is synthesized in adult female fat body, accumulates in hemolymph, and is taken up into the developing oocytes in parallel with Vg. In locusts deprived of JH with precocene, production of 21K protein and of lambda 21-hybridizing transcripts is induced by the JH analog, methoprene, in parallel with Vg and its mRNA. Because of its sex-, stage-, and JH-dependent regulation, coordinate with Vg, the 21K protein will be valuable for analysis of gene expression. PMID:7679110

  20. How the Sequence of a Gene Specifies Structural Symmetry in Proteins

    PubMed Central

    Shen, Xiaojuan; Huang, Tongcheng; Wang, Guanyu; Li, Guanglin

    2015-01-01

    Internal symmetry is commonly observed in the majority of fundamental protein folds. Meanwhile, sufficient evidence suggests that nascent polypeptide chains of proteins have the potential to start the co-translational folding process and this process allows mRNA to contain additional information on protein structure. In this paper, we study the relationship between gene sequences and protein structures from the viewpoint of symmetry to explore how gene sequences code for structural symmetry in proteins. We found that, for a set of two-fold symmetric proteins from the left-handed beta-helix fold, intragenic symmetry always exists in their corresponding gene sequences. Meanwhile, codon usage bias and local mRNA structure might be involved in modulating translation speed for the formation of structural symmetry: a major decrease of local codon usage bias in the middle of the codon sequence can be identified as a common feature; and major or consecutive decreases in local mRNA folding energy near the boundaries of the symmetric substructures can also be observed. The results suggest that gene duplication and fusion may be an evolutionarily conserved process for this protein fold. In addition, the usage of rare codons and the formation of higher-order secondary structure near the boundaries of symmetric substructures might have coevolved as conserved mechanisms to slow down translation elongation and to facilitate effective folding of symmetric substructures. These findings provide valuable insights into our understanding of the mechanisms of translation and its evolution, as well as the design of proteins via symmetric modules. PMID:26641668

  1. Restriction of Nonpermissive RUNX3 Protein Expression in T Lymphocytes by the Kozak Sequence.

    PubMed

    Kim, Byungil; Sasaki, Yo; Egawa, Takeshi

    2015-08-15

    The transcription factor Runx3 promotes differentiation of naive CD4(+) T cells into type-1 effector T (TH1) cells at the expense of TH2. TH1 cells as well as CD8(+) T cells express a subset-specific Runx3 transcript from a distal promoter, which is necessary for high protein expression. However, all T cell subsets, including naive CD4(+) T cells and TH2 cells, express a distinct transcript of Runx3 that is derived from a proximal promoter and that produces functional protein in neurons. Therefore, accumulation of RUNX3 protein generated from the proximal transcript needs to be repressed at the posttranscriptional level to preserve CD4(+) T cell capability of differentiating into TH2 cells. In this article, we show that expression of RUNX3 protein from the proximal Runx3 transcript is blocked at the level of translational initiation in T cells. A coding sequence for the proximal Runx3 mRNA is preceded by a nonoptimal context sequence for translational initiation, known as the Kozak sequence, and thus generates protein at low efficiencies and with multiple alternative translational initiations. Editing the endogenous initiation context to an "optimal" Kozak sequence in a human T cell line resulted in enhanced translation of a single RUNX3 protein derived from the proximal transcript. Furthermore, RUNX3 protein represses transcription from the proximal promoter in T cells. These results suggest that nonpermissive expression of RUNX3 protein is restricted at the translational level, and that the repression is further enforced by a transcriptional regulation for maintenance of diverse developmental plasticity of T cells for different effector subsets. PMID:26170388

  2. Drosophila melanogaster mitochondrial DNA: completion of the nucleotide sequence and evolutionary comparisons.

    PubMed

    Lewis, D L; Farr, C L; Kaguni, L S

    1995-11-01

    The nucleotide sequence of the regions flanking the A+T region of Drosophila melanogaster mitochondrial DNA (mtDNA) has been determined. Included are the genes encoding the transfer RNAs for valine, isoleucine, glutamine and methionine, the small ribosomal RNA and the 5'-coding sequences of the large ribosomal RNA and NADH dehydrogenase subunit II. This completes the nucleotide sequence of the D. melanogaster mitochondrial genome. The circular mtDNA of D. melanogaster varies in size among different populations largely due to length differences in the control region (Fauron & Wolstenholme, 1976; Fauron & Wolstenholme, 1980a, b); the mtDNA region we have sequenced, combined with those sequenced by others, yields a composite genome that is 19,517 bp in length as compared to 16,019 bp for the mtDNA of D. yakuba. D. melanogaster mtDNA exhibits an extreme bias in base composition; it comprises 82.2% deoxyadenylate and thymidylate residues as compared to 78.6% in D. yakuba mtDNA. All genes encoded in the mtDNA of both species are in identical locations and orientations. Nucleotide substitution analysis reveals that tRNA and rRNA genes evolve at less than half the rate of protein coding genes.
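    The base-composition figures quoted above (for example 82.2% A+T) are simple fractions over the sequence. A minimal sketch of that calculation follows; the function name and the short example sequence are hypothetical, and ambiguous symbols such as N are assumed to be excluded from the count.

```python
def at_fraction(sequence):
    """Fraction of A and T among the unambiguous bases of a DNA sequence.

    Assumption: any symbol other than A, C, G or T (e.g. N) is skipped.
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "AT" for b in bases) / len(bases)


# Hypothetical fragment, not from the D. melanogaster mtDNA sequence.
example = at_fraction("AATTGCNN")
```

    Here four of the six counted bases are A or T, so the A+T fraction is two thirds.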

  3. De novo Sequencing, Characterization, and Comparison of Inflorescence Transcriptomes of Cornus canadensis and C. florida (Cornaceae)

    PubMed Central

    Zhang, Jian; Franks, Robert G.; Liu, Xiang; Kang, Ming; Keebler, Jonathan E. M.; Schaff, Jennifer E.; Huang, Hong-Wen; Xiang, Qiu-Yun (Jenny)

    2013-01-01

    Background: Transcriptome sequencing analysis is a powerful tool in molecular genetics and evolutionary biology. Here we report the results of de novo 454 sequencing, characterization, and comparison of inflorescence transcriptomes of two closely related dogwood species, Cornus canadensis and C. florida (Cornaceae). Our goals were to build a preliminary source of genome sequence data, and to identify genes potentially expressed differentially between the inflorescence transcriptomes for these important horticultural species. Results: The sequencing of cDNAs from inflorescence buds of C. canadensis (cc) and C. florida (cf), and normalized cDNAs from leaves of C. canadensis resulted in 251799 (ccBud), 96245 (ccLeaf) and 114648 (cfBud) raw reads, respectively. The de novo assembly of the high quality (HQ) reads resulted in 36088, 17802 and 21210 unigenes for ccBud, ccLeaf and cfBud. A reference transcriptome for C. canadensis was built by assembling HQ reads of ccBud and ccLeaf, containing 40884 unigenes. Reference mapping and comparative analyses found 10926 sequences were putatively specific to ccBud, and 6979 putatively specific to cfBud. Putative differentially expressed genes between ccBud and cfBud that are related to flower development and/or stress response were identified among 7718 sequences shared by ccBud and cfBud. Bi-directional BLAST found 87 (41.83% of 208) of Arabidopsis genes related to inflorescence development had putative orthologs in the dogwood transcriptomes. Comparisons of the sequences shared by ccBud and cfBud yielded 65931 high quality SNPs between the two species. The twenty unigenes with the most SNPs are listed as potential genetic markers for evolutionary studies. Conclusions: The data provide an important, although preliminary, information platform for functional genomics and evolutionary developmental biology in Cornus. The study identified putative candidates potentially involved in the genetic regulation of