Science.gov

Sample records for aligned nucleotide sequences

  1. [Tabular excel editor for analysis of aligned nucleotide sequences].

    PubMed

    Demkin, V V

    2010-01-01

    The Excel platform was used to convert multiple nucleotide sequence alignments obtained from the BLAST network service into a form suitable for visual analysis and editing. Two macros for MS Excel 2007 were written. Once transformed into an Excel table and processed with these macros, the array of aligned sequences is easier to analyze than the original HTML output.

  2. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations.

    PubMed

    Abascal, Federico; Zardoya, Rafael; Telford, Maximilian J

    2010-07-01

    We present TranslatorX, a web server designed to align protein-coding nucleotide sequences based on their corresponding amino acid translations. Many comparisons between biological sequences (nucleic acids and proteins) involve the construction of multiple alignments. Alignments represent a statement regarding the homology between individual nucleotides or amino acids within homologous genes. As protein-coding DNA sequences evolve as triplets of nucleotides (codons) and it is known that sequence similarity degrades more rapidly at the DNA than at the amino acid level, alignments are generally more accurate when based on amino acids than on their corresponding nucleotides. TranslatorX novelties include: (i) use of all documented genetic codes and the possibility of assigning different genetic codes for each sequence; (ii) a battery of different multiple alignment programs; (iii) translation of ambiguous codons when possible; (iv) an innovative criterion to clean nucleotide alignments with GBlocks based on protein information; and (v) a rich output, including Jalview-powered graphical visualization of the alignments, codon-based alignments coloured according to the corresponding amino acids, measures of compositional bias and first, second and third codon position specific alignments. The TranslatorX server is freely available at http://translatorx.co.uk.
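    The core idea in the abstract above, aligning at the amino acid level and then carrying that alignment back to the nucleotides, can be illustrated with a minimal sketch. This is not TranslatorX's own code: the function name `thread_codons` and the `-` gap symbol are assumptions made for illustration, and the sketch assumes the coding sequence has no frameshifts or stop codons.

```python
def thread_codons(aligned_protein: str, nucleotides: str, gap: str = "-") -> str:
    """Thread an unaligned coding sequence onto its aligned protein translation.

    Each residue in the protein alignment consumes one codon (three bases);
    each protein gap becomes a three-column nucleotide gap.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(nucleotides) != 3 * residues:
        raise ValueError("coding sequence length must be 3 x number of residues")

    columns = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            columns.append(gap * 3)
        else:
            columns.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(columns)


# Hypothetical example: protein row "M-KL" guiding the codons ATG AAA CTG.
print(thread_codons("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```

    Because gaps are always inserted in multiples of three, the reading frame of every sequence is preserved in the resulting nucleotide alignment.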

  3. WAViS server for handling, visualization and presentation of multiple alignments of nucleotide or amino acids sequences.

    PubMed

    Zika, Radek; Paces, Jan; Pavlícek, Adam; Paces, Václav

    2004-07-01

    Web Alignment Visualization Server contains a set of web-tools designed for quick generation of publication-quality color figures of multiple alignments of nucleotide or amino acids sequences. It can be used for identification of conserved regions and gaps within many sequences using only common web browsers. The server is accessible at http://wavis.img.cas.cz.

  4. Nucleotide sequence alignment of hdcA from Gram-positive bacteria.

    PubMed

    Diaz, Maria; Ladero, Victor; Redruello, Begoña; Sanchez-Llana, Esther; Del Rio, Beatriz; Fernandez, Maria; Martin, Maria Cruz; Alvarez, Miguel A

    2016-03-01

    The decarboxylation of histidine, carried out mainly by some Gram-positive bacteria, yields the toxic dietary biogenic amine histamine (Ladero et al. 2010 〈10.2174/157340110791233256〉 [1], Linares et al. 2016 〈http://dx.doi.org/10.1016/j.foodchem.2015.11.013〉 [2]). The reaction is catalyzed by a pyruvoyl-dependent histidine decarboxylase (Linares et al. 2011 〈10.1080/10408398.2011.582813〉 [3]), which is encoded by the gene hdcA. In order to locate conserved regions in the hdcA gene of Gram-positive bacteria, this article provides a nucleotide sequence alignment of all the hdcA sequences from Gram-positive bacteria present in databases. For further utility and discussion, see 〈http://dx.doi.org/10.1016/j.foodcont.2015.11.035〉 [4].

  5. Nucleotide sequence alignment of hdcA from Gram-positive bacteria

    PubMed Central

    Diaz, Maria; Ladero, Victor; Redruello, Begoña; Sanchez-Llana, Esther; del Rio, Beatriz; Fernandez, Maria; Martin, Maria Cruz; Alvarez, Miguel A.

    2016-01-01

    The decarboxylation of histidine, carried out mainly by some Gram-positive bacteria, yields the toxic dietary biogenic amine histamine (Ladero et al. 2010 〈10.2174/157340110791233256〉 [1], Linares et al. 2016 〈http://dx.doi.org/10.1016/j.foodchem.2015.11.013〉 [2]). The reaction is catalyzed by a pyruvoyl-dependent histidine decarboxylase (Linares et al. 2011 〈10.1080/10408398.2011.582813〉 [3]), which is encoded by the gene hdcA. In order to locate conserved regions in the hdcA gene of Gram-positive bacteria, this article provides a nucleotide sequence alignment of all the hdcA sequences from Gram-positive bacteria present in databases. For further utility and discussion, see 〈http://dx.doi.org/10.1016/j.foodcont.2015.11.035〉 [4]. PMID:26958625

  6. ANTICALIgN: visualizing, editing and analyzing combined nucleotide and amino acid sequence alignments for combinatorial protein engineering.

    PubMed

    Jarasch, Alexander; Kopp, Melanie; Eggenstein, Evelyn; Richter, Antonia; Gebauer, Michaela; Skerra, Arne

    2016-07-01

    ANTICALIgN is an interactive software developed to simultaneously visualize, analyze and modify alignments of DNA and/or protein sequences that arise during combinatorial protein engineering, design and selection. ANTICALIgN combines powerful functions known from currently available sequence analysis tools with unique features for protein engineering, in particular the possibility to display and manipulate nucleotide sequences and their translated amino acid sequences at the same time. ANTICALIgN offers both template-based multiple sequence alignment (MSA), using the unmutated protein as reference, and conventional global alignment, to compare sequences that share an evolutionary relationship. The application of similarity-based clustering algorithms facilitates the identification of duplicates or of conserved sequence features among a set of selected clones. Imported nucleotide sequences from DNA sequence analysis are automatically translated into the corresponding amino acid sequences and displayed, offering numerous options for selecting reading frames, highlighting of sequence features and graphical layout of the MSA. The MSA complexity can be reduced by hiding the conserved nucleotide and/or amino acid residues, thus putting emphasis on the relevant mutated positions. ANTICALIgN is also able to handle suppressed stop codons or even to incorporate non-natural amino acids into a coding sequence. We demonstrate crucial functions of ANTICALIgN in an example of Anticalins selected from a lipocalin random library against the fibronectin extradomain B (ED-B), an established marker of tumor vasculature. Apart from engineered protein scaffolds, ANTICALIgN provides a powerful tool in the area of antibody engineering and for directed enzyme evolution.

  7. Data in support of the discovery of alternative splicing variants of quail LEPR and the evolutionary conservation of qLEPRl by nucleotide and amino acid sequences alignment

    PubMed Central

    Wang, Dandan; Xu, Chunlin; Wang, Taian; Li, Hong; Li, Yanmin; Ren, Junxiao; Tian, Yadong; Li, Zhuanjian; Jiao, Yuping; Kang, Xiangtao; Liu, Xiaojun

    2015-01-01

    Leptin receptor (LEPR) belongs to the class I cytokine receptor superfamily, whose members share common structural features and signal transduction pathways. Although multiple LEPR isoforms derived from a single gene have been identified in mammals, isoforms other than the long LEPR have rarely been found in birds. Four alternative splicing variants of quail LEPR (qLEPR) were cloned and sequenced for the first time (Wang et al., 2015 [1]). To define the patterns of the four splicing variants (qLEPRl, qLEPR-a, qLEPR-b and qLEPR-c) and locate the conserved regions of qLEPRl, this data article provides a nucleotide sequence alignment of qLEPR and an amino acid sequence alignment of representative vertebrate LEPR. The detailed analysis is presented in [1]. PMID:26759819

  8. Pairwise Sequence Alignment Library

    SciTech Connect

    Jeff Daily, PNNL

    2015-05-20

    Vector extensions, such as SSE, have been part of the x86 CPU since the 1990s, with applications in graphics, signal processing, and scientific applications. Although many algorithms and applications can naturally benefit from automatic vectorization techniques, there are still many that are difficult to vectorize due to their dependence on irregular data structures, dense branch operations, or data dependencies. Sequence alignment, one of the most widely used operations in bioinformatics workflows, has a computational footprint that features complex data dependencies. The trend of widening vector registers adversely affects the state-of-the-art sequence alignment algorithm based on striped data layouts. Therefore, a novel SIMD implementation of a parallel scan-based sequence alignment algorithm that can better exploit wider SIMD units was implemented as part of the Parallel Sequence Alignment Library (parasail). Parasail features: Reference implementations of all known vectorized sequence alignment approaches. Implementations of Smith Waterman (SW), semi-global (SG), and Needleman Wunsch (NW) sequence alignment algorithms. Implementations across all modern CPU instruction sets including AVX2 and KNC. Language interfaces for C/C++ and Python.
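    For readers unfamiliar with the algorithms named above, the following is a plain scalar sketch of the Needleman-Wunsch (global) and Smith-Waterman (local) score recurrences. It does not use parasail or any SIMD technique; the function names, the default scores and the linear gap penalty are assumptions chosen to keep the example short.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned against gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a aligned against gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def sw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Local alignment score (Smith-Waterman): cells are clamped at zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

    Each cell depends on its left, upper and upper-left neighbours, which is the data dependency the abstract refers to when it describes the computation as hard to vectorize.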

  9. Pairwise Sequence Alignment Library

    2015-05-20

    Vector extensions, such as SSE, have been part of the x86 CPU since the 1990s, with applications in graphics, signal processing, and scientific applications. Although many algorithms and applications can naturally benefit from automatic vectorization techniques, there are still many that are difficult to vectorize due to their dependence on irregular data structures, dense branch operations, or data dependencies. Sequence alignment, one of the most widely used operations in bioinformatics workflows, has a computational footprint that features complex data dependencies. The trend of widening vector registers adversely affects the state-of-the-art sequence alignment algorithm based on striped data layouts. Therefore, a novel SIMD implementation of a parallel scan-based sequence alignment algorithm that can better exploit wider SIMD units was implemented as part of the Parallel Sequence Alignment Library (parasail). Parasail features: Reference implementations of all known vectorized sequence alignment approaches. Implementations of Smith Waterman (SW), semi-global (SG), and Needleman Wunsch (NW) sequence alignment algorithms. Implementations across all modern CPU instruction sets including AVX2 and KNC. Language interfaces for C/C++ and Python.

  10. Alignment of nucleotide or amino acid sequences on microcomputers, using a modification of Sellers' (1974) algorithm which avoids the need for calculation of the complete distance matrix.

    PubMed

    Tyson, H; Haley, B

    1985-10-01

    A program to calculate optimum alignment between two sequences, which may be DNA, amino acid or other information, has been written in PASCAL. The Sellers' algorithm for calculating distance between sequences has been modified to reduce its demands on microcomputer memory space by more than half. Gap penalties and mismatch scores are user-adjustable. In 48 K of memory the program aligns sequences up to 170 elements in length; optimum alignment and total distance between a pair of sequences are displayed. The program aligns longer sequences by subdivision of both sequences into corresponding, overlapping sections. Section length and amount of section overlap are user-defined. More importantly, extension of this modification of Sellers' algorithm to align longer sequences, given hardware and compilers/languages capable of using a larger memory space (e.g. 640 K), shows that it is now possible to align, without subdivision, sequences with up to 700 elements each. The increase in computation time for this program with increasing sequence lengths aligned without subdivision is curvilinear, but total times are essentially dependent on hardware/language/compiler combinations. The statistical significance of an alignment is examined with conventional Monte Carlo approaches. PMID:3852712
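    The abstract does not spell out how the authors avoid computing the complete distance matrix. One common memory-saving approach, shown here only as an illustration and not as the authors' modification, is to keep two rows of the matrix at a time. The function name `two_row_distance` and the unit default costs are assumptions; this sketch returns the distance only, not the alignment itself.

```python
def two_row_distance(a: str, b: str, mismatch: int = 1, gap: int = 1) -> int:
    """Edit-style distance between two sequences using O(min(len)) memory."""
    if len(b) > len(a):
        a, b = b, a  # put the shorter sequence along the row to save memory

    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = min(substitute, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(two_row_distance("kitten", "sitting"))  # 3
```

    With adjustable `mismatch` and `gap` arguments, the sketch mirrors the user-adjustable penalties mentioned in the abstract.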

  11. The EMBL nucleotide sequence database.

    PubMed Central

    Stoesser, G; Moseley, M A; Sleep, J; McGowran, M; Garcia-Pastor, M; Sterk, P

    1998-01-01

    The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl.html) constitutes Europe's primary nucleotide sequence resource. DNA and RNA sequences are directly submitted from researchers and genome sequencing groups and collected from the scientific literature and patent applications (Fig. 1). In collaboration with DDBJ and GenBank the database is produced, maintained and distributed at the European Bioinformatics Institute. Database releases are produced quarterly and are distributed on CD-ROM. EBI's network services allow access to the most up-to-date data collection via Internet and World Wide Web interface, providing database searching and sequence similarity facilities plus access to a large number of additional databases. PMID:9399791

  12. Automated Identification of Nucleotide Sequences

    NASA Technical Reports Server (NTRS)

    Osman, Shariff; Venkateswaran, Kasthuri; Fox, George; Zhu, Dian-Hui

    2007-01-01

    STITCH is a computer program that processes raw nucleotide-sequence data to automatically remove unwanted vector information, perform reverse-complement comparison, stitch shorter sequences together to make longer ones to which the shorter ones presumably belong, and search against the user's choice of private and Internet-accessible public 16S rRNA databases. ["16S rRNA" denotes a ribosomal ribonucleic acid (rRNA) sequence that is common to all organisms.] In STITCH, a template 16S rRNA sequence is used to position forward and reverse reads. STITCH then automatically searches known 16S rRNA sequences in the user's chosen database(s) to find the sequence most similar to (the sequence that lies at the smallest edit distance from) each spliced sequence. The result of processing by STITCH is the identification of the most similar well-described bacterium. Whereas previously commercially available software for analyzing genetic sequences operates on one sequence at a time, STITCH can manipulate multiple sequences simultaneously to perform the aforementioned operations. A typical analysis of several dozen sequences (length of the order of 10^3 base pairs) by use of STITCH is completed in a few minutes, whereas such an analysis performed by use of prior software takes hours or days.
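    Two of the operations named above, reverse-complementing a read and joining a forward and a reverse read, can be sketched as follows. This is not STITCH's code: STITCH positions reads against a template 16S rRNA sequence, whereas this simplified illustration joins reads by an exact suffix/prefix overlap. The function names and the `min_overlap` parameter are assumptions.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def stitch(forward: str, reverse_read: str, min_overlap: int = 4) -> Optional[str]:
    """Join a forward read and a reverse read via their longest exact overlap.

    The reverse read is reverse-complemented first so that both reads are on
    the same strand. Returns None if no overlap of at least min_overlap exists.
    """
    rc = reverse_complement(reverse_read)
    for k in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-k:] == rc[:k]:
            return forward + rc[k:]
    return None


# Hypothetical reads that share the four-base overlap TGCA.
print(stitch("ACGTTGCA", "GACCTGCA"))  # ACGTTGCAGGTC
```

    Real reads contain sequencing errors, so a practical tool would tolerate mismatches in the overlap rather than require an exact match.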

  13. Multiple sequence alignment with DIALIGN.

    PubMed

    Morgenstern, Burkhard

    2014-01-01

    DIALIGN is a software tool for multiple sequence alignment that combines global and local alignment features. It composes multiple alignments from local pairwise sequence similarities. This approach is particularly useful to discover conserved functional regions in sequences that share only local homologies but are otherwise unrelated. An anchoring option allows external information and expert knowledge to be used in addition to primary-sequence similarity alone. The latest version of DIALIGN optionally uses matches to the PFAM database to detect weak homologies. Various versions of the program are available through Göttingen Bioinformatics Compute Server (GOBICS) at http://www.gobics.de/department/software.

  14. Pareto optimal pairwise sequence alignment.

    PubMed

    DeRonne, Kevin W; Karypis, George

    2013-01-01

    Sequence alignment using evolutionary profiles is a commonly employed tool when investigating a protein. Many profile-profile scoring functions have been developed for use in such alignments, but there has not yet been a comprehensive study of Pareto optimal pairwise alignments for combining multiple such functions. We show that the problem of generating Pareto optimal pairwise alignments has an optimal substructure property, and develop an efficient algorithm for generating Pareto optimal frontiers of pairwise alignments. All possible sets of two, three, and four profile scoring functions are used from a pool of 11 functions and applied to 588 pairs of proteins in the ce_ref data set. The performance of the best objective combinations on ce_ref is also evaluated on an independent set of 913 protein pairs extracted from the BAliBASE RV11 data set. Our dynamic-programming-based heuristic approach produces approximated Pareto optimal frontiers of pairwise alignments that contain comparable alignments to those on the exact frontier, but on average in less than 1/58th the time in the case of four objectives. Our results show that the Pareto frontiers contain alignments whose quality is better than the alignments obtained by single objectives. However, the task of identifying a single high-quality alignment among those in the Pareto frontier remains challenging.
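    The notion of a Pareto optimal frontier used above can be made concrete with a small sketch. It filters a list of score vectors, one per candidate alignment and one entry per objective, and is not the dynamic-programming algorithm of the paper. The function names are assumptions, and higher scores are assumed to be better for every objective.

```python
from typing import List, Sequence, Tuple


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """True if u is at least as good as v everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(u, v)) and any(x > y for x, y in zip(u, v))


def pareto_front(scores: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the score vectors that no other vector dominates."""
    return [s for s in scores if not any(dominates(t, s) for t in scores)]


# Hypothetical scores of five alignments under two scoring functions.
candidates = [(3, 1), (2, 2), (1, 3), (1, 1), (2, 1)]
print(pareto_front(candidates))  # [(3, 1), (2, 2), (1, 3)]
```

    The frontier usually contains several alignments, which is why, as the abstract notes, choosing a single one among them remains a separate problem.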

  15. Nucleotide sequences encoding a thermostable alkaline protease

    DOEpatents

    Wilson, David B.; Lao, Guifang

    1998-01-01

    Nucleotide sequences, derived from a thermophilic actinomycete microorganism, which encode a thermostable alkaline protease are disclosed. Also disclosed are variants of the nucleotide sequences which encode a polypeptide having thermostable alkaline proteolytic activity. Recombinant thermostable alkaline protease or recombinant polypeptide may be obtained by culturing in a medium a host cell genetically engineered to contain and express a nucleotide sequence according to the present invention, and recovering the recombinant thermostable alkaline protease or recombinant polypeptide from the culture medium.

  16. Nucleotide sequences encoding a thermostable alkaline protease

    DOEpatents

    Wilson, D.B.; Lao, G.

    1998-01-06

    Nucleotide sequences, derived from a thermophilic actinomycete microorganism, which encode a thermostable alkaline protease are disclosed. Also disclosed are variants of the nucleotide sequences which encode a polypeptide having thermostable alkaline proteolytic activity. Recombinant thermostable alkaline protease or recombinant polypeptide may be obtained by culturing in a medium a host cell genetically engineered to contain and express a nucleotide sequence according to the present invention, and recovering the recombinant thermostable alkaline protease or recombinant polypeptide from the culture medium. 3 figs.

  17. Long-range correlations in nucleotide sequences

    NASA Astrophysics Data System (ADS)

    Peng, C.-K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Sciortino, F.; Simons, M.; Stanley, H. E.

    1992-03-01

    DNA sequences have been analysed using models, such as an n-step Markov chain, that incorporate the possibility of short-range nucleotide correlations. We propose here a method for studying the stochastic properties of nucleotide sequences by constructing a 1:1 map of the nucleotide sequence onto a walk, which we term a 'DNA walk'. We then use the mapping to provide a quantitative measure of the correlation between nucleotides over long distances along the DNA chain. Thus we uncover in the nucleotide sequence a remarkably long-range power law correlation that implies a new scale-invariant property of DNA. We find such long-range correlations in intron-containing genes and in nontranscribed regulatory DNA sequences, but not in complementary DNA sequences or intron-less genes.

  18. Long-range correlations in nucleotide sequences

    NASA Technical Reports Server (NTRS)

    Peng, C. K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Sciortino, F.; Simons, M.; Stanley, H. E.

    1992-01-01

    DNA sequences have been analysed using models, such as an n-step Markov chain, that incorporate the possibility of short-range nucleotide correlations. We propose here a method for studying the stochastic properties of nucleotide sequences by constructing a 1:1 map of the nucleotide sequence onto a walk, which we term a 'DNA walk'. We then use the mapping to provide a quantitative measure of the correlation between nucleotides over long distances along the DNA chain. Thus we uncover in the nucleotide sequence a remarkably long-range power law correlation that implies a new scale-invariant property of DNA. We find such long-range correlations in intron-containing genes and in nontranscribed regulatory DNA sequences, but not in complementary DNA sequences or intron-less genes.
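    A 'DNA walk' of the kind described above can be sketched in a few lines. The abstract does not state the step rule, so the rule used here (one step up for a pyrimidine, C or T, and one step down for a purine, A or G) is an assumption, as is the function name `dna_walk`.

```python
from typing import List


def dna_walk(seq: str) -> List[int]:
    """Map a nucleotide sequence onto a one-dimensional walk.

    Assumed rule: pyrimidines (C, T) step +1, purines (A, G) step -1.
    The returned list holds the walker's position after each nucleotide.
    """
    position = 0
    walk = []
    for base in seq.upper():
        if base in "CT":
            position += 1
        elif base in "AG":
            position -= 1
        else:
            raise ValueError(f"unexpected symbol: {base!r}")
        walk.append(position)
    return walk


print(dna_walk("CTAG"))  # [1, 2, 1, 0]
```

    The correlation analysis in the abstract then examines how the fluctuations of such a walk grow with the length of the window over which displacements are measured.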

  19. Global Alignment System for Large Genomic Sequencing

    2002-03-01

    AVID is a global alignment system tailored for the alignment of large genomic sequences up to megabases in length. Features include the possibility of one sequence being in draft form, fast alignment, robustness and accuracy. The method is an anchor based alignment using maximal matches derived from suffix trees.

  20. Moss Phylogeny Reconstruction Using Nucleotide Pangenome of Complete Mitogenome Sequences.

    PubMed

    Goryunov, D V; Nagaev, B E; Nikolaev, M Yu; Alexeevski, A V; Troitsky, A V

    2015-11-01

    Stability of composition and sequence of genes was shown earlier in 13 mitochondrial genomes of mosses (Rensing, S. A., et al. (2008) Science, 319, 64-69). It is of interest to study the evolution of mitochondrial genomes not only at the gene level, but also on the level of nucleotide sequences. To do this, we have constructed a "nucleotide pangenome" for mitochondrial genomes of 24 moss species. The nucleotide pangenome is a set of aligned nucleotide sequences of orthologous genome fragments covering the totality of all genomes. The nucleotide pangenome was constructed using specially developed new software, NPG-explorer (NPGe). The stable part of the mitochondrial genome (232 stable blocks) is shown to be, on average, 45% of its length. In the joint alignment of stable blocks, 82% of positions are conserved. The phylogenetic tree constructed with the NPGe program is in good correlation with other phylogenetic reconstructions. With the NPGe program, 30 blocks have been identified with repeats no shorter than 50 bp. The maximal length of a block with repeats is 140 bp. Duplications in the mitochondrial genomes of mosses are rare. On average, the genome contains about 500 bp in large duplications. The total length of insertions and deletions was determined in each genome. The losses and gains of DNA regions are rather active in mitochondrial genomes of mosses, and such rearrangements presumably can be used as additional markers in the reconstruction of phylogeny. PMID:26615445

  1. Nucleotide Sequence-Based Multitarget Identification

    PubMed Central

    Vinayagamoorthy, T.; Mulatz, Kirk; Hodkinson, Roger

    2003-01-01

    MULTIGEN technology (T. Vinayagamoorthy, U.S. patent 6,197,510, March 2001) is a modification of conventional sequencing technology that generates a single electropherogram consisting of short nucleotide sequences from a mixture of known DNA targets. The target sequences may be present on the same or different nucleic acid molecules. For example, when two DNA targets are sequenced, the first and second sequencing primers are annealed to their respective target sequences, and then a polymerase causes chain extension by the addition of new deoxyribose nucleotides. Since the electrophoretic separation depends on the relative molecular weights of the truncated molecules, the molecular weight of the second sequencing primer was specifically designed to be higher than the combined molecular weight of the first sequencing primer plus the molecular weight of the largest truncated molecule generated from the first target sequence. Thus, the series of truncated molecules produced by the second sequencing primer will have higher molecular weights than those produced by the first sequencing primer. Hence, the truncated molecules produced by these two sequencing primers can be effectively separated in a single lane by standard gel electrophoresis in a single electropherogram without any overlapping of the nucleotide sequences. By using sequencing primers with progressively higher molecular weights, multiple short DNA sequences from a variety of targets can be determined simultaneously. We describe here the basic concept of MULTIGEN technology and three applications: detection of sexually transmitted pathogens (Neisseria gonorrhoeae, Chlamydia trachomatis, and Ureaplasma urealyticum), detection of contaminants in meat samples (coliforms, fecal coliforms, and Escherichia coli O157:H7), and detection of single-nucleotide polymorphisms in the human N-acetyltransferase (NAT1) gene (S. Fronhoffs et al., Carcinogenesis 22:1405-1412, 2001). PMID:12843076

  2. R3D Align web server for global nucleotide to nucleotide alignments of RNA 3D structures

    PubMed Central

    Rahrig, Ryan R.; Petrov, Anton I.; Leontis, Neocles B.; Zirbel, Craig L.

    2013-01-01

    The R3D Align web server provides online access to ‘RNA 3D Align’ (R3D Align), a method for producing accurate nucleotide-level structural alignments of RNA 3D structures. The web server provides a streamlined and intuitive interface, input data validation and output that is more extensive and easier to read and interpret than related servers. The R3D Align web server offers a unique Gallery of Featured Alignments, providing immediate access to pre-computed alignments of large RNA 3D structures, including all ribosomal RNAs, as well as guidance on effective use of the server and interpretation of the output. By accessing the non-redundant lists of RNA 3D structures provided by the Bowling Green State University RNA group, R3D Align connects users to structure files in the same equivalence class and the best-modeled representative structure from each group. The R3D Align web server is freely accessible at http://rna.bgsu.edu/r3dalign/. PMID:23716643

  3. Simultaneous Alignment and Folding of Protein Sequences

    PubMed Central

    Waldispühl, Jérôme; O'Donnell, Charles W.; Will, Sebastian; Devadas, Srinivas; Backofen, Rolf

    2014-01-01

    Accurate comparative analysis tools for low-homology proteins remain a difficult challenge in computational biology, especially for sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm's complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane β-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise and multiple sequence alignment tools in the most difficult low-sequence homology case. It also improves secondary structure prediction where current approaches fail. Importantly, partiFold-Align requires no prior training. These general techniques are widely applicable to many more protein families (partiFold-Align is available at http://partifold.csail.mit.edu/). PMID:24766258

  4. Complete Nucleotide Sequence of Tn10

    PubMed Central

    Chalmers, Ronald; Sewitz, Sven; Lipkow, Karen; Crellin, Paul

    2000-01-01

    The complete nucleotide sequence of Tn10 has been determined. The dinucleotide signature and percent G+C of the sequence had no discontinuities, indicating that Tn10 constitutes a homogeneous unit. The new sequence contained three new open reading frames corresponding to a glutamate permease, repressors of heavy metal resistance operons, and a hypothetical protein in Bacillus subtilis. The glutamate permease was fully functional when expressed, but Tn10 did not protect Escherichia coli from the toxic effects of various metals. PMID:10781570
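    The percent G+C measure mentioned above is simple to compute, and scanning it over sliding windows is one way to look for compositional discontinuities. The sketch below is only an illustration under that assumption; the function names, window size and step are hypothetical and are not taken from the paper.

```python
from typing import List


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a non-empty sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> List[float]:
    """Percent G+C in successive windows along the sequence."""
    return [
        gc_percent(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


# Hypothetical toy sequence with an abrupt change in composition.
print(windowed_gc("AAAAGGGG", window=4, step=4))  # [0.0, 100.0]
```

    A sequence acquired from a single source would be expected to give a fairly flat profile, whereas a sharp jump between adjacent windows hints at a region of different origin.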

  5. Comparing compressed sequences for faster nucleotide BLAST searches.

    PubMed

    Cameron, Michael; Williams, Hugh E

    2007-01-01

    Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: BLASTP for searching protein collections and BLASTN for nucleotide collections. Surprisingly, BLASTN has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 BLAST paper and no exact description has been published. It is important that BLASTN is state-of-the-art: Nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and they take many minutes to search on modern general purpose workstations. This paper proposes significant improvements to the BLASTN algorithms. Each of our schemes is based on compressed bytepacked formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of BLASTN with no effect on accuracy and have been integrated into our new version of BLAST that is freely available for download from http://www.fsa-blast.org/. PMID:17666756
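    The general idea of comparing byte-packed sequences four bases at a time with a lookup table can be sketched as follows. This is not the paper's implementation: the 2-bit code assignment, the function names and the requirement that sequence lengths be multiples of four are simplifying assumptions.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def pack(seq: str) -> bytes:
    """Pack a DNA sequence into bytes, four bases per byte (first base in the high bits)."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4 in this sketch")
    packed = bytearray()
    for start in range(0, len(seq), 4):
        byte = 0
        for base in seq[start:start + 4]:
            byte = (byte << 2) | CODE[base]
        packed.append(byte)
    return bytes(packed)


# MATCHES[x] = number of 2-bit fields of x that are zero. When x is the XOR of
# two packed bytes, a zero field means the two bases at that position agree.
MATCHES = [
    sum(((x >> shift) & 0b11) == 0 for shift in (0, 2, 4, 6))
    for x in range(256)
]


def count_matches(packed_a: bytes, packed_b: bytes) -> int:
    """Count identical bases between two packed sequences, one table lookup per byte."""
    return sum(MATCHES[x ^ y] for x, y in zip(packed_a, packed_b))


print(count_matches(pack("ACGTACGT"), pack("ACGAACGT")))  # 7
```

    One XOR and one table lookup replace four separate base comparisons, and the collection sequences never have to be unpacked.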

  6. Comparing compressed sequences for faster nucleotide BLAST searches.

    PubMed

    Cameron, Michael; Williams, Hugh E

    2007-01-01

    Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: BLASTP for searching protein collections and BLASTN for nucleotide collections. Surprisingly, BLASTN has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 BLAST paper and no exact description has been published. It is important that BLASTN is state-of-the-art: Nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and they take many minutes to search on modern general purpose workstations. This paper proposes significant improvements to the BLASTN algorithms. Each of our schemes is based on compressed bytepacked formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of BLASTN with no effect on accuracy and have been integrated into our new version of BLAST that is freely available for download from http://www.fsa-blast.org/.

  7. The nucleotide sequence of cloned wheat dwarf virus DNA

    PubMed Central

    MacDowell, S. W.; Macdonald, H.; Hamilton, W. D. O.; Coutts, R. H. A.; Buck, K. W.

    1985-01-01

    Restriction analysis and cloning of virus-specific double-stranded DNA isolated from plants infected with wheat dwarf virus (WDV) indicated that the virus genome, like that of maize streak virus (MSV), consists of a single DNA circle. The complete nucleotide sequence of cloned WDV DNA (2749 nucleotides) has been determined. Comparison of the potential coding regions in WDV DNA with those in the DNA of two strains of MSV suggests that these viruses encode at least two functional proteins, the coat protein read in the virion (+) DNA sense and a composite protein, formed from two open reading regions, in the complementary (−) DNA sense. Although WDV and MSV are serologically unrelated their coat proteins showed 35% direct amino acid sequence and their DNAs showed 46% nucleotide sequence homology. There was too little homology between the DNAs of WDV and those of two geminiviruses with bipartite genomes, cassava latent virus (CLV) and tomato golden mosaic virus (TGMV), to align the sequences. However, comparison of the amino acid sequences of predicted proteins of WDV, MSV, TGMV and CLV revealed clear relationships between these viruses and suggested that the monopartite and the bipartite geminiviruses have a common ancestral origin. Four inverted repeat sequences which have the potential to form hairpin structures of ΔG ≥ -14 kcal/mol were detected in WDV DNA. The sequence TAATATTAC present in the loop of one of these hairpins is conserved in similar putative structures in MSV DNA and in both DNA components of CLV and TGMV and may function as a recognition sequence for a protein involved in virus DNA replication. PMID:15938050

  8. GASSST: global alignment short sequence search tool

    PubMed Central

    Rizk, Guillaume; Lavenier, Dominique

    2010-01-01

    Motivation: The rapid development of next-generation sequencing technologies able to produce huge amounts of sequence data is leading to a wide range of new applications. This triggers the need for fast and accurate alignment software. Common techniques often restrict indels in the alignment to improve speed, whereas more flexible aligners are too slow for large-scale applications. Moreover, many current aligners are becoming inefficient as generated reads grow ever larger. Our goal with our new aligner GASSST (Global Alignment Short Sequence Search Tool) is thus 2-fold—achieving high performance with no restrictions on the number of indels with a design that is still effective on long reads. Results: We propose a new efficient filtering step that discards most alignments coming from the seed phase before they are checked by the costly dynamic programming algorithm. We use a carefully designed series of filters of increasing complexity and efficiency to quickly eliminate most candidate alignments in a wide range of configurations. The main filter uses a precomputed table containing the alignment score of short four base words aligned against each other. This table is reused several times by a new algorithm designed to approximate the score of the full dynamic programming algorithm. We compare the performance of GASSST against BWA, BFAST, SSAHA2 and PASS. We found that GASSST achieves high sensitivity in a wide range of configurations and faster overall execution time than other state-of-the-art aligners. Availability: GASSST is distributed under the CeCILL software license at http://www.irisa.fr/symbiose/projects/gassst/ Contact: guillaume.rizk@irisa.fr; dominique.lavenier@irisa.fr Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20739310

  9. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

    PubMed

    Katoh, Kazutaka; Standley, Daron M

    2013-04-01

    We report a major update of the MAFFT multiple sequence alignment program. This version has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update. This report shows actual examples to explain how these features work, alone and in combination. Some examples incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and our ongoing efforts to overcome such limitations.

  10. Two Hybrid Algorithms for Multiple Sequence Alignment

    NASA Astrophysics Data System (ADS)

    Naznin, Farhana; Sarker, Ruhul; Essam, Daryl

    2010-01-01

    In order to design life-saving drugs, such as cancer drugs, the design of protein or DNA structures has to be accurate. These structures depend on Multiple Sequence Alignment (MSA). MSA is used to find the accurate structure of protein and DNA sequences from existing approximately correct sequences. To overcome the overly greedy nature of the well-known global progressive alignment method for multiple sequence alignment, we propose two different algorithms in this paper: one uses an iterative approach with a progressive alignment method (PAMIM) and the other uses a genetic algorithm with a progressive alignment method (PAMGA). Both methods start with a k-mer distance table to generate a single guide tree. In the iterative approach, we introduce two new techniques: the first generates guide trees with randomly selected sequences and the second shuffles the sequences inside that tree. The output of the tree is a multiple sequence alignment, which is evaluated by the Sum of Pairs Method (SPM) using the real-valued data from PAM250. In the second, GA-based approach, these two techniques are used to generate an initial population, and two different genetic operators are implemented for crossover and mutation. To test the performance of our two algorithms, we compared them with well-known existing methods (T-Coffee, MUSCLE, MAFFT and ProbCons) using BAliBASE benchmarks. The experimental results show that the first algorithm works well in some situations where other existing methods have difficulty obtaining better solutions. The second method works well compared to the existing methods in all situations and performs better than the first one.
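
    A sum-of-pairs score of the kind used to evaluate the alignments above can be sketched as follows. The match, mismatch and gap values are placeholders chosen for illustration; the abstract scores residue pairs with PAM250, which is not reproduced here.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score a multiple alignment given as equal-length strings ('-' = gap).

    The default scores are illustrative assumptions, not PAM250 values.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # a gap paired with a gap contributes nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total
```

    Swapping the three constants for a lookup into a substitution matrix would bring the sketch closer to the scheme named in the abstract.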

  11. Robust temporal alignment of multimodal cardiac sequences

    NASA Astrophysics Data System (ADS)

    Perissinotto, Andrea; Queirós, Sandro; Morais, Pedro; Baptista, Maria J.; Monaghan, Mark; Rodrigues, Nuno F.; D'hooge, Jan; Vilaça, João. L.; Barbosa, Daniel

    2015-03-01

    Given the dynamic nature of cardiac function, correct temporal alignment of pre-operative models and intraoperative images is crucial for augmented reality in cardiac image-guided interventions. As such, the current study focuses on the development of an image-based strategy for temporal alignment of multimodal cardiac imaging sequences, such as cine Magnetic Resonance Imaging (MRI) or 3D Ultrasound (US). First, we derive a robust, modality-independent signal from the image sequences, estimated by computing the normalized cross-correlation between each frame in the temporal sequence and the end-diastolic frame. This signal is a surrogate for the left-ventricle (LV) volume curve over time, whose variation indicates different temporal landmarks of the cardiac cycle. We then perform the temporal alignment of these surrogate signals derived from MRI and US sequences of the same patient through Dynamic Time Warping (DTW), allowing both sequences to be synchronized. The proposed framework was evaluated in 98 patients who had undergone both 3D+t MRI and US scans. The end-systolic frame could be accurately estimated as the minimum of the image-derived surrogate signal, presenting a relative error of 1.6 +/- 1.9% and 4.0 +/- 4.2% for the MRI and US sequences, respectively, thus supporting its association with key temporal instants of the cardiac cycle. The use of DTW reduces the desynchronization of the cardiac events in MRI and US sequences, allowing multimodal cardiac imaging sequences to be temporally aligned. Overall, a generic, fast and accurate method for temporal synchronization of MRI and US sequences of the same patient was introduced. This approach could be straightforwardly used for the correct temporal alignment of pre-operative MRI information and intra-operative US images.
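
    The Dynamic Time Warping step can be illustrated with a textbook sketch on two one-dimensional signals. The absolute-difference cost and the unconstrained step pattern are assumptions; the abstract does not state which DTW variant or constraints the authors used.

```python
def dtw(a, b):
    """Return (total cost, warping path) for two non-empty 1-D signals.

    Classic DTW with absolute difference as the local cost (an assumption).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    # Trace back from the end to recover which samples are paired.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

    Each `(i, j)` pair in the returned path says that sample `i` of the first signal is matched to sample `j` of the second, which is the information needed to synchronize two frame sequences.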

  12. MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons

    PubMed Central

    Ranwez, Vincent; Harispe, Sébastien; Delsuc, Frédéric; Douzery, Emmanuel J. P.

    2011-01-01

    Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translation before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment. We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. The resulting pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence. MACSE is distributed as an open-source java file executable with freely available source code and can be used via a web interface at: http://mbb.univ-montp2.fr/macse. PMID:21949676
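
    For reference, the classical Needleman-Wunsch algorithm mentioned above can be sketched as follows. This is the plain nucleotide-level version with a linear gap penalty and illustrative scores; it does not include the frameshift and stop-codon handling that MACSE adds.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment of s and t; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning s[i-1] with t[j-1].
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to build one optimal alignment.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))
```

    Both time and memory grow with the product of the two sequence lengths, which is the complexity the abstract says the MACSE pairwise algorithm preserves.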

  13. [Evolution of non-coding nucleotide sequences in Newcastle disease virus genomes].

    PubMed

    Xu, Huaiying; Qin, Zhuoming; Qi, Lihong; Zhang, Wei; Wang, Youling; Liu, Jinhua

    2014-09-01

    [OBJECTIVE] Although the coding genes of Newcastle disease virus (NDV) have been studied extensively, few papers address its non-coding sequences. In this paper, the evolutionary tendency of the non-coding sequences was studied. [METHODS] The genome of NDV strain LC12, isolated in 2012 from a duck with egg drop syndrome, and the genome cDNA sequences of 35 other strains of different NDV genotypes obtained from GenBank were analysed. Nucleotide homology, nucleotide alignment and phylogenetic tree analyses were applied to the leader sequences, trailer sequences, intergenic sequences (IGS), and the coding genes between the 5' and 3' UTRs, respectively. [RESULTS] The location and length of the non-coding sequences are highly conserved, and the variation trend of the non-coding sequences is synchronous with that of the entire genomes and coding genes. [CONCLUSION] Across the NDV genome, the molecular variation of the coding genes was indistinguishable from that of the non-coding sequences. PMID:25522596

  14. DNA sequence alignment by microhomology sampling during homologous recombination

    PubMed Central

    Qi, Zhi; Redding, Sy; Lee, Ja Yil; Gibb, Bryan; Kwon, YoungHo; Niu, Hengyao; Gaines, William A.; Sung, Patrick

    2015-01-01

    Summary Homologous recombination (HR) mediates the exchange of genetic information between sister or homologous chromatids. During HR, members of the RecA/Rad51 family of recombinases must somehow search through vast quantities of DNA sequence to align and pair ssDNA with a homologous dsDNA template. Here we use single-molecule imaging to visualize Rad51 as it aligns and pairs homologous DNA sequences in real-time. We show that Rad51 uses a length-based recognition mechanism while interrogating dsDNA, enabling robust kinetic selection of 8-nucleotide (nt) tracts of microhomology, which kinetically confines the search to sites with a high probability of being a homologous target. Successful pairing with a 9th nucleotide coincides with an additional reduction in binding free energy and subsequent strand exchange occurs in precise 3-nt steps, reflecting the base triplet organization of the presynaptic complex. These findings provide crucial new insights into the physical and evolutionary underpinnings of DNA recombination. PMID:25684365

  15. Viewing multiple sequence alignments with the JavaScript Sequence Alignment Viewer (JSAV).

    PubMed

    Martin, Andrew C R

    2014-01-01

    The JavaScript Sequence Alignment Viewer (JSAV) is designed as a simple-to-use JavaScript component for displaying sequence alignments on web pages. The display of sequences is highly configurable with options to allow alternative coloring schemes, sorting of sequences and 'dotifying' repeated amino acids. An option is also available to submit selected sequences to another web site, or to other JavaScript code. JSAV is implemented purely in JavaScript making use of the JQuery and JQuery-UI libraries. It does not use any HTML5-specific options to help with browser compatibility. The code is documented using JSDOC and is available from http://www.bioinf.org.uk/software/jsav/. PMID:25653836

  17. DUC-Curve, a highly compact 2D graphical representation of DNA sequences and its application in sequence alignment

    NASA Astrophysics Data System (ADS)

    Li, Yushuang; Liu, Qian; Zheng, Xiaoqi

    2016-08-01

    A highly compact and simple 2D graphical representation of DNA sequences, named DUC-Curve, is constructed through mapping four nucleotides to a unit circle with a cyclic order. DUC-Curve could directly detect nucleotide, di-nucleotide compositions and microsatellite structure from DNA sequences. Moreover, it also could be used for DNA sequence alignment. Taking geometric center vectors of DUC-Curves as sequence descriptor, we perform similarity analysis on the first exons of β-globin genes of 11 species, oncogene TP53 of 27 species and twenty-four Influenza A viruses, respectively. The obtained reasonable results illustrate that the proposed method is very effective in sequence comparison problems, and will at least play a complementary role in classification and clustering problems.

  18. Reading biological processes from nucleotide sequences

    NASA Astrophysics Data System (ADS)

    Murugan, Anand

    Cellular processes have traditionally been investigated by techniques of imaging and biochemical analysis of the molecules involved. The recent rapid progress in our ability to manipulate and read nucleic acid sequences gives us direct access to the genetic information that directs and constrains biological processes. While sequence data is being used widely to investigate genotype-phenotype relationships and population structure, here we use sequencing to understand biophysical mechanisms. We present work on two different systems. First, in chapter 2, we characterize the stochastic genetic editing mechanism that produces diverse T-cell receptors in the human immune system. We do this by inferring statistical distributions of the underlying biochemical events that generate T-cell receptor coding sequences from the statistics of the observed sequences. This inferred model quantitatively describes the potential repertoire of T-cell receptors that can be produced by an individual, providing insight into its potential diversity and the probability of generation of any specific T-cell receptor. Then in chapter 3, we present work on understanding the functioning of regulatory DNA sequences in both prokaryotes and eukaryotes. Here we use experiments that measure the transcriptional activity of large libraries of mutagenized promoters and enhancers and infer models of the sequence-function relationship from this data. For the bacterial promoter, we infer a physically motivated 'thermodynamic' model of the interaction of DNA-binding proteins and RNA polymerase determining the transcription rate of the downstream gene. For the eukaryotic enhancers, we infer heuristic models of the sequence-function relationship and use these models to find synthetic enhancer sequences that optimize inducibility of expression. Both projects demonstrate the utility of sequence information in conjunction with sophisticated statistical inference techniques for dissecting underlying biophysical

  19. Nucleotide sequence of SHV-2 beta-lactamase gene

    SciTech Connect

    Garbarg-Chenon, A.; Godard, V.; Labia, R.; Nicolas, J.C.

    1990-07-01

    The nucleotide sequence of plasmid-mediated beta-lactamase SHV-2 from Salmonella typhimurium (SHV-2pHT1) was determined. The gene was very similar to chromosomally encoded beta-lactamase LEN-1 of Klebsiella pneumoniae. Compared with the sequence of the Escherichia coli SHV-2 enzyme (SHV-2E.coli) obtained by protein sequencing, the deduced amino acid sequence of SHV-2pHT1 differed by three amino acid substitutions.

  20. PROMALS web server for accurate multiple protein sequence alignments.

    PubMed

    Pei, Jimin; Kim, Bong-Hyun; Tang, Ming; Grishin, Nick V

    2007-07-01

    Multiple sequence alignments are essential in homology inference, structure modeling, functional prediction and phylogenetic analysis. We developed a web server that constructs multiple protein sequence alignments using PROMALS, a progressive method that improves alignment quality by using additional homologs from PSI-BLAST searches and secondary structure predictions from PSIPRED. PROMALS shows higher alignment accuracy than other advanced methods, such as MUMMALS, ProbCons, MAFFT and SPEM. The PROMALS web server takes FASTA format protein sequences as input. The output includes a colored alignment augmented with information about sequence grouping, predicted secondary structures and positional conservation. The PROMALS web server is available at: http://prodata.swmed.edu/promals/ PMID:17452345

  1. 77 FR 65537 - Requirements for Patent Applications Containing Nucleotide Sequence and/or Amino Acid Sequence...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-29

    ... Amino Acid Sequence Disclosures ACTION: Proposed collection; comment request. SUMMARY: The United States....'' SUPPLEMENTARY INFORMATION: I. Abstract Patent applications that contain nucleotide and/or amino acid...

  2. Blasting and Zipping: Sequence Alignment and Mutual Information

    NASA Astrophysics Data System (ADS)

    Penner, Orion; Grassberger, Peter; Paczuski, Maya

    2009-03-01

    Alignment of biological sequences such as DNA, RNA or proteins is one of the most widely used tools in computational bioscience. While the accomplishments of sequence alignment algorithms are undeniable the fact remains that these algorithms are based upon heuristic scoring schemes. Therefore, these algorithms do not provide model independent and objective measures for how similar two (or more) sequences actually are. Although information theory provides such a similarity measure - the mutual information (MI) - numerous previous attempts to connect sequence alignment and information have not produced realistic estimates for the MI from a given alignment. We report on a simple and flexible approach to get robust estimates of MI from global alignments. The presented results may help establish MI as a reliable tool for evaluating the quality of global alignments, judging the relative merits of different alignment algorithms, and estimating the significance of specific alignments.
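
    As a point of reference, a naive plug-in estimate of the mutual information between two aligned sequences can be sketched as below. This is not the estimator developed in the paper; it simply counts aligned symbol pairs, skips gapped columns (an assumption), and is known to be biased for short alignments.

```python
from collections import Counter
from math import log2


def mutual_information(aligned_a, aligned_b):
    """Plug-in MI estimate in bits per aligned (ungapped) column."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    marg_a = Counter(x for x, _ in pairs)
    marg_b = Counter(y for _, y in pairs)
    # Sum of p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over observed pairs.
    return sum((c / n) * log2(c * n / (marg_a[x] * marg_b[y]))
               for (x, y), c in joint.items())
```

    Two identical sequences that use all four bases equally give 2 bits per column, while columns whose symbols are statistically independent give a value near zero.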

  3. Nucleotide sequence stability of the genome of hepatitis delta virus.

    PubMed Central

    Netter, H J; Wu, T T; Bockol, M; Cywinski, A; Ryu, W S; Tennant, B C; Taylor, J M

    1995-01-01

    Cultured cells were cotransfected with a fully sequenced 1,679-base cDNA clone of human hepatitis delta virus (HDV) RNA genome and a cDNA for the genome of woodchuck hepatitis virus (WHV). The HDV particles released were able to infect a woodchuck that was chronically infected with WHV. The HDV so produced was passaged a total of six times in woodchucks in order to determine the stability of the HDV nucleotide sequence. During a final chronic infection with such virus, liver RNA was extracted, and the HDV nucleotide sequence for the 352-base region, positions 905 to 1256, was obtained. By means of PCR, we obtained double-stranded cDNA both for direct sequencing and also for molecular cloning followed by sequencing. By direct sequencing, we found that a consensus sequence existed and was identical to the original sequence. From the sequences of 31 clones, we found 32% (10 of 31) to be identical to the original single nucleotide sequence. For the remainder, there were neither insertions nor deletions but there was a small number of single-nucleotide changes. These changes were predominantly transitions rather than transversions. Furthermore, the transitions were largely of just two types, uridine to cytidine and adenosine to guanosine. Of the 40 changes detected on HDV, 35% (14 of 40) occurred within an eight-nucleotide region that included position 1012, previously shown to be a site of RNA editing. These findings may have significant implications regarding both the stability of the HDV RNA genome and the mechanism of RNA editing. PMID:7853505
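
    The distinction between transitions and transversions used above can be made concrete with a short sketch. The function names are hypothetical, and the input is assumed to be two equal-length, already aligned sequences without insertions or deletions, as in the clones described.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify_change(ref, alt):
    """Classify a single-nucleotide change as a transition or transversion."""
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unknown base: {base!r}")
    if ref == alt:
        raise ValueError("no change")
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_changes(reference, clone):
    """Tally changes between two equal-length sequences of one alphabet."""
    counts = {"transition": 0, "transversion": 0}
    for r, c in zip(reference, clone):
        if r != c:
            counts[classify_change(r, c)] += 1
    return counts
```

    Under this scheme the two change types reported as most frequent, uridine to cytidine and adenosine to guanosine, are both transitions.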

  4. The nucleotide sequence of cowpea mosaic virus B RNA

    PubMed Central

    Lomonossoff, G.P.; Shanks, M.

    1983-01-01

    The complete sequence of the bottom component RNA (B RNA) of cowpea mosaic virus (CPMV) has been determined. Restriction enzyme fragments of double-stranded cDNA were cloned in M13 and the sequence of the inserts was determined by a combination of enzymatic and chemical sequencing techniques. Additional sequence information was obtained by primed synthesis on first strand cDNA. The complete sequence deduced is 5889 nucleotides long excluding the 3' poly(A), and contains an open reading frame sufficient to code for a polypeptide of mol. wt. 207 760. The coding region is flanked by a 5' leader sequence of 206 nucleotides and a 3' non-coding region of 82 residues which does not contain a polyadenylation signal. PMID:16453487

  5. Nucleotide sequence composition and method for detection of Neisseria gonorrhoeae

    SciTech Connect

    Lo, A.; Yang, H.L.

    1990-02-13

    This patent describes a composition of matter that is specific for Neisseria gonorrhoeae. It comprises at least one nucleotide sequence for which the ratio of the amount of the sequence which hybridizes to chromosomal DNA of Neisseria gonorrhoeae to the amount of the sequence which hybridizes to chromosomal DNA of Neisseria meningitidis is greater than about five, the ratio being obtained by a method described.

  6. MULTAN: a program to align multiple DNA sequences.

    PubMed Central

    Bains, W

    1986-01-01

    I describe a computer program which can align a large number of nucleic acid sequences with one another. The program uses a heuristic, iterative algorithm which has been tested extensively, and is found to produce useful alignments of a variety of sequence families. The algorithm is fast enough to be practical for the analysis of large numbers of sequences, and is implemented in a program which contains a variety of other functions to facilitate the analysis of the aligned result. PMID:3003672

  7. High-speed multiple sequence alignment on a reconfigurable platform.

    PubMed

    Oliver, Tim; Schmidt, Bertil; Maskell, Douglas; Nathan, Darran; Clemens, Ralf

    2006-01-01

    Progressive alignment is a widely used approach to compute multiple sequence alignments (MSAs). However, aligning several hundred sequences by popular progressive alignment tools requires hours on sequential computers. Due to the rapid growth of sequence databases biologists have to compute MSAs in a far shorter time. In this paper we present a new approach to MSA on reconfigurable hardware platforms to gain high performance at low cost. We have constructed a linear systolic array to perform pairwise sequence distance computations using dynamic programming. This results in an implementation with significant runtime savings on a standard FPGA.

  8. The number of reduced alignments between two DNA sequences

    PubMed Central

    2014-01-01

    Background: In this study we consider DNA sequences as mathematical strings. Total and reduced alignments between two DNA sequences have been considered in the literature to measure their similarity. Results for explicit representations of some alignments have been already obtained. Results: We present exact, explicit and computable formulas for the number of different possible alignments between two DNA sequences and a new formula for a class of reduced alignments. Conclusions: A unified approach for a wide class of alignments between two DNA sequences has been provided. The formula is computable and, if complemented by software development, will provide a deeper insight into the theory of sequence alignment and give rise to new comparison methods. AMS Subject Classification: Primary 92B05, 33C20, secondary 39A14, 65Q30 PMID:24684679
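
    To give a sense of the counting problem, the sketch below uses the classical recurrence for alignments in which every column is either a pair of residues, a gap in the first sequence, or a gap in the second. It is assumed here that this matches the "total alignments" of the abstract; the paper's explicit formulas, including those for reduced alignments, are not reproduced.

```python
def count_alignments(m, n):
    """Number of alignments of sequences of lengths m and n.

    Uses f(i, j) = f(i-1, j) + f(i, j-1) + f(i-1, j-1) with f(0, j) = f(i, 0) = 1,
    which yields the Delannoy numbers.
    """
    row = [1] * (n + 1)  # f(0, j) for j = 0..n
    for _ in range(m):
        new = [1]  # f(i, 0)
        for j in range(1, n + 1):
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[n]
```

    The count depends only on the two lengths, not on the sequence content, and it grows quickly: two sequences of length 3 already admit 63 such alignments.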

  9. Mining of haplotype-based expressed sequence tag single nucleotide polymorphisms in citrus

    PubMed Central

    2013-01-01

    Background: Single nucleotide polymorphisms (SNPs), the most abundant variations in a genome, have been widely used in various studies. Detection and characterization of citrus haplotype-based expressed sequence tag (EST) SNPs will greatly facilitate further utilization of these gene-based resources. Results: In this paper, haplotype-based SNPs were mined out of publicly available citrus expressed sequence tags (ESTs) from different citrus cultivars (genotypes) individually and collectively for comparison. There were a total of 567,297 ESTs belonging to 27 cultivars in varying numbers and consequentially yielding different numbers of haplotype-based quality SNPs. Sweet orange (SO) had the most (213,830) ESTs, generating 11,182 quality SNPs in 3,327 out of 4,228 usable contigs. Summed from all the individually mining results, a total of 25,417 quality SNPs were discovered – 15,010 (59.1%) were transitions (AG and CT), 9,114 (35.9%) were transversions (AC, GT, CG, and AT), and 1,293 (5.0%) were insertion/deletions (indels). A vast majority of SNP-containing contigs consisted of only 2 haplotypes, as expected, but the percentages of 2 haplotype contigs varied widely in these citrus cultivars. BLAST of the 25,417 25-mer SNP oligos to the Clementine reference genome scaffolds revealed 2,947 SNPs had “no hits found”, 19,943 had 1 unique hit / alignment, 1,571 had one hit and 2+ alignments per hit, and 956 had 2+ hits and 1+ alignment per hit. Of the total 24,293 scaffold hits, 23,955 (98.6%) were on the main scaffolds 1 to 9, and only 338 were on 87 minor scaffolds. Most alignments had 100% (25/25) or 96% (24/25) nucleotide identities, accounting for 93% of all the alignments. Considering almost all the nucleotide discrepancies in the 24/25 alignments were at the SNP sites, it served well as in silico validation of these SNPs, in addition to and consistent with the rate (81%) validated by sequencing and SNaPshot assay. Conclusions: High-quality EST-SNPs from different

  10. Information capacity of nucleotide sequences and its applications.

    PubMed

    Sadovsky, M G

    2006-05-01

    The information capacity of nucleotide sequences is defined through the specific entropy of frequency dictionary of a sequence determined with respect to another one containing the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one, and from ordered entity. A comparison of sequences based on their information capacity is studied. An order within the genetic entities is found at the length scale ranged from 3 to 8. Some other applications of the developed methodology to genetics, bioinformatics, and molecular biology are discussed.

  11. Local alignment of two-base encoded DNA sequence

    PubMed Central

    Homer, Nils; Merriman, Barry; Nelson, Stanley F

    2009-01-01

    Background: DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity. Results: We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions. Conclusion: The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, and facilitating genome re-sequencing efforts based on this form of sequence data. PMID:19508732
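
    The deterministic decoding, and the reason measurement errors complicate it, can be seen in a small sketch. It assumes the common color-space form of two-base encoding, in which A, C, G and T are numbered 0 to 3 and each color is the XOR of two adjacent base codes; the abstract itself does not spell out the code, and the joint decode-and-align algorithm of the paper is not shown.

```python
# Assumed numbering of bases for a color-space style two-base code.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(seq):
    """Return (first base, colors); each color encodes one adjacent base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Rebuild the base sequence from the first base and the color list."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)
```

    Decoding `("A", [1, 3, 1])` gives back `ACGT`, but changing only the middle color to 0 gives `ACCA`: a single miscalled color alters every base after it. This is why naive decoding followed by ordinary alignment is fragile for such data.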

  12. A comparative analysis of multiple sequence alignments for biological data.

    PubMed

    Manzoor, Umar; Shahid, Sarosh; Zafar, Bassam

    2015-01-01

    Multiple sequence alignment plays a key role in the computational analysis of biological data. Different programs have been developed to analyze sequence similarity. This paper highlights the algorithmic techniques of the most popular multiple sequence alignment programs. These programs are then evaluated on the basis of execution time and scalability. The overall performance of these programs is assessed to highlight their strengths and weaknesses with reference to their algorithmic techniques. In terms of overall alignment quality, T-Coffee and MAFFT attain the highest average scores, whereas Kalign has the minimum computation time. PMID:26405947

  13. R3D-2-MSA: the RNA 3D structure-to-multiple sequence alignment server

    PubMed Central

    Cannone, Jamie J.; Sweeney, Blake A.; Petrov, Anton I.; Gutell, Robin R.; Zirbel, Craig L.; Leontis, Neocles

    2015-01-01

    The RNA 3D Structure-to-Multiple Sequence Alignment Server (R3D-2-MSA) is a new web service that seamlessly links RNA three-dimensional (3D) structures to high-quality RNA multiple sequence alignments (MSAs) from diverse biological sources. In this first release, R3D-2-MSA provides manual and programmatic access to curated, representative ribosomal RNA sequence alignments from bacterial, archaeal, eukaryal and organellar ribosomes, using nucleotide numbers from representative atomic-resolution 3D structures. A web-based front end is available for manual entry and an Application Program Interface for programmatic access. Users can specify up to five ranges of nucleotides and 50 nucleotide positions per range. The R3D-2-MSA server maps these ranges to the appropriate columns of the corresponding MSA and returns the contents of the columns, either for display in a web browser or in JSON format for subsequent programmatic use. The browser output page provides a 3D interactive display of the query, a full list of sequence variants with taxonomic information and a statistical summary of distinct sequence variants found. The output can be filtered and sorted in the browser. Previous user queries can be viewed at any time by resubmitting the output URL, which encodes the search and re-generates the results. The service is freely available with no login requirement at http://rna.bgsu.edu/r3d-2-msa. PMID:26048960

  14. Probabilistic sequence alignment of stratigraphic records

    NASA Astrophysics Data System (ADS)

    Lin, Luan; Khider, Deborah; Lisiecki, Lorraine E.; Lawrence, Charles E.

    2014-10-01

    The assessment of age uncertainty in stratigraphically aligned records is a pressing need in paleoceanographic research. The alignment of ocean sediment cores is used to develop mutually consistent age models for climate proxies and is often based on the δ18O of calcite from benthic foraminifera, which records a global ice volume and deep water temperature signal. To date, δ18O alignment has been performed by manual, qualitative comparison or by deterministic algorithms. Here we present a hidden Markov model (HMM) probabilistic algorithm to find 95% confidence bands for δ18O alignment. This model considers the probability of every possible alignment based on its fit to the δ18O data and transition probabilities for sedimentation rate changes obtained from radiocarbon-based estimates for 37 cores. Uncertainty is assessed using a stochastic back trace recursion to sample alignments in exact proportion to their probability. We applied the algorithm to align 35 late Pleistocene records to a global benthic δ18O stack and found that the mean width of 95% confidence intervals varies between 3 and 23 kyr depending on the resolution and noisiness of the record's δ18O signal. Confidence bands within individual cores also vary greatly, ranging from ~0 to >40 kyr. These alignment uncertainty estimates will allow researchers to examine the robustness of their conclusions, including the statistical evaluation of lead-lag relationships between events observed in different cores.

  15. A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method.

    PubMed

    Hatje, Klas; Kollmar, Martin

    2012-01-01

    Phylogenetic analyses reveal the evolutionary derivation of species. A phylogenetic tree can be inferred from multiple sequence alignments of proteins or genes. The alignment of whole genome sequences of higher eukaryotes is a computationally intensive and ambitious task, as is the computation of phylogenetic trees based on these alignments. To overcome these limitations, we here used an alignment-free method to compare genomes of the Brassicales clade. For each nucleotide sequence a Chaos Game Representation (CGR) can be computed, which represents each nucleotide of the sequence as a point in a square defined by the four nucleotides as vertices. Each CGR is therefore a unique fingerprint of the underlying sequence. If the CGRs are divided by grid lines, each grid square denotes the occurrence of oligonucleotides of a specific length in the sequence (Frequency Chaos Game Representation, FCGR). Here, we used distance measures between FCGRs to infer phylogenetic trees of Brassicales species. Three types of data were analyzed because of their different characteristics: (A) Whole genome assemblies as far as available for species belonging to the Malvidae taxon. (B) EST data of species of the Brassicales clade. (C) Mitochondrial genomes of the Rosids branch, a supergroup of the Malvidae. The trees reconstructed based on the Euclidean distance method are in general agreement with single gene trees. The Fitch-Margoliash and Neighbor joining algorithms resulted in similar to identical trees. Here, for the first time we have applied the bootstrap re-sampling concept to trees based on FCGRs to determine the support of the branchings. FCGRs have the advantage that they are fast to calculate, and can be used as additional information to alignment based data and morphological characteristics to improve the phylogenetic classification of species in ambiguous cases.
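
    A minimal sketch of the FCGR idea described above, under stated assumptions: the assignment of nucleotides to the corners of the square (CORNERS) and the function names are illustrative choices, not taken from the paper. Each k-mer is mapped to the grid cell in which the chaos game would place it (computed with integer arithmetic rather than by iterating midpoints), and two sequences are compared by the Euclidean distance between their frequency grids.

```python
from math import sqrt

# Assumed corner layout; any fixed assignment of the four bases to vertices works.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(seq, k):
    """Return a 2^k x 2^k grid of k-mer frequencies (rows are y, columns are x)."""
    n = 2 ** k
    grid = [[0.0] * n for _ in range(n)]
    seq = seq.upper()
    total = 0
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing ambiguous symbols
        col = row = 0
        # The last base of the k-mer decides the coarsest halving of the
        # square, so it contributes the most significant bit.
        for j, base in enumerate(kmer):
            cx, cy = CORNERS[base]
            col |= cx << j
            row |= cy << j
        grid[row][col] += 1
        total += 1
    if total:
        grid = [[count / total for count in r] for r in grid]
    return grid

def euclidean(grid_a, grid_b):
    """Euclidean distance between two FCGR grids of equal size."""
    return sqrt(sum((p - q) ** 2
                    for row_a, row_b in zip(grid_a, grid_b)
                    for p, q in zip(row_a, row_b)))

distance = euclidean(fcgr("ACGTACGTTGCA", 2), fcgr("ACGTTTTTGGCA", 2))
```

    A matrix of such pairwise distances could then be passed to a distance-based tree method such as Neighbor joining, which is the step the abstract describes next.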

  16. Method for the detection of specific nucleic acid sequences by polymerase nucleotide incorporation

    DOEpatents

    Castro, Alonso

    2004-06-01

    A method for rapid and efficient detection of a target DNA or RNA sequence is provided. A primer having a 3'-hydroxyl group at one end and having a sequence of nucleotides sufficiently homologous with an identifying sequence of nucleotides in the target DNA is selected. The primer is hybridized to the identifying sequence of nucleotides on the DNA or RNA sequence and a reporter molecule is synthesized on the target sequence by progressively binding complementary nucleotides to the primer, where the complementary nucleotides include nucleotides labeled with a fluorophore. Fluorescence emitted by fluorophores on single reporter molecules is detected to identify the target DNA or RNA sequence.

  17. ProbCons: Probabilistic consistency-based multiple sequence alignment.

    PubMed

    Do, Chuong B; Mahabhashyam, Mahathi S P; Brudno, Michael; Batzoglou, Serafim

    2005-02-01

    To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present ProbCons, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark data sets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, ProbCons achieves statistically significant improvement over other leading methods while maintaining practical speed. ProbCons is publicly available as a Web resource.

  18. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm

    PubMed Central

    Kumar, Manish

    2015-01-01

    One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic multiple sequence alignment problem is to determine the most biologically plausible alignment of protein or DNA sequences. In this paper, an alignment method using a genetic algorithm for multiple sequence alignment is proposed. Two different genetic operators, namely crossover and mutation, were defined and implemented with the proposed method in order to follow the population evolution and the quality of the aligned sequences. The proposed method is assessed with a protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8. Experiments on a wide range of data have shown that the proposed algorithm is much better (in terms of score) than previously proposed algorithms in its ability to achieve high alignment quality. PMID:27065770

  19. A simple method to control over-alignment in the MAFFT multiple sequence alignment program

    PubMed Central

    Katoh, Kazutaka; Standley, Daron M.

    2016-01-01

    Motivation: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment is recently becoming greater, as low-quality or noisy sequences are increasing in protein sequence databases, due, for example, to sequencing errors and difficulty in gene prediction. Results: The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method significantly increases the correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. Availability and implementation: The new feature is available in MAFFT versions 7.263 and higher. http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27153688

  1. PhyPA: Phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences.

    PubMed

    Xia, Xuhua

    2016-09-01

    While pairwise sequence alignment (PSA) by dynamic programming is guaranteed to generate one of the optimal alignments, multiple sequence alignment (MSA) of highly divergent sequences often results in poorly aligned sequences, plaguing all subsequent phylogenetic analysis. One way to avoid this problem is to use only PSA to reconstruct phylogenetic trees, which can only be done with distance-based methods. I compared the accuracy of this new computational approach (named PhyPA for phylogenetics by pairwise alignment) against the maximum likelihood method using MSA (the ML+MSA approach), based on nucleotide, amino acid and codon sequences simulated with different topologies and tree lengths. I present a surprising discovery that the fast PhyPA method consistently outperforms the slow ML+MSA approach for highly diverged sequences even when all optimization options were turned on for the ML+MSA approach. Only when sequences are not highly diverged (i.e., when a reliable MSA can be obtained) does the ML+MSA approach outperform PhyPA. The true topologies are always recovered by ML with the true alignment from the simulation. However, with MSA derived from alignment programs such as MAFFT or MUSCLE, the recovered topology consistently has higher likelihood than that for the true topology. Thus, the failure to recover the true topology by the ML+MSA approach is not because of insufficient search of tree space, but because of the distortion of phylogenetic signal by MSA methods. I have implemented in DAMBE PhyPA and two approaches making use of multi-gene data sets to derive phylogenetic support for subtrees equivalent to resampling techniques such as bootstrapping and jackknifing. PMID:27377322

  2. The primary nucleotide sequence of U4 RNA.

    PubMed

    Reddy, R; Henning, D; Busch, H

    1981-04-10

    U4 RNA is one of the "capped" nuclear snRNAs recently found to be precipitable by anti-Sm antibodies as ribonucleoprotein particles. U4 RNA, along with other snRNAs, has been implicated in hnRNA processing, mRNA transport, or both (Lerner, M. R., Boyle, J., Mount, S., Wolin, S., and Steitz, J. A. (1980) Nature 283, 220-224). Since the proteins bound to different snRNAs appear to be the same, the functions of different snRNPs might be dependent on the RNA components. To help understand the function of U4 RNP, the nucleotide sequence of U4 RNA was determined. The sequence is (formula see text) In addition to the modified nucleotides in the "cap," U4 RNA contains Am at position 63 and m6A at position 98. It also exhibited A-C microheterogeneity at position 97. PMID:6162848

  3. Nucleotide-Specific Contrast for DNA Sequencing by Electron Spectroscopy.

    PubMed

    Mankos, Marian; Persson, Henrik H J; N'Diaye, Alpha T; Shadman, Khashayar; Schmid, Andreas K; Davis, Ronald W

    2016-01-01

    DNA sequencing by imaging in an electron microscope is an approach that holds promise to deliver long reads with low error rates and without the need for amplification. Earlier work using transmission electron microscopes, which use high electron energies on the order of 100 keV, has shown that low contrast and radiation damage necessitate the use of heavy atom labeling of individual nucleotides, which increases the read error rates. Other prior work has shown that scattering electrons with much lower energy suppresses beam damage on DNA. Here we explore possibilities to increase contrast by employing two methods, X-ray photoelectron and Auger electron spectroscopy. Using bulk DNA samples with monomers of each base, both methods are shown to provide contrast mechanisms that can distinguish individual nucleotides without labels. Both spectroscopic techniques can be readily implemented in a low energy electron microscope, which may enable label-free DNA sequencing by direct imaging. PMID:27149617

  4. Nucleotide-Specific Contrast for DNA Sequencing by Electron Spectroscopy

    PubMed Central

    Schmid, Andreas K.; Davis, Ronald W.

    2016-01-01

    DNA sequencing by imaging in an electron microscope is an approach that holds promise to deliver long reads with low error rates and without the need for amplification. Earlier work using transmission electron microscopes, which use high electron energies on the order of 100 keV, has shown that low contrast and radiation damage necessitate the use of heavy atom labeling of individual nucleotides, which increases the read error rates. Other prior work has shown that scattering electrons with much lower energy suppresses beam damage on DNA. Here we explore possibilities to increase contrast by employing two methods, X-ray photoelectron and Auger electron spectroscopy. Using bulk DNA samples with monomers of each base, both methods are shown to provide contrast mechanisms that can distinguish individual nucleotides without labels. Both spectroscopic techniques can be readily implemented in a low energy electron microscope, which may enable label-free DNA sequencing by direct imaging. PMID:27149617

  5. A novel randomized iterative strategy for aligning multiple protein sequences.

    PubMed

    Berger, M P; Munson, P J

    1991-10-01

    The rigorous alignment of multiple protein sequences becomes impractical even with a modest number of sequences, since computer memory and time requirements increase as the product of the lengths of the sequences. We have devised a strategy to approach such an optimal alignment, which modifies the intensive computer storage and time requirements of dynamic programming. Our algorithm randomly divides a group of unaligned sequences into two subgroups, between which an optimal alignment is then obtained by a Needleman-Wunsch style of algorithm. Our algorithm uses a matrix with dimensions corresponding to the lengths of the two aligned sequence subgroups. The pairwise alignment process is repeated using different random divisions of the whole group into two subgroups. Compared with the rigorous approach of solving the n-dimensional lattice by dynamic programming, our iterative algorithm results in alignments that match or are close to the optimal solution, on a limited set of test problems. We have implemented this algorithm in a computer program that runs on the IBM PC class of machines, together with a user-friendly environment for interactively selecting sequences or groups of sequences to be aligned either simultaneously or progressively.
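
    The strategy described above can be sketched as follows. This is an illustration under assumptions, not the authors' program: the sum-of-pairs column score, the linear gap penalty, the scoring values and all function names are hypothetical choices. Each round splits the sequences at random into two subgroups, removes columns that are all gaps within each subgroup, and realigns the two subgroups with a Needleman-Wunsch style dynamic program whose matrix has one dimension per subgroup.

```python
import random

GAP = "-"

def column_score(col_a, col_b, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score between two alignment columns (assumed scheme)."""
    score = 0
    for a in col_a:
        for b in col_b:
            if a == GAP and b == GAP:
                continue
            if a == GAP or b == GAP:
                score += gap
            else:
                score += match if a == b else mismatch
    return score

def align_groups(group_a, group_b):
    """Needleman-Wunsch between two groups of already aligned rows."""
    cols_a, cols_b = list(zip(*group_a)), list(zip(*group_b))
    gap_a, gap_b = (GAP,) * len(group_a), (GAP,) * len(group_b)
    n, m = len(cols_a), len(cols_b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                s[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                s[i][j - 1] + column_score(gap_a, cols_b[j - 1]))
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace back one optimal path
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]):
            out_a.append(cols_a[i - 1]); out_b.append(cols_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + column_score(cols_a[i - 1], gap_b):
            out_a.append(cols_a[i - 1]); out_b.append(gap_b); i -= 1
        else:
            out_a.append(gap_a); out_b.append(cols_b[j - 1]); j -= 1
    out_a.reverse(); out_b.reverse()
    rows_a = ["".join(c[r] for c in out_a) for r in range(len(group_a))]
    rows_b = ["".join(c[r] for c in out_b) for r in range(len(group_b))]
    return rows_a, rows_b

def drop_gap_columns(rows):
    """Remove columns that contain only gaps."""
    cols = [c for c in zip(*rows) if any(ch != GAP for ch in c)]
    return ["".join(c[r] for c in cols) for r in range(len(rows))]

def randomized_iterative_alignment(seqs, rounds=20, seed=0):
    """Repeatedly realign two random subgroups (needs at least two sequences)."""
    rng = random.Random(seed)
    width = max(len(s) for s in seqs)
    aligned = [s + GAP * (width - len(s)) for s in seqs]  # trivial starting alignment
    for _ in range(rounds):
        order = list(range(len(aligned)))
        rng.shuffle(order)
        cut = rng.randint(1, len(order) - 1)
        part_a, part_b = order[:cut], order[cut:]
        new_a, new_b = align_groups(
            drop_gap_columns([aligned[i] for i in part_a]),
            drop_gap_columns([aligned[i] for i in part_b]))
        for i, row in zip(part_a + part_b, new_a + new_b):
            aligned[i] = row
    return aligned

alignment = randomized_iterative_alignment(["GATTACA", "GATACA", "GTTACA"])
```

    Each round costs time and memory proportional to the product of the two subgroup alignment lengths, rather than the product of all sequence lengths that the rigorous n-dimensional approach would need.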

  6. Uncertainty in homology inferences: Assessing and improving genomic sequence alignment

    PubMed Central

    Lunter, Gerton; Rocco, Andrea; Mimouni, Naila; Heger, Andreas; Caldeira, Alexandre; Hein, Jotun

    2008-01-01

    Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human–mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman–Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/. PMID:18073381

  7. Large-scale detection and application of expressed sequence tag single nucleotide polymorphisms in Nicotiana.

    PubMed

    Wang, Y; Zhou, D; Wang, S; Yang, L

    2015-01-01

    Single nucleotide polymorphisms (SNPs) are widespread in the Nicotiana genome. Using an alignment and variation detection method, we developed 20,607,973 SNPs, based on the expressed sequence tag sequences of 10 Nicotiana species. The replacement rate was much higher than the transversion rate in the SNPs, and SNPs widely exist in the Nicotiana. In vitro verification indicated that all of the SNPs were high quality and accurate. Evolutionary relationships between 15 varieties were investigated by polymerase chain reaction with a special primer; the specific 302 locus of these sequence results clearly indicated the origin of Zhongyan 100. A database of Nicotiana SNPs (NSNP) was developed to store and search for SNPs in Nicotiana. NSNP is a tool for researchers to develop SNP markers of sequence data. PMID:26214460

  8. Large-scale detection and application of expressed sequence tag single nucleotide polymorphisms in Nicotiana.

    PubMed

    Wang, Y; Zhou, D; Wang, S; Yang, L

    2015-07-14

    Single nucleotide polymorphisms (SNPs) are widespread in the Nicotiana genome. Using an alignment and variation detection method, we developed 20,607,973 SNPs, based on the expressed sequence tag sequences of 10 Nicotiana species. The replacement rate was much higher than the transversion rate in the SNPs, and SNPs widely exist in the Nicotiana. In vitro verification indicated that all of the SNPs were high quality and accurate. Evolutionary relationships between 15 varieties were investigated by polymerase chain reaction with a special primer; the specific 302 locus of these sequence results clearly indicated the origin of Zhongyan 100. A database of Nicotiana SNPs (NSNP) was developed to store and search for SNPs in Nicotiana. NSNP is a tool for researchers to develop SNP markers of sequence data.

  10. The complete nucleotide sequence of pelargonium leaf curl virus.

    PubMed

    McGavin, Wendy J; MacFarlane, Stuart A

    2016-05-01

    Investigation of a tombusvirus isolated from tulip plants in Scotland revealed that it was pelargonium leaf curl virus (PLCV) rather than the originally suggested tomato bushy stunt virus. The complete sequence of the PLCV genome was determined for the first time, revealing it to be 4789 nucleotides in size and to have an organization similar to that of the other, previously described tombusviruses. Primers derived from the sequence were used to construct a full-length infectious clone of PLCV that recapitulates the disease symptoms of leaf curling in systemically infected pelargonium plants. PMID:26906694

  11. Fluorogenic sequencing using halogen-fluorescein-labeled nucleotides.

    PubMed

    Chen, Zitian; Duan, Haifeng; Qiao, Shuo; Zhou, Wenxiong; Qiu, Haiwei; Kang, Li; Xie, X Sunney; Huang, Yanyi

    2015-05-26

    Fluorogenic sequencing is a sequencing-by-synthesis technology that combines the advantages of pyrosequencing and fluorescence detection. With native duplex DNA as the major product, we employ polymerase to incorporate the complementarily matched terminal phosphate-labeled fluorogenic nucleotides into the DNA template and release halogen-fluorescein as the reporter. This red-emitting fluorophore successfully avoids spectral overlap with the autofluorescence background of the flow chip. We fully characterized the enzymatic reaction kinetics of the new substrates, and performed a 35-base sequencing experiment with 60 reaction cycles. Our achievement expands the substrate repertoire for fluorogenic sequencing, and extends the spectral range to obtain better signal-to-background performance.

  12. Refinement by shifting secondary structure elements improves sequence alignments.

    PubMed

    Tong, Jing; Pei, Jimin; Otwinowski, Zbyszek; Grishin, Nick V

    2015-03-01

    Constructing a model of a query protein based on its alignment to a homolog with experimentally determined spatial structure (the template) is still the most reliable approach to structure prediction. Alignment errors are the main bottleneck for homology modeling when the query is distantly related to the template. Alignment methods often misalign secondary structural elements by a few residues. Therefore, better alignment solutions can be found within a limited set of local shifts of secondary structures. We present a refinement method to improve pairwise sequence alignments by evaluating alignment variants generated by local shifts of template-defined secondary structures. Our method SFESA is based on a novel scoring function that combines the profile-based sequence score and the structure score derived from residue contacts in a template. Such a combined score frequently selects a better alignment variant among a set of candidate alignments generated by local shifts and leads to overall increase in alignment accuracy. Evaluation of several benchmarks shows that our refinement method significantly improves alignments made by automatic methods such as PROMALS, HHpred and CNFpred. The web server is available at http://prodata.swmed.edu/sfesa. PMID:25546158

  13. Refinement by shifting secondary structure elements improves sequence alignments

    PubMed Central

    Tong, Jing; Pei, Jimin; Otwinowski, Zbyszek; Grishin, Nick V.

    2015-01-01

    Constructing a model of a query protein based on its alignment to a homolog with experimentally determined spatial structure (the template) is still the most reliable approach to structure prediction. Alignment errors are the main bottleneck for homology modeling when the query is distantly related to the template. Alignment methods often misalign secondary structural elements by a few residues. Therefore, better alignment solutions can be found within a limited set of local shifts of secondary structures. We present a refinement method to improve pairwise sequence alignments by evaluating alignment variants generated by local shifts of template-defined secondary structures. Our method SFESA is based on a novel scoring function that combines the profile-based sequence score and the structure score derived from residue contacts in a template. Such a combined score frequently selects a better alignment variant among a set of candidate alignments generated by local shifts and leads to overall increase in alignment accuracy. Evaluation of several benchmarks shows that our refinement method significantly improves alignments made by automatic methods such as PROMALS, HHpred and CNFpred. The web server is available at http://prodata.swmed.edu/sfesa. PMID:25546158

  14. Bioinformatics comparison of sulfate-reducing metabolism nucleotide sequences

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Nguyen, A.; Cheung, E.; Sullivan, R.; Holden, T.; Lieberman, D.; Cheung, T.

    2015-09-01

    The sulfate-reducing bacteria can be traced back to 3.5 billion years ago. The thermodynamic details of the sulfur cycle have been well documented. A recent sulfate-reducing bacteria report (Robator, Jungbluth, et al., 2015 Jan, Front. Microbiol) with Genbank nucleotide data has been analyzed in terms of the sulfite reductase (dsrAB) via fractal dimension and entropy values. Comparison to oil field sulfate-reducing sequences was included. The AUCG translational mass fractal dimension versus ATCG transcriptional mass fractal dimension for the low temperature dsrB and dsrA sequences reported in Reference Thirteen shows correlation R-sq ~ 0.79, with a probability of about 3% in simulation. A recent report of using a Cystathionine gamma-lyase sequence to produce CdS quantum dots in a biological method, where the sulfur is reduced just like in the H2S production process, was included for comparison. The AUCG mass fractal dimension versus ATCG mass fractal dimension for the Cystathionine gamma-lyase sequences was found to have R-sq of 0.72, similar to the low temperature dissimilatory sulfite reductase dsr group with 3% probability, in contrast to the oil field group having R-sq ~ 0.94, a highly probable outcome in the simulation. The other two simulation histograms, namely, fractal dimension versus entropy R-sq outcome values, and di-nucleotide entropy versus mono-nucleotide entropy R-sq outcome values, are also discussed in the data analysis focusing on low probability outcomes.

  15. Characterization and partial nucleotide sequence of endogenous type C retrovirus segments in human chromosomal DNA.

    PubMed Central

    Repaske, R; O'Neill, R R; Steele, P E; Martin, M A

    1983-01-01

    Twenty-six different murine leukemia virus (MuLV)-related clones have been isolated from a human DNA library and characterized by restriction enzyme mapping and reciprocal nucleic acid hybridization reactions. The sequence of approximately 2,600 nucleotides, spanning more than 4.0 kilobases, of one of the MuLV-related cloned human DNAs was also determined. The deduced amino acid sequence permitted the alignment of this prototype cloned human DNA segment with the p12 gag, p30 gag, p10 gag, and pol regions of Moloney MuLV. A majority of the endogenous type C retrovirus-related segments present in human DNA are approximately 6.0 kilobases in size and appear to contain a deletion of env sequences. PMID:6298769

  16. Cytochrome b nucleotide sequence variation among the Atlantic Alcidae.

    PubMed

    Friesen, V L; Montevecchi, W A; Davidson, W S

    1993-01-01

    Analysis of cytochrome b nucleotide sequences of the six extant species of Atlantic alcids and a gull revealed an excess of adenines and cytosines and a deficit of guanines at silent sites on the coding strand. Phylogenetic analyses grouped the sequences of the common (Uria aalge) and Brünnich's (U. lomvia) guillemots, followed by the razorbill (Alca torda) and little auk (Alle alle). The black guillemot (Cepphus grylle) sequence formed a sister taxon, and the puffin (Fratercula arctica) fell outside the other alcids. Phylogenetic comparisons of substitutions indicated that mutabilities of bases did not differ, but that C was much more likely to be incorporated than was G. Imbalances in base composition appear to result from a strand bias in replication errors, which may result from selection on secondary RNA structure and/or the energetics of codon-anticodon interactions. PMID:7916741
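
    A base-composition tally of the kind reported above might look like the sketch below. It is an illustration under a loud assumption: silent sites are approximated here by third codon positions of an in-frame coding sequence, which is not necessarily how the authors defined them, and the function name is hypothetical.

```python
from collections import Counter

def third_position_composition(coding_seq):
    """Frequencies of A, C, G, T at third codon positions of an in-frame sequence."""
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    thirds = [seq[i] for i in range(2, usable, 3) if seq[i] in "ACGT"]
    counts = Counter(thirds)
    total = len(thirds)
    return {base: (counts[base] / total if total else 0.0) for base in "ACGT"}

composition = third_position_composition("ATGGCCAAATTC")
```

    An excess of A and C with a deficit of G would show up as frequencies well above and below 0.25 respectively in the returned dictionary.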

  17. Detection of protein similarities using nucleotide sequence databases.

    PubMed

    Henikoff, S; Wallace, J C

    1988-07-11

    A simple procedure is described for finding similarities between proteins using nucleotide sequence databases. The approach is illustrated by several examples of previously unknown correspondences with important biological implications: Drosophila elongation factor Tu is shown to be encoded by two genes that are differently expressed during development; a cluster of three Drosophila genes likely encode maltases; a flesh-fly fat body protein resembles the hypothesized Drosophila alcohol dehydrogenase ancestral protein; an unknown protein encoded at the multifunctional E. coli hisT locus resembles aspartate beta-semialdehyde dehydrogenase; and the E. coli tyrR protein is related to nitrogen regulatory proteins. These and other matches were discovered using a personal computer of the type available in most laboratories collecting DNA sequence data. As relatively few sequences were sampled to find these matches, it is likely that much of the existing data has not been adequately examined.

  18. Nucleotide sequence and expression of a Drosophila metallothionein.

    PubMed

    Lastowski-Perry, D; Otto, E; Maroni, G

    1985-02-10

    A Drosophila melanogaster cDNA clone was isolated based on its more intense hybridization to RNA sequences from copper-fed larvae than from control larval RNA. This clone showed strong hybridization to mouse metallothionein I cDNA at reduced stringency. Its nucleotide sequence includes an open reading segment which codes for a 40-amino acid protein; this protein is identified as metallothionein based on its similarity to the amino-terminal portion of mammalian and crab metalloproteins. The 10 cysteine residues present occur in five pairs of near vicinal cysteines (Cys-X-Cys). This cDNA sequence hybridized to a 400-nucleotide polyadenylated RNA whose presence in the cells of the alimentary canal of larvae was stimulated by ingestion of cadmium or copper; in other tissues this RNA was present at much lower levels. Mercury, silver, and zinc induced metallothionein to a lesser extent. The level of metallothionein RNA increased very soon after the initiation of metal treatment and reached a maximum after approximately 36 h. PMID:2578462

  19. Nucleotide sequence of the vaccinia virus hemagglutinin gene.

    PubMed

    Shida, H

    1986-04-30

    Vaccinia virus hemagglutinin (HA) is expressed late in the infection cycle and is nonessential for virus growth. The location of the HA structural gene was determined by hybrid-arrested and hybrid-selected translation methods to be at the right terminus of the HindIII A fragment. The position of the HA gene was confirmed by the production of the complete HA protein in cells transfected with the plasmid containing that region. Examination of this nucleotide sequence revealed the positions of cleavage sites for a number of restriction endonucleases. The deduced amino acid sequence revealed that the HA protein is a typical surface membrane glycoprotein. Comparison of the nucleotide sequence upstream of the HA coding region with the corresponding region of other late genes suggested the existence of the consensus decanucleotide TTCATTTa/tGT between 34 and 18 bp upstream of the initiation codon, followed by a cluster of A or T, a unique feature of the late genes of vaccinia virus. These results, in conjunction with the ease of isolating HA- mutants, provide a basis for a new site suitable for inserting foreign genes.
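
    A scan for the consensus decanucleotide quoted above can be sketched with a regular expression. The function name, the example sequence and the convention of reporting the distance from the start of the motif to the initiation codon are assumptions made for illustration; the abstract does not say how the 34 to 18 bp window was measured.

```python
import re

# TTCATTTa/tGT: the eighth position may be A or T.
MOTIF = re.compile(r"TTCATTT[AT]GT")

def motif_offsets(seq, start_codon_index):
    """Distances (bp) from each motif start to the initiation codon, upstream only."""
    upstream = seq[:start_codon_index].upper()
    return [start_codon_index - m.start() for m in MOTIF.finditer(upstream)]

# Hypothetical sequence: motif at index 2, initiation codon at index 22.
example = "GG" + "TTCATTTAGT" + "C" * 10 + "ATGAAA"
offsets = motif_offsets(example, 22)
```

    A caller could then keep only the hits whose offsets fall inside whatever upstream window is of interest.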

  20. Finding similar nucleotide sequences using network BLAST searches.

    PubMed

    Ladunga, Istvan

    2009-06-01

    The Basic Local Alignment Search Tool (BLAST) is a keystone of bioinformatics due to its performance and user-friendliness. Beginner and intermediate users will learn how to design and submit blastn and Megablast searches on the Web pages at the National Center for Biotechnology Information. We map nucleic acid sequences to genomes, find identical or similar mRNA, expressed sequence tag, and noncoding RNA sequences, and run Megablast searches, which are much faster than blastn. Understanding results is assisted by taxonomy reports, genomic views, and multiple alignments. We interpret expected frequency thresholds, biological significance, and statistical significance. Weak hits provide no evidence, but hints for further analyses. We find genes that may code for homologous proteins by translated BLAST. We reduce false positives by filtering out low-complexity regions. Parsed BLAST results can be integrated into analysis pipelines. Links in the output connect to Entrez, PUBMED, structural, sequence, interaction, and expression databases. This facilitates integration with a wide spectrum of biological knowledge. PMID:19496060
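
    The remark that parsed BLAST results can be integrated into analysis pipelines can be illustrated with a minimal sketch. It assumes the hits were exported in the 12-column tabular format of BLAST+ (`-outfmt 6`), which the abstract does not specify. The function name `parse_hits`, the E-value cutoff of 1e-5, and the two sample rows are hypothetical and chosen only for illustration.

```python
import csv
import io

# Column names of the default BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(text, max_evalue=1e-5):
    """Return hits whose E-value is at or below max_evalue.

    The cutoff is an illustrative assumption, not a recommended threshold.
    """
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


# Made-up rows: one strong hit and one weak hit.
SAMPLE = (
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t501\t700\t1e-100\t360\n"
    "query1\tsubjB\t75.0\t60\t15\t0\t10\t69\t5\t64\t0.5\t30.1\n"
)

significant = parse_hits(SAMPLE)
for hit in significant:
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```

    Weak hits are dropped here only to keep the sketch short; as the abstract notes, they can still serve as hints for further analyses.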

  1. Nucleotide sequence of Bacillus phage Nf terminal protein gene.

    PubMed Central

    Leavitt, M C; Ito, J

    1987-01-01

    The nucleotide sequence of Bacillus phage Nf gene E has been determined. Gene E codes for phage terminal protein which is the primer necessary for the initiation of DNA replication. The deduced amino acid sequence of Nf terminal protein is approximately 66% homologous with the terminal proteins of Bacillus phages PZA and phi 29, and shows similar hydropathy and secondary structure predictions. A serine which has been identified as the residue which covalently links the protein to the 5' end of the genome in phi 29 is conserved in all three phages. The hydropathic and secondary structural environment of this serine is similar in these phage terminal proteins and also similar to the linking serine of adenovirus terminal protein. PMID:3601672

  2. Mercury BLASTP: Accelerating Protein Sequence Alignment

    PubMed Central

    Jacob, Arpith; Lancaster, Joseph; Buhler, Jeremy; Harris, Brandon; Chamberlain, Roger D.

    2008-01-01

    Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more running time or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we describe the architecture of the portions of the application that are accelerated in the FPGA, and we also describe the integration of these FPGA-accelerated portions with the existing BLASTP software. We have implemented Mercury BLASTP on a commodity workstation with two Xilinx Virtex-II 6000 FPGAs. We show that the new design runs 11-15 times faster than software BLASTP on a modern CPU while delivering close to 99% identical results. PMID:19492068

  3. Genome nucleotide composition shapes variation in simple sequence repeats.

    PubMed

    Tian, Xiangjun; Strassmann, Joan E; Queller, David C

    2011-02-01

    Simple sequence repeats (SSRs) or microsatellites are a common component of genomes but vary greatly across species in their abundance. We tested the hypothesis that this variation is due in part to AT/GC content of genomes, with genomes biased toward either high AT or high GC generating more short random repeats that are long enough to enhance expansion through slippage during replication. To test this hypothesis, we identified repeats with perfect tandem iterations of 1-6 bp from 25 protists with complete or near-complete genome sequences. As expected, the density and the frequency are highly related to genome AT content, with excellent fits to quadratic regressions with minima near a 50% AT content and rising toward both extremes. Within species, the same trends hold, except the limited variation in AT content within each species places each mainly on the descending (GC rich), middle, or ascending (AT rich) part of the curve. The base usages of repeat motifs are also significantly correlated with genome nucleotide compositions: Percentages of AT-rich motifs rise with the increase of genome AT content but vice versa for GC-rich subgroups. Amino acid homopolymer repeats also show the expected quadratic relationship, with higher abundance in species with AT content biased in either direction. Our results show that genome nucleotide composition explains up to half of the variance in the abundance and motif constitution of SSRs.
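
    The search for perfect tandem iterations of 1-6 bp motifs, together with the AT content it is compared against, can be sketched roughly as follows. This is not the authors' pipeline: the regular expression, the minimum run length `min_len=12`, and the toy sequence are assumptions made for illustration.

```python
import re

# A 1-6 bp motif followed by one or more exact copies of itself.
# The lazy quantifier makes the regex prefer the shortest motif.
SSR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1+")


def find_ssrs(seq, min_len=12):
    """Return (start, motif, copies) for perfect tandem repeats.

    min_len is a hypothetical minimum run length in bp.
    """
    found = []
    for match in SSR_PATTERN.finditer(seq.upper()):
        run = match.group(0)
        motif = match.group(1)
        if len(run) >= min_len:
            found.append((match.start(), motif, len(run) // len(motif)))
    return found


def at_content(seq):
    """Fraction of A and T bases in the sequence."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


# Toy sequence: a dinucleotide repeat and a mononucleotide repeat.
toy = "GGC" + "AT" * 6 + "CC" + "A" * 12 + "G"
ssrs = find_ssrs(toy)
print(ssrs)
print(at_content(toy))
```

    Repeat density per genome could then be regressed on AT content, which is the relationship the abstract reports as quadratic with a minimum near 50% AT.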

  4. Evaluating the Accuracy and Efficiency of Multiple Sequence Alignment Methods

    PubMed Central

    Pervez, Muhammad Tariq; Babar, Masroor Ellahi; Nadeem, Asif; Aslam, Muhammad; Awan, Ali Raza; Aslam, Naeem; Hussain, Tanveer; Naveed, Nasir; Qadri, Salman; Waheed, Usman; Shoaib, Muhammad

    2014-01-01

    A comparison of the 10 most popular Multiple Sequence Alignment (MSA) tools, namely, MUSCLE, MAFFT (L-INS-i), MAFFT (FFT-NS-2), T-Coffee, ProbCons, SATe, Clustal Omega, Kalign, Multalin, and Dialign-TX, is presented. We also focused on the significance of some implementations embedded in the algorithm of each tool. Based on 10 simulated trees with different numbers of taxa generated by R, 400 known alignments and sequence files were constructed using indel-Seq-Gen. A total of 4000 test alignments were generated to study the effect of sequence length, indel size, deletion rate, and insertion rate. Results showed that alignment quality was highly dependent on the number of deletions and insertions in the sequences and that the sequence length and indel size had a weaker effect. Overall, ProbCons was consistently at the top of the list of the evaluated MSA tools. SATe, being a little less accurate, was 529.10% faster than ProbCons and 236.72% faster than MAFFT (L-INS-i). Among other tools, Kalign and MUSCLE achieved the highest sum of pairs. We also considered BAliBASE benchmark datasets, and the results relative to BAliBASE- and indel-Seq-Gen-generated alignments were consistent in most cases. PMID:25574120


  6. Protein folds and families: sequence and structure alignments.

    PubMed

    Holm, L; Sander, C

    1999-01-01

    Dali and HSSP are derived databases organizing protein space in the structurally known regions. We use an automatic structure alignment program (Dali) for the classification of all known 3D structures based on all-against-all comparison of 3D structures in the Protein Data Bank. The HSSP database associates 1D sequences with known 3D structures using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). As a result, the HSSP database not only provides aligned sequence families, but also implies secondary and tertiary structures covering 36% of all sequences in Swiss-Prot. The structure classification by Dali and the sequence families in HSSP can be browsed jointly from a web interface providing a rich network of links between neighbours in fold space, between domains and proteins, and between structures and sequences. In particular, this results in a database of explicit multiple alignments of protein families in the twilight zone of sequence similarity. The organization of protein structures and families provides a map of the currently known regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The databases are available from http://www.embl-ebi.ac.uk/dali/

  7. Nucleotide sequences specific to Yersinia pestis and methods for the detection of Yersinia pestis

    DOEpatents

    McCready, Paula M.; Radnedge, Lyndsay; Andersen, Gary L.; Ott, Linda L.; Slezak, Thomas R.; Kuczmarski, Thomas A.; Motin, Vladinir L.

    2009-02-24

    Nucleotide sequences specific to Yersinia pestis that serve as markers or signatures for identification of this bacterium were identified. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.

  8. Nucleotide sequences specific to Brucella and methods for the detection of Brucella

    SciTech Connect

    McCready, Paula M.; Radnedge, Lyndsay; Andersen, Gary L.; Ott, Linda L.; Slezak, Thomas R.; Kuczmarski, Thomas A.

    2009-02-24

    Nucleotide sequences specific to Brucella that serve as markers or signatures for identification of this bacterium were identified. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.

  9. Nucleotide sequences specific to Francisella tularensis and methods for the detection of Francisella tularensis

    DOEpatents

    McCready, Paula M.; Radnedge, Lyndsay; Andersen, Gary L.; Ott, Linda L.; Slezak, Thomas R.; Kuczmarski, Thomas A.; Vitalis, Elizabeth A

    2007-02-06

    Described herein is the identification of nucleotide sequences specific to Francisella tularensis that serve as markers or signatures for identification of this bacterium. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.

  10. Nucleotide sequences specific to Francisella tularensis and methods for the detection of Francisella tularensis

    DOEpatents

    McCready, Paula M.; Radnedge, Lyndsay; Andersen, Gary L.; Ott, Linda L.; Slezak, Thomas R.; Kuczmarski, Thomas A.; Vitalis, Elizabeth A

    2009-02-24

    Described herein is the identification of nucleotide sequences specific to Francisella tularensis that serve as markers or signatures for identification of this bacterium. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.

  11. Recursive dynamic programming for adaptive sequence and structure alignment

    SciTech Connect

    Thiele, R.; Zimmer, R.; Lengauer, T.

    1995-12-31

    We propose a new alignment procedure that is capable of aligning protein sequences and structures in a unified manner. Recursive dynamic programming (RDP) is a hierarchical method which, on each level of the hierarchy, identifies locally optimal solutions and assembles them into partial alignments of sequences and/or structures. In contrast to classical dynamic programming, RDP can also handle alignment problems that use objective functions not obeying the principle of prefix optimality, e.g. scoring schemes derived from energy potentials of mean force. For such alignment problems, RDP aims at computing solutions that are near-optimal with respect to the involved cost function and biologically meaningful at the same time. Towards this goal, RDP maintains a dynamic balance between different factors governing alignment fitness such as evolutionary relationships and structural preferences. As in the RDP method gaps are not scored explicitly, the problematic assignment of gap cost parameters is circumvented. In order to evaluate the RDP approach we analyse whether known and accepted multiple alignments based on structural information can be reproduced with the RDP method.

  12. Generalized Levy-walk model for DNA nucleotide sequences

    NASA Technical Reports Server (NTRS)

    Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Simons, M.; Stanley, H. E.

    1993-01-01

    We propose a generalized Levy walk to model fractal landscapes observed in noncoding DNA sequences. We find that this model provides a very close approximation to the empirical data and explains a number of statistical properties of genomic DNA sequences such as the distribution of strand-biased regions (those with an excess of one type of nucleotide) as well as local changes in the slope of the correlation exponent $\alpha$. The generalized Levy-walk model simultaneously accounts for the long-range correlations in noncoding DNA sequences and for the apparently paradoxical finding of long subregions of biased random walks (length $l_j$) within these correlated sequences. In the generalized Levy-walk model, the $l_j$ are chosen from a power-law distribution $P(l_j) \propto l_j^{-\mu}$. The correlation exponent $\alpha$ is related to $\mu$ through $\alpha = 2 - \mu/2$ if $2 < \mu < 3$. The model is consistent with the finding of "repetitive elements" of variable length interspersed within noncoding DNA.
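
    The two relations stated above, a power-law distribution of subregion lengths and a correlation exponent of 2 - mu/2 for mu between 2 and 3, can be sketched numerically. This is a simplified illustration rather than the authors' model: the continuous power law, the lower cutoff `l_min`, the inverse-transform sampler, and the function names are all assumptions.

```python
import random


def correlation_exponent(mu):
    """alpha = 2 - mu/2, stated in the abstract for 2 < mu < 3."""
    if not 2 < mu < 3:
        raise ValueError("relation is stated only for 2 < mu < 3")
    return 2 - mu / 2


def sample_lengths(mu, n, l_min=1.0, seed=0):
    """Draw n lengths from a density proportional to l**(-mu), l >= l_min.

    Uses inverse-transform sampling; l_min and seed are hypothetical.
    """
    rng = random.Random(seed)
    lengths = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
        lengths.append(l_min * u ** (-1.0 / (mu - 1.0)))
    return lengths


mu = 2.5
alpha = correlation_exponent(mu)
lengths = sample_lengths(mu, n=1000)
print(alpha, min(lengths), max(lengths))
```

    Each sampled length would correspond to one biased subregion of the walk; a heavy tail in the lengths is what produces the long subregions mentioned in the abstract.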

  13. Image-based temporal alignment of echocardiographic sequences

    NASA Astrophysics Data System (ADS)

    Danudibroto, Adriyana; Bersvendsen, Jørn; Mirea, Oana; Gerard, Olivier; D'hooge, Jan; Samset, Eigil

    2016-04-01

    Temporal alignment of echocardiographic sequences enables fair comparisons of multiple cardiac sequences by showing corresponding frames at given time points in the cardiac cycle. It is also essential for spatial registration of echo volumes where several acquisitions are combined to enhance image quality or to form a larger field of view. In this study, three different image-based temporal alignment methods were investigated: first, a method based on dynamic time warping (DTW); second, a spline-based method that optimized the similarity between temporal characteristic curves of the cardiac cycle using 1D cubic B-spline interpolation; and third, a variant of the spline-based method with piecewise modification. These methods were tested on in-vivo data sets of 19 echo sequences. For each sequence, the mitral valve opening (MVO) time was manually annotated. The results showed that the average MVO timing error for all methods is well under the time resolution of the sequences.
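
    Dynamic time warping, the first of the three methods, can be sketched in its textbook form. The abstract does not describe how the temporal characteristic curves are extracted from the images, so the two 1D curves below are made-up stand-ins, and the absolute-difference cost is an assumption.

```python
def dtw(a, b):
    """Classic dynamic time warping between two 1D curves.

    Returns the accumulated cost and the frame-to-frame warping path.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Hypothetical characteristic curves; curve_b repeats its first frame.
curve_a = [0, 1, 2, 1, 0]
curve_b = [0, 0, 1, 2, 1, 0]
distance, path = dtw(curve_a, curve_b)
print(distance, path)
```

    The warping path pairs each frame of one sequence with a frame of the other, which is the correspondence needed to show matching time points in the cardiac cycle.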

  14. Empirical Bayes Estimation of Coalescence Times from Nucleotide Sequence Data.

    PubMed

    King, Leandra; Wakeley, John

    2016-09-01

    We demonstrate the advantages of using information at many unlinked loci to better calibrate estimates of the time to the most recent common ancestor (TMRCA) at a given locus. To this end, we apply a simple empirical Bayes method to estimate the TMRCA. This method is both asymptotically optimal, in the sense that the estimator converges to the true value when the number of unlinked loci for which we have information is large, and has the advantage of not making any assumptions about demographic history. The algorithm works as follows: we first split the sample at each locus into inferred left and right clades to obtain many estimates of the TMRCA, which we can average to obtain an initial estimate of the TMRCA. We then use nucleotide sequence data from other unlinked loci to form an empirical distribution that we can use to improve this initial estimate. PMID:27440864

  15. Testing evolutionary models to explain the process of nucleotide substitution in gut bacterial 16S rRNA gene sequences.

    PubMed

    Garcia-Mazcorro, Jose F

    2013-09-01

    The 16S rRNA gene has been widely used as a marker of gut bacterial diversity and phylogeny, yet we do not know the model of evolution that best explains the differences in its nucleotide composition within and among taxa. Over 46 000 good-quality near-full-length 16S rRNA gene sequences from five bacterial phyla were obtained from the ribosomal database project (RDP) by study and, when possible, by within-study characteristics (e.g. anatomical region). Using alignments (RDPX and MUSCLE) of unique sequences, the FINDMODEL tool available at http://www.hiv.lanl.gov/ was utilized to find the model of character evolution (28 models were available) that best describes the input sequence data, based on the Akaike information criterion. The results showed variable levels of agreement (from 33% to 100%) in the chosen models between the RDP-based and the MUSCLE-based alignments among the taxa. Moreover, subgroups of sequences (using either alignment method) from the same study were often explained by different models. Nonetheless, the different representatives of the gut microbiota were explained by different proportions of the available models. This is the first report using evolutionary models to explain the process of nucleotide substitution in gut bacterial 16S rRNA gene sequences. PMID:23808388
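
    Model choice by the Akaike information criterion can be shown with a short sketch, using AIC = 2k - 2 ln L, where k is the number of free parameters and L the maximized likelihood. The log-likelihood values below are invented, only four of the 28 models are shown, and the parameter counts exclude branch lengths (an assumption that holds when all models are fitted on the same tree).

```python
def aic(num_params, log_likelihood):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood


# (free substitution-model parameters, hypothetical maximized log-likelihood)
candidates = {
    "JC69": (0, -1250.0),
    "K80": (1, -1240.0),
    "HKY85": (4, -1235.0),
    "GTR": (8, -1233.5),
}

scores = {name: aic(k, lnl) for name, (k, lnl) in candidates.items()}
best_model = min(scores, key=scores.get)
print(scores)
print(best_model)
```

    The richest model has the highest likelihood here but is penalized for its extra parameters, so a simpler model is selected.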

  16. Cloning, nucleotide sequence, and expression of Achromobacter protease I gene.

    PubMed

    Ohara, T; Makino, K; Shinagawa, H; Nakata, A; Norioka, S; Sakiyama, F

    1989-12-01

    Achromobacter protease I (API) is a lysine-specific serine protease which hydrolyzes specifically the lysyl peptide bond. A gene coding for API was cloned from Achromobacter lyticus M497-1. Nucleotide sequence of the cloned DNA fragment revealed that the gene coded for a single polypeptide chain of 653 amino acids. The N-terminal 205 amino acids, including signal peptide and the threonine/serine-rich C-terminal 180 amino acids are flanking the 268 amino acid-mature protein which was identified by protein sequencing. Escherichia coli carrying a plasmid containing the cloned API gene overproduced and secreted a protein of Mr 50,000 (API') into the periplasm. This protein exhibited a distinct endopeptidase activity specific for lysyl bonds as well. The N-terminal amino acid sequence of API' was the same as mature API, suggesting that the enzyme retained the C-terminal extended peptide chain. The present experiments indicate that API, an extracellular protease produced by gram-negative bacteria, is synthesized in vivo as a precursor protein bearing long extended peptide chains at both N and C termini. PMID:2684982

  17. Phylo-VISTA: Interactive visualization of multiple DNA sequence alignments

    SciTech Connect

    Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.; Brudno, Michael; Batzoglou, Serafim; Bethel, E. Wes; Rubin, Edward M.; Hamann, Bernd; Dubchak, Inna

    2004-01-15

    The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient these visualization algorithms must support the ability to accommodate consistently a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. Results: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a framework based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. The combination of multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. Availability: Phylo-VISTA is available at http://www-gsd.lbl.gov/phylovista. It requires an Internet browser with Java Plugin 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu

  18. Complete nucleotide sequence of Nootka lupine vein-clearing virus.

    PubMed

    Robertson, Nancy L; Côté, Fabien; Paré, Christine; Leblanc, Eric; Bergeron, Michel G; Leclerc, Denis

    2007-12-01

    The complete genome sequence of Nootka lupine vein-clearing virus (NLVCV) was determined to be 4,172 nucleotides in length, containing four open reading frames (ORFs) with a genetic organization similar to that of virus species in the genus Carmovirus, family Tombusviridae. The order and gene product size, starting from the 5'-proximal ORF, consisted of: (1) polymerase/replicase gene, ORF1 (p27) and ORF1RT (readthrough) (p87), (2) movement proteins ORF2 (p7) and ORF3 (p9), and (3) the 3'-proximal coat protein ORF4 (p37). The genomic 5'- and 3'-proximal termini contained a short (59 nt) and a relatively longer (405 nt) untranslated region, respectively. The longer replicase gene product contained the GDD motif common to RNA-dependent RNA polymerases. Phylogenetically, NLVCV formed a subgroup with the following four carmoviruses when separately comparing the amino acids of the coat protein or replicase protein: Angelonia flower break virus (AnFBV), Carnation mottle virus (CarMV), Pelargonium flower break virus (PFBV), and Saguaro cactus virus (SgCV). Whole genome nucleotide analysis (percent identities) among the carmoviruses with NLVCV suggested a similar pattern. The species demarcation criteria in the genus Carmovirus for the amino acid sequence identity of the polymerase (<52%) and coat (<41%) protein genes precluded NLVCV from being a distinct species and instead placed it as a tentative strain of CarMV, PFBV, or SgCV when both the polymerase and CP were used as the determining factors. In contrast, the species criteria that included different host ranges with no overlap and lack of serological relatedness between NLVCV and the carmoviruses suggested that NLVCV was a distinct species. The relatively low cutoff percentages allowed for the polymerase and CP genes to dictate the inclusion/exclusion of a distinct carmovirus species should be reevaluated. Therefore, at this time we have concluded that NLVCV should be classified as a tentative new species in the genus Carmovirus.

  19. DNA sequencing using differential extension with nucleotide subsets (DENS).

    PubMed Central

    Raja, M C; Zevin-Sonkin, D; Shwartzburd, J; Rozovskaya, T A; Sobolev, I A; Chertkov, O; Ramanathan, V; Lvovsky, L; Ulanovsky, L E

    1997-01-01

    Here we describe template directed enzymatic synthesis of unique primers, avoiding the chemical synthesis step in primer walking. We have termed this conceptually new technique DENS (differential extension with nucleotide subsets). DENS works by selectively extending a short primer, making it a long one at the intended site only. The procedure starts with a limited initial extension of the primer (at 20-30 degrees C) in the presence of only two out of the four possible dNTPs. The primer is extended by 6-9 bases or longer at the intended priming site, which is deliberately selected, (as is the two-dNTP set), to maximize the extension length. The subsequent termination reaction at 60-65 degrees C then accepts the extended primer at the intended site, but not at alternative sites, where the initial extension (if any) is generally much shorter. DENS allows the use of primers as long as 8mers (degenerate in two positions) which prime much more strongly than modular primers involving 5-7mers and which (unlike the latter) can be used with thermostable polymerases, thus allowing cycle-sequencing with dye-terminators compatible with Taq DNA polymerase, as well as making double-stranded DNA sequencing more robust. PMID:9016632

  20. Nucleotide sequence determines the accelerated rate of point mutations.

    PubMed

    Kini, R Manjunatha; Chinnasamy, Arunkumar

    2010-09-01

    Although the theory of evolution was put forth about 150 years ago our understanding of how molecules drive evolution remains poor. It is well-established that proteins evolve at different rates, essentially based on their functional role and three-dimensional structure. However, the highly variable rates of evolution of different proteins - especially the rapidly evolving ones - within a single organism are poorly understood. Using examples of genes for fast-evolving toxins and human hereditary diseases, we show for the first time that specific nucleotide sequences appear to determine point mutation rates. Based on mutation rates, we have classified triplets (not just codons) into stable, unstable and intermediate groups. Toxin genes contain a relatively higher percentage of unstable triplets in their exons compared to introns, whereas non-toxin genes contain a higher percentage of unstable triplets in their introns. Thus the distribution of stable and unstable triplets is correlated with and may explain the accelerated evolution of point mutations in toxins. Similarly, at the genomic level, lower organisms with genes that evolve faster contain a higher percentage of unstable triplets compared to higher organisms. These findings show that mutation rates of proteins, and hence of the organisms, are DNA sequence-dependent and thus provide a proximate mechanism of evolution at the molecular level. PMID:20362603

  1. Single nucleotide polymorphisms associated with rat expressed sequences.

    PubMed

    Guryev, Victor; Berezikov, Eugene; Malik, Rainer; Plasterk, Ronald H A; Cuppen, Edwin

    2004-07-01

    Single nucleotide polymorphisms (SNPs) are the most common source of genetic variation in populations and are thus most likely to account for the majority of phenotypic and behavioral differences between individuals or strains. Although the rat is extensively studied for the latter, data on naturally occurring polymorphisms are mostly lacking. We have used publicly available sequences consisting of whole-genome shotgun (WGS), expressed sequence tag (EST), and mRNA data as a source for the in silico identification of SNPs in gene-coding regions and have identified a large collection of 33,305 high-quality candidate SNPs. Experimental verification of 471 candidate SNPs using a limited set of rat isolates revealed a confirmation rate of approximately 50%. Although the majority of SNPs were identified between Sprague-Dawley (EST data) and Brown Norway (WGS data) strains, we found that 66% of the verified variations are common among different rat strains. All SNPs were extensively annotated, including chromosomal and genetic map information, and nonsynonymous SNPs were analyzed by SIFT and PolyPhen prediction programs for their potential deleterious effect on protein function. Interestingly, we retrieved three SNPs from the database that result in the introduction of a premature stop codon and that could be confirmed experimentally. Two of these "in silico-identified knockouts" reside in interesting QTL regions. Data are publicly available via a Web interface (http://cascad.niob.knaw.nl), allowing simple and advanced search queries.

  2. HIVE-Hexagon: High-Performance, Parallelized Sequence Alignment for Next-Generation Sequencing Data Analysis

    PubMed Central

    Santana-Quintero, Luis; Dingerdissen, Hayley; Thierry-Mieg, Jean; Mazumder, Raja; Simonyan, Vahan

    2014-01-01

    Due to the size of Next-Generation Sequencing data, the computational challenge of sequence alignment has been vast. Inexact alignments can take up to 90% of total CPU time in bioinformatics pipelines. High-performance Integrated Virtual Environment (HIVE), a cloud-based environment optimized for storage and analysis of extra-large data, presents an algorithmic solution: the HIVE-hexagon DNA sequence aligner. HIVE-hexagon implements novel approaches to exploit both characteristics of sequence space and CPU, RAM and Input/Output (I/O) architecture to quickly compute accurate alignments. Key components of HIVE-hexagon include non-redundification and sorting of sequences; floating diagonals of linearized dynamic programming matrices; and consideration of cross-similarity to minimize computations. Availability: https://hive.biochemistry.gwu.edu/hive/ PMID:24918764

  3. The nucleotide sequence of the uvrD gene of E. coli.

    PubMed Central

    Finch, P W; Emmerson, P T

    1984-01-01

    The nucleotide sequence of a cloned section of the E. coli chromosome containing the uvrD gene has been determined. The coding region for the UvrD protein consists of 2,160 nucleotides which would direct the synthesis of a polypeptide 720 amino acids long with a calculated molecular weight of 82 kd. The predicted amino acid sequence of the UvrD protein has been compared with the amino acid sequences of other known adenine nucleotide binding proteins and a common sequence has been identified, thought to contribute towards adenine nucleotide binding. PMID:6379604

  4. SDT: a virus classification tool based on pairwise sequence alignment and identity calculation.

    PubMed

    Muhire, Brejnev Muhizi; Varsani, Arvind; Martin, Darren Patrick

    2014-01-01

    The perpetually increasing rate at which viral full-genome sequences are being determined is creating a pressing demand for computational tools that will aid the objective classification of these genome sequences. Taxonomic classification approaches that are based on pairwise genetic identity measures are potentially highly automatable and are progressively gaining favour with the International Committee on Taxonomy of Viruses (ICTV). There are, however, various issues with the calculation of such measures that could potentially undermine the accuracy and consistency with which they can be applied to virus classification. Firstly, pairwise sequence identities computed based on multiple sequence alignments rather than on multiple independent pairwise alignments can lead to the deflation of identity scores with increasing dataset sizes. Also, when gap-characters need to be introduced during sequence alignments to account for insertions and deletions, methodological variations in the way that these characters are introduced and handled during pairwise genetic identity calculations can cause high degrees of inconsistency in the way that different methods classify the same sets of sequences. Here we present Sequence Demarcation Tool (SDT), a free user-friendly computer program that aims to provide a robust and highly reproducible means of objectively using pairwise genetic identity calculations to classify any set of nucleotide or amino acid sequences. SDT can produce publication quality pairwise identity plots and colour-coded distance matrices to further aid the classification of sequences according to ICTV approved taxonomic demarcation criteria. Besides a graphical interface version of the program for Windows computers, command-line versions of the program are available for a variety of different operating systems (including a parallel version for cluster computing platforms). PMID:25259891
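
    The point that gap handling changes pairwise identity scores can be made concrete with a minimal sketch. It is not SDT's algorithm, and the abstract does not say which convention SDT adopts. The function name, the `count_gaps` switch, and the two aligned toy sequences are hypothetical.

```python
def pairwise_identity(a, b, count_gaps=False):
    """Identity between two already-aligned sequences of equal length.

    count_gaps=False: columns containing a gap are ignored.
    count_gaps=True: a gap opposite a residue counts as a mismatch.
    Columns where both sequences have a gap are always ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            if count_gaps:
                compared += 1
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


# Toy pairwise alignment with one gap and one substitution.
seq1 = "ACGT-ACGT"
seq2 = "ACGTTACCT"
ignoring = pairwise_identity(seq1, seq2)
counting = pairwise_identity(seq1, seq2, count_gaps=True)
print(ignoring, counting)
```

    The same alignment yields two different identity values, which could place a pair on either side of a demarcation threshold depending on the convention.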

  5. SDT: a virus classification tool based on pairwise sequence alignment and identity calculation.

    PubMed

    Muhire, Brejnev Muhizi; Varsani, Arvind; Martin, Darren Patrick

    2014-01-01

    The perpetually increasing rate at which viral full-genome sequences are being determined is creating a pressing demand for computational tools that will aid the objective classification of these genome sequences. Taxonomic classification approaches that are based on pairwise genetic identity measures are potentially highly automatable and are progressively gaining favour with the International Committee on Taxonomy of Viruses (ICTV). There are, however, various issues with the calculation of such measures that could potentially undermine the accuracy and consistency with which they can be applied to virus classification. Firstly, pairwise sequence identities computed based on multiple sequence alignments rather than on multiple independent pairwise alignments can lead to the deflation of identity scores with increasing dataset sizes. Also, when gap-characters need to be introduced during sequence alignments to account for insertions and deletions, methodological variations in the way that these characters are introduced and handled during pairwise genetic identity calculations can cause high degrees of inconsistency in the way that different methods classify the same sets of sequences. Here we present Sequence Demarcation Tool (SDT), a free user-friendly computer program that aims to provide a robust and highly reproducible means of objectively using pairwise genetic identity calculations to classify any set of nucleotide or amino acid sequences. SDT can produce publication quality pairwise identity plots and colour-coded distance matrices to further aid the classification of sequences according to ICTV approved taxonomic demarcation criteria. Besides a graphical interface version of the program for Windows computers, command-line versions of the program are available for a variety of different operating systems (including a parallel version for cluster computing platforms).
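
    The abstract notes that the way gap characters are handled can change pairwise identity scores. The sketch below illustrates that point for a single aligned pair. It is not SDT's actual algorithm: the function name `pairwise_identity`, the `count_gaps` switch, and the toy sequences are assumptions made only for this example.

```python
def pairwise_identity(a: str, b: str, count_gaps: bool = False) -> float:
    """Percent identity between two aligned sequences of equal length.

    count_gaps=False: columns with a gap in either sequence are ignored.
    count_gaps=True:  columns with a gap in exactly one sequence count
                      as mismatches. Columns where both sequences have
                      a gap are always skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # shared gap carries no information
        if x == "-" or y == "-":
            if count_gaps:
                compared += 1  # gap column treated as a mismatch
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned pair (hypothetical data): 8 matching columns, 2 gap columns.
seq1 = "ACGT-ACGTA"
seq2 = "ACGTTAC-TA"
ignoring_gaps = pairwise_identity(seq1, seq2)
counting_gaps = pairwise_identity(seq1, seq2, count_gaps=True)
print(ignoring_gaps, counting_gaps)
```

    The same pair scores differently under the two conventions, which is the kind of inconsistency between methods that the abstract describes.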

  6. Multiple sequence alignment in HTML: colored, possibly hyperlinked, compact representations.

    PubMed

    Campagne, F; Maigret, B

    1998-02-01

    Protein sequence alignments are widely used in protein structure prediction, protein engineering, modeling of proteins, etc. This type of representation is useful at different stages of scientific activity: looking at previous results, working on a research project, and presenting the results. There is a need to make it available through a network (intranet or WWW), in a way that allows biologists, chemists, and noncomputer specialists to look at the data and carry on research, possibly collaboratively. Previous methods (text-based, Java-based) are reported and their advantages are discussed. We have developed two novel approaches to represent the alignments as colored, hyper-linked HTML pages. The first method creates an HTML page that makes efficient use of the image cache mechanism of a WWW browser, so the user waits for images to load over the network only for the first alignment viewed and can then browse other alignments without delay. The generated pages can be browsed with any HTML 2.0-compliant browser. The second method that we propose uses W3C CSS1 style sheets to render alignments. This new method generates pages that require recent browsers to be viewed. We implemented these methods in the Viseur program and made a WWW service available that allows a user to convert an MSF alignment file to HTML for WWW publishing. The latter service is available at http://www.lctn.u-nancy.fr/viseur/services.html.
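
    The style-sheet approach described above can be illustrated with a minimal sketch that wraps each residue in a `span` whose class selects a colour. This is not the Viseur implementation: the residue groupings, class names, colours, and the function `alignment_to_html` are all hypothetical choices for illustration.

```python
import html

# Hypothetical residue groups; real colour schemes differ between tools.
RESIDUE_CLASSES = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}

# One CSS rule per group, so colours are defined once for the whole page.
CSS_RULES = "\n".join([
    ".hydrophobic { background: #ffd27f; }",
    ".polar { background: #b3e6b3; }",
    ".positive { background: #9ec5ff; }",
    ".negative { background: #ff9e9e; }",
])


def residue_class(residue):
    """Return the CSS class for a residue, or None for gaps and unknowns."""
    for name, members in RESIDUE_CLASSES.items():
        if residue.upper() in members:
            return name
    return None


def alignment_to_html(alignment):
    """Render a {name: aligned_sequence} mapping as a coloured HTML page."""
    width = max(len(name) for name in alignment)
    rows = []
    for name, seq in alignment.items():
        cells = []
        for residue in seq:
            cls = residue_class(residue)
            text = html.escape(residue)
            if cls:
                cells.append('<span class="' + cls + '">' + text + "</span>")
            else:
                cells.append(text)  # gaps stay uncoloured
        rows.append(html.escape(name.ljust(width)) + " " + "".join(cells))
    parts = [
        "<html><head><style>",
        CSS_RULES,
        "</style></head>",
        "<body><pre>",
        "\n".join(rows),
        "</pre></body></html>",
    ]
    return "\n".join(parts)


page = alignment_to_html({"seq1": "AK-D", "seq2": "AR-E"})
print(page)
```

    Because the colours live in the style sheet, the page body carries only short class names, which keeps the representation compact.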

  7. Spatially localized generation of nucleotide sequence-specific DNA damage

    PubMed Central

    Oh, Dennis H.; King, Brett A.; Boxer, Steven G.; Hanawalt, Philip C.

    2001-01-01

    Psoralens linked to triplex-forming oligonucleotides (psoTFOs) have been used in conjunction with laser-induced two-photon excitation (TPE) to damage a specific DNA target sequence. To demonstrate that TPE can initiate photochemistry resulting in psoralen–DNA photoadducts, target DNA sequences were incubated with psoTFOs to form triple-helical complexes and then irradiated in liquid solution with pulsed 765-nm laser light, which is half the quantum energy required for conventional one-photon excitation, as used in psoralen + UV A radiation (320–400 nm) therapy. Target DNA acquired strand-specific psoralen monoadducts in a light dose-dependent fashion. To localize DNA damage in a model tissue-like medium, a DNA–psoTFO mixture was prepared in a polyacrylamide gel and then irradiated with a converging laser beam targeting the rear of the gel. The highest number of photoadducts formed at the rear while relatively sparing DNA at the front of the gel, demonstrating spatial localization of sequence-specific DNA damage by TPE. To assess whether TPE treatment could be extended to cells without significant toxicity, cultured monolayers of normal human dermal fibroblasts were incubated with tritium-labeled psoralen without TFO to maximize detectable damage and irradiated by TPE. DNA from irradiated cells treated with psoralen exhibited a 4- to 7-fold increase in tritium activity relative to untreated controls. Functional survival assays indicated that the psoralen–TPE treatment was not toxic to cells. These results demonstrate that DNA damage can be simultaneously manipulated at the nucleotide level and in three dimensions. This approach for targeting photochemical DNA damage may have photochemotherapeutic applications in skin and other optically accessible tissues. PMID:11572980

  8. Spatially localized generation of nucleotide sequence-specific DNA damage.

    PubMed

    Oh, D H; King, B A; Boxer, S G; Hanawalt, P C

    2001-09-25

    Psoralens linked to triplex-forming oligonucleotides (psoTFOs) have been used in conjunction with laser-induced two-photon excitation (TPE) to damage a specific DNA target sequence. To demonstrate that TPE can initiate photochemistry resulting in psoralen-DNA photoadducts, target DNA sequences were incubated with psoTFOs to form triple-helical complexes and then irradiated in liquid solution with pulsed 765-nm laser light, which is half the quantum energy required for conventional one-photon excitation, as used in psoralen + UV A radiation (320-400 nm) therapy. Target DNA acquired strand-specific psoralen monoadducts in a light dose-dependent fashion. To localize DNA damage in a model tissue-like medium, a DNA-psoTFO mixture was prepared in a polyacrylamide gel and then irradiated with a converging laser beam targeting the rear of the gel. The highest number of photoadducts formed at the rear while relatively sparing DNA at the front of the gel, demonstrating spatial localization of sequence-specific DNA damage by TPE. To assess whether TPE treatment could be extended to cells without significant toxicity, cultured monolayers of normal human dermal fibroblasts were incubated with tritium-labeled psoralen without TFO to maximize detectable damage and irradiated by TPE. DNA from irradiated cells treated with psoralen exhibited a 4- to 7-fold increase in tritium activity relative to untreated controls. Functional survival assays indicated that the psoralen-TPE treatment was not toxic to cells. These results demonstrate that DNA damage can be simultaneously manipulated at the nucleotide level and in three dimensions. This approach for targeting photochemical DNA damage may have photochemotherapeutic applications in skin and other optically accessible tissues. PMID:11572980

  9. The impact of single substitutions on multiple sequence alignments.

    PubMed

    Klaere, Steffen; Gesell, Tanja; von Haeseler, Arndt

    2008-12-27

    We introduce another view of sequence evolution. Contrary to other approaches, we model the substitution process in two steps. First we assume (arbitrary) scaled branch lengths on a given phylogenetic tree. Second we allocate a Poisson distributed number of substitutions on the branches. The probability of placing a mutation on a branch is proportional to its relative branch length. More importantly, the action of a single mutation on an alignment column is described by a doubly stochastic matrix, the so-called one-step mutation matrix. This matrix leads to analytical formulae for the posterior probability distribution of the number of substitutions for an alignment column.
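
    The two-step substitution process in this abstract can be sketched as a small simulation: draw a Poisson number of substitutions, then place each one on a branch with probability proportional to its length. The abstract does not give the Poisson mean, so using the total tree length is an assumption here, as are the function names and the toy branch lengths.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw from a Poisson distribution (Knuth's method, small means only)."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def allocate_substitutions(branch_lengths, rng):
    """Place a Poisson number of substitutions on the branches of a tree.

    branch_lengths maps a branch name to its scaled length. The Poisson
    mean is taken to be the total tree length (an assumption), and each
    substitution lands on a branch with probability equal to that
    branch's share of the total length.
    """
    total = sum(branch_lengths.values())
    n_substitutions = sample_poisson(total, rng)
    names = list(branch_lengths)
    weights = [branch_lengths[name] / total for name in names]
    counts = {name: 0 for name in names}
    for name in rng.choices(names, weights=weights, k=n_substitutions):
        counts[name] += 1
    return counts


# Hypothetical three-branch tree with scaled branch lengths.
rng = random.Random(1)
print(allocate_substitutions({"a": 0.5, "b": 1.5, "c": 1.0}, rng))
```

    The one-step mutation matrix from the abstract is not modelled here; the sketch covers only where substitutions fall on the tree.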

  10. Review of alignment and SNP calling algorithms for next-generation sequencing data.

    PubMed

    Mielczarek, M; Szyda, J

    2016-02-01

    Application of the massively parallel sequencing technology has become one of the most important issues in life sciences. Therefore, it was crucial to develop bioinformatics tools for next-generation sequencing (NGS) data processing. Currently, two of the most significant tasks are alignment to a reference genome and detection of single nucleotide polymorphisms (SNPs). In many types of genomic analyses, great numbers of reads need to be mapped to the reference genome; therefore, selection of the aligner is an essential step in NGS pipelines. Two main algorithms, suffix tries and hash tables, have been introduced for this purpose. Suffix array-based aligners are memory-efficient and work faster than hash-based aligners, but they are less accurate. In contrast, hash table algorithms tend to be slower, but more sensitive. SNP and genotype callers may also be divided into two main approaches: heuristic and probabilistic methods. A variety of software has been subsequently developed over the past several years. In this paper, we briefly review the current development of NGS data processing algorithms and present the available software.
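
    As a rough illustration of the hash-table strategy mentioned in this abstract, the sketch below indexes every k-mer of a reference and lets the k-mers of a read vote for a mapping position. It is a toy seeding step only, not the algorithm of any particular aligner. The function names, the choice of k = 4, and the sequences are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Rank reference start positions by the number of supporting k-mers.

    Each exact k-mer hit at reference position p, found at read offset o,
    votes for the read starting at p - o. Results are sorted by vote
    count (descending), then by position.
    """
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical reference; the read copies positions 8-17 with one mismatch.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "TAGCCGCTAG"
index = build_kmer_index(reference, k=4)
hits = candidate_positions(read, index, k=4)
print(hits[0])
```

    A real aligner would follow this seeding step with an extension or verification step that scores the full read against each candidate position.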

  11. Sequence Alignment Tools: One Parallel Pattern to Rule Them All?

    PubMed Central

    2014-01-01

    In this paper, we advocate high-level programming methodology for next generation sequencers (NGS) alignment tools for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools to their porting onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are liberated from all complex aspects of parallel programming, such as synchronisation protocols, and task scheduling, gaining more possibility for seamless performance tuning. In this work, we show some use cases in which, by using a high-level approach for parallelising NGS tools, it is possible to obtain comparable or even better absolute performance for all used datasets. PMID:25147803

  12. Nucleotide sequence and temporal expression of a baculovirus regulatory gene.

    PubMed

    Guarino, L A; Summers, M D

    1987-07-01

    The nucleotide sequence of a trans-activating regulatory gene (IE-1) of the baculovirus Autographa californica nuclear polyhedrosis virus has been determined. This gene encodes a protein of 581 amino acids with a predicted molecular weight of 66,856. A DNA fragment containing the entire coding sequence of IE-1 was inserted downstream of an RNA promoter. Subsequent cell-free transcription and translation directed the synthesis of a single peptide with an apparent molecular weight of 70,000. Quantitative S1 nuclease analysis indicated that IE-1 was maximally synthesized during a 1-h virus adsorption period and that steady-state levels of IE-1 message were maintained during the first 24 h of infection. Northern blot hybridization indicated that several late transcripts which overlap the IE-1 gene were transcribed from both strands. The precise locations of the 5' and 3' ends of these overlapping transcripts were mapped using S1 nuclease. The overlapping transcripts were grouped in two transcriptional units. One unit was composed of IE-1 and overlapping gamma transcripts which initiated upstream of IE-1 and terminated downstream of IE-1. The other unit, transcribed from the opposite strand, consisted of gamma transcripts with coterminal 5' ends and extended 3' ends. The shorter, more abundant transcripts in this unit overlapped 30 to 40 bases of IE-1 at the 3' end, while the longer transcripts overlapped the entire IE-1 gene. Transcription of several early A. californica nuclear polyhedrosis virus genes, in addition to 39K, was shown to be trans-activated by IE-1, indicating that IE-1 may have a central role in the regulation of beta-gene expression. PMID:16789264

  13. Complete nucleotide sequence of a monopartite Begomovirus and associated satellites infecting Carica papaya in Nepal.

    PubMed

    Shahid, M S; Yoshida, S; Khatri-Chhetri, G B; Briddon, R W; Natsuaki, K T

    2013-06-01

    Carica papaya (papaya) is a fruit crop that is cultivated mostly in kitchen gardens throughout Nepal. Leaf samples of C. papaya plants with leaf curling, vein darkening, vein thickening, and a reduction in leaf size were collected from a garden in Darai village, Rampur, Nepal in 2010. Full-length clones of a monopartite Begomovirus, a betasatellite and an alphasatellite were isolated. The complete nucleotide sequence of the Begomovirus showed the arrangement of genes typical of Old World begomoviruses with the highest nucleotide sequence identity (>99%) to an isolate of Ageratum yellow vein virus (AYVV), confirming it as an isolate of AYVV. The complete nucleotide sequence of the betasatellite showed greater than 89% nucleotide sequence identity to an isolate of Tomato leaf curl Java betasatellite originating from Indonesia. The sequence of the alphasatellite displayed 92% nucleotide sequence identity to Sida yellow vein China alphasatellite. This is the first identification of these components in Nepal and the first time they have been identified in papaya.

  14. The nucleotide sequence of the amiE gene of Pseudomonas aeruginosa.

    PubMed

    Brammar, W J; Charles, I G; Matfield, M; Liu, C P; Drew, R E; Clarke, P H

    1987-05-11

    The nucleotide sequence of the amiE gene, encoding the aliphatic amidase of Pseudomonas aeruginosa, has been determined. The sequence of 1038 nucleotides shows a strong bias in favour of codons with G or C in the third position, and only 44 different codons are utilised.
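
    The two statistics reported for amiE, the number of distinct codons used and the bias towards G or C in the third codon position, can be computed as in the sketch below. The toy sequence is hypothetical and is not the amiE sequence. The function names are likewise chosen only for this example.

```python
from collections import Counter


def codon_usage(cds):
    """Count codons in a coding sequence read in frame from position 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def gc3_fraction(codons):
    """Fraction of codons whose third position is G or C."""
    total = sum(codons.values())
    gc3 = sum(n for codon, n in codons.items() if codon[2] in "GC")
    return gc3 / total if total else 0.0


# Toy coding sequence (hypothetical): ATG GCC GCG CTG ACC TAA
usage = codon_usage("ATGGCCGCGCTGACCTAA")
print(len(usage), "distinct codons; GC3 fraction:", gc3_fraction(usage))
```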

  16. Next Generation Semiconductor Based Sequencing of the Donkey (Equus asinus) Genome Provided Comparative Sequence Data against the Horse Genome and a Few Millions of Single Nucleotide Polymorphisms.

    PubMed

    Bertolini, Francesca; Scimone, Concetta; Geraci, Claudia; Schiavo, Giuseppina; Utzeri, Valerio Joe; Chiofalo, Vincenzo; Fontanesi, Luca

    2015-01-01

    Few studies have so far investigated the donkey (Equus asinus) at the whole-genome level. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared the sequence information obtained with the available donkey draft genome (and the Illumina reads from which it originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than the number aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionarily important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources. PMID:26151450

  17. Next Generation Semiconductor Based Sequencing of the Donkey (Equus asinus) Genome Provided Comparative Sequence Data against the Horse Genome and a Few Millions of Single Nucleotide Polymorphisms

    PubMed Central

    Bertolini, Francesca; Scimone, Concetta; Geraci, Claudia; Schiavo, Giuseppina; Utzeri, Valerio Joe; Chiofalo, Vincenzo; Fontanesi, Luca

    2015-01-01

    Few studies have so far investigated the donkey (Equus asinus) at the whole-genome level. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared the sequence information obtained with the available donkey draft genome (and the Illumina reads from which it originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than the number aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionarily important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources. PMID:26151450
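
    The abstract attributes the difference in aligned read counts to the larger N50 of the horse assembly. For readers unfamiliar with the statistic, the sketch below computes N50 from a list of contig or scaffold lengths under the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The lengths used are invented for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are taken from longest to shortest until their combined
    length reaches at least half of the total; the length of the last
    sequence taken is the N50. An empty input gives 0.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Hypothetical contig lengths (total 290, so the halfway mark is 145).
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))
```

    A more fragmented assembly of the same total size has a smaller N50, so a larger share of reads would be expected to span contig ends, which may lower the number that align.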

  18. FootPrinter3: phylogenetic footprinting in partially alignable sequences.

    PubMed

    Fang, Fei; Blanchette, Mathieu

    2006-07-01

    FootPrinter3 is a web server for predicting transcription factor binding sites by using phylogenetic footprinting. Until now, phylogenetic footprinting approaches have been based either on multiple alignment analysis (e.g. PhyloVista, PhastCons), or on motif-discovery algorithms (e.g. FootPrinter2). FootPrinter3 integrates these two approaches, making use of local multiple sequence alignment blocks when those are available and reliable, but also allowing finding motifs in unalignable regions. The result is a set of predictions that joins the advantages of alignment-based methods (good specificity) to those of motif-based methods (good sensitivity, even in the presence of highly diverged species). FootPrinter3 is thus a tool of choice to exploit the wealth of vertebrate genomes being sequenced, as it allows taking full advantage of the sequences of highly diverged species (e.g. chicken, zebrafish), as well as those of more closely related species (e.g. mammals). The FootPrinter3 web server is available at: http://www.mcb.mcgill.ca/~blanchem/FootPrinter3.

  19. Exploring Dance Movement Data Using Sequence Alignment Methods

    PubMed Central

    Chavoshi, Seyed Hossein; De Baets, Bernard; Neutens, Tijs; De Tré, Guy; Van de Weghe, Nico

    2015-01-01

    Despite the abundance of research on knowledge discovery from moving object databases, only a limited number of studies have examined the interaction between moving point objects in space over time. This paper describes a novel approach for measuring similarity in the interaction between moving objects. The proposed approach consists of three steps. First, we transform movement data into sequences of successive qualitative relations based on the Qualitative Trajectory Calculus (QTC). Second, sequence alignment methods are applied to measure the similarity between movement sequences. Finally, movement sequences are grouped based on similarity by means of an agglomerative hierarchical clustering method. The applicability of this approach is tested using movement data from samba and tango dancers. PMID:26181435

  20. MACSIMS: multiple alignment of complete sequences information management system

    PubMed Central

    Thompson, Julie D; Muller, Arnaud; Waterhouse, Andrew; Procter, Jim; Barton, Geoffrey J; Plewniak, Frédéric; Poch, Olivier

    2006-01-01

    Background: In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family. Results: MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences with these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIMS results, while a graphical display using the JalView program allows manual analysis. Conclusion: MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at . PMID:16792820

  1. Complete nucleotide sequence of the temperate bacteriophage LBR48, a new member of the family Myoviridae.

    PubMed

    Jang, Se Hwan; Yoon, Bo Hyun; Chang, Hyo Ihl

    2011-02-01

    The complete genomic sequence of LBR48, a temperate bacteriophage induced from a lysogenic strain of Lactobacillus brevis, was found to be 48,211 nucleotides long and to contain 90 putative open reading frames. Based on structural characteristics obtained from microscopic analysis and nucleic acid sequence determination, phage LBR48 can be classified as a member of the family Myoviridae. Analysis of the genome showed the conserved gene order of previously reported phages of the family Siphoviridae from lactic acid bacteria, despite low nucleotide sequence similarity. Analysis of the attachment sites revealed 15-nucleotide-long core sequences. PMID:20976608

  2. Identification and nucleotide sequence of the glycoprotein gB gene of equine herpesvirus 4.

    PubMed

    Riggio, M P; Cullinane, A A; Onions, D E

    1989-03-01

    The nucleotide sequence of the glycoprotein gB gene of equine herpesvirus 4 (EHV-4) was determined. The gene was located within a BamHI genomic library by a combination of Southern and dot-blot hybridization with probes derived from the herpes simplex virus type 1 (HSV-1) gB DNA sequence. The predominant portion of the coding sequences was mapped to a 2.95-kilobase BamHI-EcoRI subfragment at the left-hand end of BamHI-C. Potential TATA box, CAT box, and mRNA start site sequences and the translational initiation codon were located in the BamHI M fragment of the virus, which is located immediately to the left of BamHI-C. A polyadenylation signal, AATAAA, occurs nine nucleotides past the chain termination codon. Translation of these sequences would give a 110-kilodalton protein possessing a 5' hydrophobic signal sequence, a hydrophilic surface domain containing 11 potential N-linked glycosylation sites, a hydrophobic transmembrane domain, and a 3' highly charged cytoplasmic domain. A potential internal proteolytic cleavage site, Arg-Arg/Ser, was identified at residues 459 to 461. Analysis of this protein revealed amino acid sequence homologies of 47% with HSV-1 gB, 54% with pseudorabies virus gpII, 51% with varicella-zoster virus gpII, 29% with human cytomegalovirus gB, and 30% with Epstein-Barr virus gB. Alignment of EHV-4 gB with HSV-1 (KOS) gB further revealed that four potential N-linked glycosylation sites and all 10 cysteine residues on the external surface of the molecules are perfectly conserved, suggesting that the proteins possess similar secondary and tertiary structures. Thus, we showed that EHV-4 gB is highly conserved with the gB and gpII glycoproteins of other herpesviruses, suggesting that this glycoprotein has a similar overall function in each virus. PMID:2915378

  3. Extracting protein alignment models from the sequence database.

    PubMed Central

    Neuwald, A F; Liu, J S; Lipman, D J; Lawrence, C E

    1997-01-01

    Biologists often gain structural and functional insights into a protein sequence by constructing a multiple alignment model of the family. Here a program called Probe fully automates this process of model construction starting from a single sequence. Central to this program is a powerful new method to locate and align only those, often subtly, conserved patterns essential to the family as a whole. When applied to randomly chosen proteins, Probe found on average about four times as many relationships as a pairwise search and yielded many new discoveries. These include: an obscure subfamily of globins in the roundworm Caenorhabditis elegans; two new superfamilies of metallohydrolases; a lipoyl/biotin swinging arm domain in bacterial membrane fusion proteins; and a DH domain in the yeast Bud3 and Fus2 proteins. By identifying distant relationships and merging families into superfamilies in this way, this analysis further confirms the notion that proteins evolved from relatively few ancient sequences. Moreover, this method automatically generates models of these ancient conserved regions for rapid and sensitive screening of sequences. PMID:9108146

  4. Genome-wide synteny through highly sensitive sequence alignment: Satsuma

    PubMed Central

    Grabherr, Manfred G.; Russell, Pamela; Meyer, Miriah; Mauceli, Evan; Alföldi, Jessica; Di Palma, Federica; Lindblad-Toh, Kerstin

    2010-01-01

    Motivation: Comparative genomics heavily relies on alignments of large and often complex DNA sequences. From an engineering perspective, the problem here is to provide maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accommodate the billions of base pairs of vertebrate genomes). Results: Satsuma addresses all three issues through novel strategies: (i) cross-correlation, implemented via fast Fourier transform; (ii) a match scoring scheme that eliminates almost all false hits; and (iii) an asynchronous ‘battleship’-like search that allows for aligning two entire fish genomes (470 and 217 Mb) in 120 CPU hours using 15 processors on a single machine. Availability: Satsuma is part of the Spines software package, implemented in C++ on Linux. The latest version of Spines can be freely downloaded under the LGPL license from http://www.broadinstitute.org/science/programs/genome-biology/spines/ Contact: grabherr@broadinstitute.org PMID:20208069

  5. Nucleotide sequence of HS-beta satellite DNA from kangaroo rat Dipodomys ordii.

    PubMed

    Fry, K; Poon, R; Whitcome, P; Idriss, J; Salser, W; Mazrimas, J; Hatch, F

    1973-09-01

    The sequence of the highly repetitive satellite HS-beta DNA fraction from kangaroo rat Dipodomys ordii was determined independently by RNA and DNA sequencing techniques. A basic iterated sequence of 10 nucleotides with several mutational variations was found. Base-composition data are consistent with the proposed sequence and revealed a high content of 5-methylcytosine. DNA and RNA sequencing techniques used gave identical results, showing that the fidelity of synthesis of riboguanidine-substituted DNA under our conditions is adequate for nucleotide sequence studies.

  6. Complete nucleotide sequence of a 16S ribosomal RNA gene from Escherichia coli.

    PubMed Central

    Brosius, J; Palmer, M L; Kennedy, P J; Noller, H F

    1978-01-01

    The complete nucleotide sequence of the 16S RNA gene from the rrnB cistron of Escherichia coli has been determined by using three rapid DNA sequencing methods. Nearly all of the structure has been confirmed by two to six independent sequence determinations on both DNA strands. The length of the 16S rRNA chain inferred from the DNA sequence is 1541 nucleotides, in close agreement with previous estimates. We note discrepancies between this sequence and the most recent version of it reported from direct RNA sequencing [Ehresmann, C., Stiegler, P., Carbon, P. & Ebel, J.P. (1977) FEBS Lett. 84, 337-341]. A few of these may be explained by heterogeneity among 16S rRNA sequences from different cistrons. No nucleotide sequences were found in the 16S rRNA gene that cannot be reconciled with RNase digestion products of mature 16S rRNA. PMID:368799

  7. Multiple sequence alignment with the Clustal series of programs.

    PubMed

    Chenna, Ramu; Sugawara, Hideaki; Koike, Tadashi; Lopez, Rodrigo; Gibson, Toby J; Higgins, Desmond G; Thompson, Julie D

    2003-07-01

    The Clustal series of programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing range numbers and faster tree calculation. Although Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).

  8. Implied alignment: a synapomorphy-based multiple-sequence alignment method and its use in cladogram search

    NASA Technical Reports Server (NTRS)

    Wheeler, Ward C.

    2003-01-01

    A method to align sequence data based on parsimonious synapomorphy schemes generated by direct optimization (DO; earlier termed optimization alignment) is proposed. DO directly diagnoses sequence data on cladograms without an intervening multiple-alignment step, thereby creating topology-specific, dynamic homology statements. Hence, no multiple-alignment is required to generate cladograms. Unlike general and globally optimal multiple-alignment procedures, the method described here, implied alignment (IA), takes these dynamic homologies and traces them back through a single cladogram, linking the unaligned sequence positions in the terminal taxa via DO transformation series. These "lines of correspondence" link ancestor-descendent states and, when displayed as linearly arrayed columns without hypothetical ancestors, are largely indistinguishable from standard multiple alignment. Since this method is based on synapomorphy, the treatment of certain classes of insertion-deletion (indel) events may be different from that of other alignment procedures. As with all alignment methods, results are dependent on parameter assumptions such as indel cost and transversion:transition ratios. Such an IA could be used as a basis for phylogenetic search, but this would be questionable since the homologies derived from the implied alignment depend on its natal cladogram and any variance, between DO and IA + Search, due to heuristic approach. The utility of this procedure in heuristic cladogram searches using DO and the improvement of heuristic cladogram cost calculations are discussed. © 2003 The Willi Hennig Society. Published by Elsevier Science (USA). All rights reserved.

  9. Diversity of preferred nucleotide sequences around the translation initiation codon in eukaryote genomes.

    PubMed

    Nakagawa, So; Niimura, Yoshihito; Gojobori, Takashi; Tanaka, Hiroshi; Miura, Kin-ichiro

    2008-02-01

    Understanding regulatory mechanisms of protein synthesis in eukaryotes is essential for the accurate annotation of genome sequences. Kozak reported that the nucleotide sequence GCCGCC(A/G)CCAUGG (AUG is the initiation codon) was frequently observed in vertebrate genes and that this 'consensus' sequence enhanced translation initiation. However, later studies using invertebrate, fungal and plant genes reported different 'consensus' sequences. In this study, we conducted extensive comparative analyses of nucleotide sequences around the initiation codon by using genomic data from 47 eukaryote species including animals, fungi, plants and protists. The analyses revealed that preferred nucleotide sequences are quite diverse among different species, but differences between patterns of nucleotide bias roughly reflect the evolutionary relationships of the species. We also found strong biases of A/G at position -3, A/C at position -2 and C at position +5 that were commonly observed in all species examined. Genes with higher expression levels showed stronger signals, suggesting that these nucleotides are responsible for the regulation of translation initiation. The diversity of preferred nucleotide sequences around the initiation codon might be explained by differences in relative contributions from two distinct patterns, GCCGCCAUG and AAAAAAAUG, which implies the presence of multiple molecular mechanisms for controlling translation initiation.
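The positional biases described above can be checked on a single transcript with a short sketch. It assumes the usual numbering in which the A of AUG is position +1 and there is no position 0. The function names and the example mRNA are hypothetical and only illustrate the bookkeeping, not the authors' analysis pipeline.

```python
import re

# Kozak's vertebrate consensus as quoted in the abstract: GCCGCC(A/G)CCAUGG
KOZAK = re.compile(r"GCCGCC[AG]CCAUGG")

def initiation_context(mrna, aug_index):
    """Return the nucleotides at positions -3, -2 and +5 relative to the A of AUG (+1)."""
    if mrna[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    if aug_index < 3 or aug_index + 4 >= len(mrna):
        raise ValueError("not enough flanking sequence")
    return {
        -3: mrna[aug_index - 3],
        -2: mrna[aug_index - 2],
        5: mrna[aug_index + 4],  # +1 is aug_index, so +5 is aug_index + 4
    }

def matches_common_biases(context):
    """True if the context shows A/G at -3, A/C at -2 and C at +5."""
    return context[-3] in "AG" and context[-2] in "AC" and context[5] == "C"

mrna = "GGGCCGCCACCAUGGCGUAA"  # made-up example sequence
hit = KOZAK.search(mrna)
context = initiation_context(mrna, mrna.find("AUG"))
```

Here `hit` locates the full consensus, while `matches_common_biases(context)` tests only the three positions that the study reports as shared across all species examined.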

  10. Finding the right coverage: the impact of coverage and sequence quality on single nucleotide polymorphism genotyping error rates.

    PubMed

    Fountain, Emily D; Pauli, Jonathan N; Reid, Brendan N; Palsbøll, Per J; Peery, M Zachariah

    2016-07-01

    Restriction-enzyme-based sequencing methods enable the genotyping of thousands of single nucleotide polymorphism (SNP) loci in nonmodel organisms. However, in contrast to traditional genetic markers, genotyping error rates in SNPs derived from restriction-enzyme-based methods remain largely unknown. Here, we estimated genotyping error rates in SNPs genotyped with double digest RAD sequencing from Mendelian incompatibilities in known mother-offspring dyads of Hoffman's two-toed sloth (Choloepus hoffmanni) across a range of coverage and sequence quality criteria, for both reference-aligned and de novo-assembled data sets. Genotyping error rates were more sensitive to coverage than sequence quality and low coverage yielded high error rates, particularly in de novo-assembled data sets. For example, coverage ≥5 yielded median genotyping error rates of ≥0.03 and ≥0.11 in reference-aligned and de novo-assembled data sets, respectively. Genotyping error rates declined to ≤0.01 in reference-aligned data sets with a coverage ≥30, but remained ≥0.04 in the de novo-assembled data sets. We observed approximately 10- and 13-fold declines in the number of loci sampled in the reference-aligned and de novo-assembled data sets when coverage was increased from ≥5 to ≥30 at quality score ≥30, respectively. Finally, we assessed the effects of genotyping coverage on a common population genetic application, parentage assignments, and showed that the proportion of incorrectly assigned maternities was relatively high at low coverage. Overall, our results suggest that the trade-off between sample size and genotyping error rates be considered prior to building sequencing libraries, that reporting genotyping error rates become standard practice, and that effects of genotyping errors on inference be evaluated in restriction-enzyme-based SNP studies.
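The notion of a Mendelian incompatibility in a mother-offspring dyad can be made concrete with a minimal sketch. It assumes biallelic genotypes written as two-character strings and treats a locus as incompatible when the offspring shares no allele with the mother. The abstract does not give the study's exact estimator, so the rate computed here is only an illustrative proxy, and the function names and data are hypothetical.

```python
def is_incompatible(mother, offspring):
    """True if the offspring genotype shares no allele with the mother's genotype."""
    return not (set(mother) & set(offspring))

def incompatibility_rate(dyad_calls, min_coverage):
    """Fraction of incompatible loci among those passing the coverage filter.

    dyad_calls: iterable of (mother_gt, offspring_gt, mother_cov, offspring_cov).
    Returns None when no locus passes the filter.
    """
    kept = [
        (m, o) for m, o, m_cov, o_cov in dyad_calls
        if min(m_cov, o_cov) >= min_coverage
    ]
    if not kept:
        return None
    bad = sum(is_incompatible(m, o) for m, o in kept)
    return bad / len(kept)

# Made-up calls for one dyad: (mother, offspring, mother coverage, offspring coverage)
calls = [
    ("AA", "AG", 40, 35),
    ("AA", "GG", 6, 5),
    ("AG", "GG", 31, 50),
    ("CC", "TT", 8, 12),
    ("CT", "CC", 5, 9),
]
low = incompatibility_rate(calls, 5)
high = incompatibility_rate(calls, 30)
```

Raising `min_coverage` removes the low-coverage loci where the incompatibilities sit in this toy data, but it also shrinks the number of loci retained, which mirrors the trade-off the abstract describes.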

  11. MSA-PAD: DNA multiple sequence alignment framework based on PFAM accessed domain information.

    PubMed

    Balech, Bachir; Vicario, Saverio; Donvito, Giacinto; Monaco, Alfonso; Notarangelo, Pasquale; Pesole, Graziano

    2015-08-01

    Here we present the MSA-PAD application, a DNA multiple sequence alignment framework that uses PFAM protein domain information to align DNA sequences encoding either single or multiple protein domains. MSA-PAD has two alignment options: gene and genome mode. PMID:25819080

  12. Using reconfigurable hardware to accelerate multiple sequence alignment with ClustalW.

    PubMed

    Oliver, Tim; Schmidt, Bertil; Nathan, Darran; Clemens, Ralf; Maskell, Douglas

    2005-08-15

    Aligning hundreds of sequences using progressive alignment tools such as ClustalW requires several hours on state-of-the-art workstations. We present a new approach to compute multiple sequence alignments in far shorter time using reconfigurable hardware. This results in an implementation of ClustalW with significant runtime savings on a standard off-the-shelf FPGA.

  13. Cloning and nucleotide sequence of the aroA gene of Bordetella pertussis.

    PubMed Central

    Maskell, D J; Morrissey, P; Dougan, G

    1988-01-01

    The aroA locus of Bordetella pertussis, encoding 5-enolpyruvylshikimate 3-phosphate synthase, has been cloned into Escherichia coli by using a cosmid vector. The gene is expressed in E. coli and complemented an E. coli aroA mutant. The nucleotide sequence of the B. pertussis aroA gene was determined and contains an open reading frame encoding 442 amino acids, with a calculated molecular weight for 5-enolpyruvylshikimate 3-phosphate synthase of 46,688. The amino acid sequence derived from the nucleotide sequence shows homology with the published amino acid sequences of aroA gene products of other microorganisms. PMID:2897356

  14. Evaluation of intra- and interspecific divergence of satellite DNA sequences by nucleotide frequency calculation and pairwise sequence comparison

    PubMed Central

    2003-01-01

    Satellite DNA sequences are known to be highly variable and to have been subjected to concerted evolution that homogenizes member sequences within species. We have analyzed the mode of evolution of satellite DNA sequences in four fishes from the genus Diplodus by calculating the nucleotide frequency of the sequence array and the phylogenetic distances between member sequences. Calculation of nucleotide frequency and pairwise sequence comparison enabled us to characterize the divergence among member sequences in this satellite DNA family. The results suggest that the evolutionary rate of satellite DNA in D. bellottii is about two-fold greater than the average of the other three fishes, and that the sequence homogenization event occurred in D. puntazzo more recently than in the others. The procedures described here are effective for characterizing the mode of evolution of satellite DNA. PMID:12734555

  15. Evaluation of intra- and interspecific divergence of satellite DNA sequences by nucleotide frequency calculation and pairwise sequence comparison.

    PubMed

    Kato, Mikio

    2003-01-01

    Satellite DNA sequences are known to be highly variable and to have been subjected to concerted evolution that homogenizes member sequences within species. We have analyzed the mode of evolution of satellite DNA sequences in four fishes from the genus Diplodus by calculating the nucleotide frequency of the sequence array and the phylogenetic distances between member sequences. Calculation of nucleotide frequency and pairwise sequence comparison enabled us to characterize the divergence among member sequences in this satellite DNA family. The results suggest that the evolutionary rate of satellite DNA in D. bellottii is about two-fold greater than the average of the other three fishes, and that the sequence homogenization event occurred in D. puntazzo more recently than in the others. The procedures described here are effective for characterizing the mode of evolution of satellite DNA. PMID:12734555
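The two calculations named in this record, nucleotide frequency and pairwise sequence comparison, can be sketched as follows. The sketch assumes the member sequences are already aligned to equal length and uses the simple proportion of differing sites (p-distance) as a stand-in for the phylogenetic distance, which the abstract does not specify. Function names and sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations

def column_frequencies(aligned):
    """Per-column character frequencies for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    freqs = []
    for column in zip(*aligned):
        counts = Counter(column)
        freqs.append({base: counts[base] / len(column) for base in counts})
    return freqs

def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def pairwise_distances(aligned):
    """p-distance for every pair of sequences, keyed by their index pair."""
    return {
        (i, j): p_distance(aligned[i], aligned[j])
        for i, j in combinations(range(len(aligned)), 2)
    }

members = ["ACGT", "ACGA", "AC-T"]  # made-up aligned repeat units
freqs = column_frequencies(members)
dists = pairwise_distances(members)
```

Columns where one nucleotide dominates indicate homogenized positions, while the spread of pairwise distances gives a rough view of divergence among the member sequences.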

  16. PROMALS3D web server for accurate multiple protein sequence and structure alignments.

    PubMed

    Pei, Jimin; Tang, Ming; Grishin, Nick V

    2008-07-01

    Multiple sequence alignments are essential in computational sequence and structural analysis, with applications in homology detection, structure modeling, function prediction and phylogenetic analysis. We report PROMALS3D web server for constructing alignments for multiple protein sequences and/or structures using information from available 3D structures, database homologs and predicted secondary structures. PROMALS3D shows higher alignment accuracy than a number of other advanced methods. Input of PROMALS3D web server can be FASTA format protein sequences, PDB format protein structures and/or user-defined alignment constraints. The output page provides alignments with several formats, including a colored alignment augmented with useful information about sequence grouping, predicted secondary structures and consensus sequences. Intermediate results of sequence and structural database searches are also available. The PROMALS3D web server is available at: http://prodata.swmed.edu/promals3d/. PMID:18503087

  17. Multiple sequence alignment based on combining genetic algorithm with chaotic sequences.

    PubMed

    Gao, C; Wang, B; Zhou, C J; Zhang, Q

    2016-01-01

    In bioinformatics, sequence alignment is one of the most common problems. Multiple sequence alignment is an NP (nondeterministic polynomial time) problem, which requires further study and exploration. The chaos optimization algorithm is a type of chaos theory, and a procedure for combining the genetic algorithm (GA), which uses ergodicity, and inherent randomness of chaotic iteration. It is an efficient method to solve the basic premature phenomenon of the GA. Applying the Logistic map to the GA and using chaotic sequences to carry out the chaotic perturbation can improve the convergence of the basic GA. In addition, the random tournament selection and optimal preservation strategy are used in the GA. Experimental evidence indicates good results for this process. PMID:27420977
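As a rough illustration of the chaotic component mentioned above, the Logistic map x(n+1) = mu * x(n) * (1 - x(n)) can be iterated to produce a chaotic sequence, which then drives a perturbation of a candidate value. The perturbation rule, the parameter names and the choice mu = 4 are assumptions made for this sketch, since the abstract does not describe how the authors couple the map to the GA.

```python
def logistic_sequence(x0, length, mu=4.0):
    """Iterate the Logistic map x -> mu * x * (1 - x) starting from x0 in (0, 1)."""
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie strictly between 0 and 1")
    values = []
    x = x0
    for _ in range(length):
        x = mu * x * (1.0 - x)
        values.append(x)
    return values

def chaotic_perturb(value, chaos, strength, low, high):
    """Shift value by a chaos-driven step (hypothetical rule) and clamp to [low, high]."""
    step = strength * (2.0 * chaos - 1.0) * (high - low)
    return min(high, max(low, value + step))

chaos_values = logistic_sequence(0.3, 5)
perturbed = [chaotic_perturb(5.0, c, 0.1, 0.0, 10.0) for c in chaos_values]
```

With mu = 4 the iterates stay in [0, 1] and wander over that interval for most starting points, which is the property a chaotic perturbation relies on to help a search escape premature convergence. Starting points such as 0.5 should be avoided, because the map sends 0.5 to 1 and then to the fixed point 0.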

  18. Multiple sequence alignment based on combining genetic algorithm with chaotic sequences.

    PubMed

    Gao, C; Wang, B; Zhou, C J; Zhang, Q

    2016-06-24

    In bioinformatics, sequence alignment is one of the most common problems. Multiple sequence alignment is an NP (nondeterministic polynomial time) problem, which requires further study and exploration. The chaos optimization algorithm is a type of chaos theory, and a procedure for combining the genetic algorithm (GA), which uses ergodicity, and inherent randomness of chaotic iteration. It is an efficient method to solve the basic premature phenomenon of the GA. Applying the Logistic map to the GA and using chaotic sequences to carry out the chaotic perturbation can improve the convergence of the basic GA. In addition, the random tournament selection and optimal preservation strategy are used in the GA. Experimental evidence indicates good results for this process.

  19. Complete nucleotide sequences of two adjacent early vaccinia virus genes located within the inverted terminal repetition.

    PubMed

    Venkatesan, S; Gershowitz, A; Moss, B

    1982-11-01

    The proximal part of the 10,000-base pair (bp) inverted terminal repetition of vaccinia virus DNA encodes at least three early mRNAs. A 2,236-bp segment of the repetition was sequenced to characterize two of the genes. This task was facilitated by constructing a series of recombinants containing overlapping deletions; oligonucleotide linkers with synthetic restriction sites provided points for radioactive labeling before sequencing by the chemical degradation method of Maxam and Gilbert (Methods Enzymol. 65:499-560, 1980). The ends of the transcripts were mapped by hybridizing labeled DNA fragments to early viral RNA and resolving nuclease S1-protected fragments in sequencing gels, by sequencing cDNA clones, and from the lengths of the RNAs. The nucleotide sequences for at least 60 bp upstream of both transcriptional initiation sites are more than 80% adenine·thymine rich and contain long runs of adenines and thymines with some homology to procaryotic and eucaryotic consensus sequences. The gene transcribed in the rightward direction encodes an RNA of approximately 530 nucleotides with a single open reading frame of 420 nucleotides. Preceding the first AUG, there is a heptanucleotide that can hybridize to the 3' end of 18S rRNA with only one mismatch. The derived amino acid sequence of the protein indicated a molecular weight of 15,500. The gene transcribed in the leftward direction encodes an RNA 1,000 to 1,100 nucleotides long with an open reading frame of 996 nucleotides and a leader sequence of only 5 to 6 nucleotides. The derived amino acid sequence of this protein indicated a molecular weight of 38,500. The 3' ends of the two transcripts were located within 100 bp of each other. Although there are adenine·thymine-rich clusters near the putative transcriptional termination sites, specific AATAAA polyadenylic acid signal sequences are absent.

  20. Nucleotide sequence of 3' untranslated portion of human alpha globin mRNA.

    PubMed Central

    Wilson, J T; deRiel, J K; Forget, B G; Marotta, C A; Weissman, S M

    1977-01-01

    We have determined the nucleotide sequence of 75 nucleotides of the 3'-untranslated portion of normal human alpha globin mRNA which corresponds to the elongated amino acid sequence of the chain termination mutant Hb Constant Spring. This was accomplished by sequence analysis of cDNA fragments obtained by restriction endonuclease or T4 endonuclease IV cleavage of human globin cDNA synthesized from globin mRNA by use of viral reverse transcriptase. Analysis of cRNA synthesized from cDNA by use of RNA polymerase provided additional confirmatory sequence information. Possible polymorphism has been identified at one site of the sequence. Our sequence overlaps with, and extends, the sequence of 43 nucleotides determined by Proudfoot and coworkers for the very 3'-terminal portion of human alpha globin mRNA. The complete 3'-untranslated sequence of human alpha globin mRNA (112 nucleotides including termination codon) shows little homology to that of the human or rabbit beta globin mRNAs except for the presence of the hexanucleotide sequence AAUAAA which is found in most eukaryotic mRNAs near the 3'-terminal poly(A). PMID:909779

  1. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and 3-dimensional structural information

    PubMed Central

    Pei, Jimin; Grishin, Nick V.

    2015-01-01

    SUMMARY Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of 3-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D web server and package are available at http://prodata.swmed.edu/PROMALS3D. PMID:24170408

  2. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and three-dimensional structural information.

    PubMed

    Pei, Jimin; Grishin, Nick V

    2014-01-01

    Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of three-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D Web server and package are available at http://prodata.swmed.edu/PROMALS3D. PMID:24170408

  3. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and three-dimensional structural information.

    PubMed

    Pei, Jimin; Grishin, Nick V

    2014-01-01

    Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of three-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D Web server and package are available at http://prodata.swmed.edu/PROMALS3D.

  4. Fast single-pass alignment and variant calling using sequencing data

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Sequencing research requires efficient computation. Few programs use already known information about DNA variants when aligning sequence data to the reference map. New program findmap.f90 reads the previous variant list before aligning sequence, calling variant alleles, and summing the allele counts...

  5. The nucleotide sequences of 5S rRNAs from three ciliated protozoa.

    PubMed Central

    Kumazaki, T; Hori, H; Osawa, S; Mita, T; Higashinakagawa, T

    1982-01-01

    The nucleotide sequences of 5S rRNAs from three ciliated protozoa, Paramecium tetraurelia, Tetrahymena thermophila and Blepharisma japonicum have been determined. All of them are 120 nucleotides long and the sequence of the probable tRNA binding site at positions 41-44 is GAAC, which is characteristic of the plant 5S rRNAs. The sequence similarities are 87% (Paramecium/Tetrahymena), 86% (Paramecium/Blepharisma) and 79% (Tetrahymena/Blepharisma), suggesting a close relationship of these three ciliates. PMID:7122243

  6. The nucleotide sequence of 5S rRNA from a cellular slime mold Dictyostelium discoideum.

    PubMed

    Hori, H; Osawa, S; Iwabuchi, M

    1980-12-11

    The nucleotide sequence of ribosomal 5S rRNA from a cellular slime mold Dictyostelium discoideum is GUAUACGGCCAUACUAGGUUGGAAACACAUCAUCCCGUUCGAUCUGAUAAGUAAAUCGACCUCAGGCCUUCCAAGUACUCUGGUUGGAGACAACAGGGGAACAUAGGGUGCUGUAUACU. A model for the secondary structure of this 5S rRNA is proposed. The sequence is more similar to those of animals (62% similarity on average) than to those of yeasts (56%).

  7. Nucleotide sequences of 5S rRNAs from four jellyfishes.

    PubMed

    Hori, H; Ohama, T; Kumazaki, T; Osawa, S

    1982-11-25

    The nucleotide sequences of 5S rRNAs from four jellyfishes, Spirocodon saltatrix, Nemopsis dofleini, Aurelia aurita and Chrysaora quinquecirrha have been determined. The sequences are highly similar to each other. A fairly high similarity was also found between these jellyfishes and a sea anemone, Anthopleura japonica.

  8. Nucleotide sequences of 5S rRNAs from four jellyfishes.

    PubMed

    Hori, H; Ohama, T; Kumazaki, T; Osawa, S

    1982-11-25

    The nucleotide sequences of 5S rRNAs from four jellyfishes, Spirocodon saltatrix, Nemopsis dofleini, Aurelia aurita and Chrysaora quinquecirrha have been determined. The sequences are highly similar to each other. A fairly high similarity was also found between these jellyfishes and a sea anemone, Anthopleura japonica. PMID:6130512

  9. Diverse nucleotide compositions and sequence fluctuation in Rubisco protein genes

    NASA Astrophysics Data System (ADS)

    Holden, Todd; Dehipawala, S.; Cheung, E.; Bienaime, R.; Ye, J.; Tremberger, G., Jr.; Schneider, P.; Lieberman, D.; Cheung, T.

    2011-10-01

    The Rubisco protein-enzyme is arguably the most abundant protein on Earth. The biology dogma of transcription and translation necessitates the study of the Rubisco genes and Rubisco-like genes in various species. Stronger correlation of fractal dimension of the atomic number fluctuation along a DNA sequence with Shannon entropy has been observed in the studied Rubisco-like gene sequences, suggesting a more diverse evolutionary pressure and constraints in the Rubisco sequences. The strategy of using metal for structural stabilization appears to be an ancient mechanism, with data from the porphobilinogen deaminase gene in Capsaspora owczarzaki and Monosiga brevicollis. Using the chi-square distance probability, our analysis supports the conjecture that the more ancient Rubisco-like sequence in Microcystis aeruginosa would have experienced very different evolutionary pressure and bio-chemical constraint as compared to Bordetella bronchiseptica, the two microbes occupying either end of the correlation graph. Our exploratory study would indicate that high fractal dimension Rubisco sequence would support high carbon dioxide rate via the Michaelis-Menten coefficient; with implication for the control of the whooping cough pathogen Bordetella bronchiseptica, a microbe containing a high fractal dimension Rubisco-like sequence (2.07). Using the internal comparison of chi-square distance probability for 16S rRNA (~ E-22) versus radiation repair Rec-A gene (~ E-05) in high GC content Deinococcus radiodurans, our analysis supports the conjecture that high GC content microbes containing Rubisco-like sequence are likely to include an extra-terrestrial origin, relative to Deinococcus radiodurans. Similar photosynthesis process that could utilize host star radiation would not compete with radiation resistant process from the biology dogma perspective in environments such as Mars and exoplanets.

  10. Phylogenetic analysis of beta-papillomaviruses as inferred from nucleotide and amino acid sequence data.

    PubMed

    Gottschling, Marc; Köhler, Anja; Stockfleth, Eggert; Nindl, Ingo

    2007-01-01

    Human papillomaviruses (HPV) of the beta-group seem to be involved in the pathogenesis of non-melanoma skin cancer. Papillomaviruses are host specific and are considered closely co-evolving with their hosts. Evolutionary incongruence between early genes and late genes has been reported among oncogenic genital alpha-papillomaviruses and considerably challenges phylogenetic reconstructions. We investigated the relationships of 29 beta-HPV (25 types plus four putative new types, subtypes, or variants) as inferred from codon aligned and amino acid sequence data of the genes E1, E2, E6, E7, L1, and L2 using likelihood, distance, and parsimony approaches. An analysis of a L1 fragment included additional nucleotide and amino acid sequences from seven non-human beta-papillomaviruses. Early genes and late genes evolution did not conflict significantly in beta-papillomaviruses based on partition homogeneity tests (p ≥ 0.001). As inferred from the complete genome analyses, beta-papillomaviruses were monophyletic and segregated into four highly supported monophyletic assemblages corresponding to the species 1, 2, 3, and fused 4/5. They basically split into the species 1 and the remainder of beta-papillomaviruses, whose species 3, 4, and 5 constituted the sistergroup of species 2. beta-Papillomaviruses have been isolated from humans, apes, and monkeys, and phylogenetic analyses of the L1 fragment showed non-human papillomaviruses highly polyphyletic nesting within the HPV species. Thus, host and virus phylogenies were not congruent in beta-papillomaviruses, and multiple invasions across species borders may contribute (additionally to host-linked evolution) to their diversification.

  11. Antibody-specific model of amino acid substitution for immunological inferences from alignments of antibody sequences.

    PubMed

    Mirsky, Alexander; Kazandjian, Linda; Anisimova, Maria

    2015-03-01

    Antibodies are glycoproteins produced by the immune system as a dynamically adaptive line of defense against invading pathogens. Very elegant and specific mutational mechanisms allow B lymphocytes to produce a large and diversified repertoire of antibodies, which is modified and enhanced throughout all adulthood. One of these mechanisms is somatic hypermutation, which stochastically mutates nucleotides in the antibody genes, forming new sequences with different properties and, eventually, higher affinity and selectivity to the pathogenic target. As somatic hypermutation involves fast mutation of antibody sequences, this process can be described using a Markov substitution model of molecular evolution. Here, using large sets of antibody sequences from mice and humans, we infer an empirical amino acid substitution model AB, which is specific to antibody sequences. Compared with existing general amino acid models, we show that the AB model provides significantly better description for the somatic evolution of mice and human antibody sequences, as demonstrated on large next generation sequencing (NGS) antibody data. General amino acid models are reflective of conservation at the protein level due to functional constraints, with most frequent amino acids exchanges taking place between residues with the same or similar physicochemical properties. In contrast, within the variable part of antibody sequences we observed an elevated frequency of exchanges between amino acids with distinct physicochemical properties. This is indicative of a sui generis mutational mechanism, specific to antibody somatic hypermutation. We illustrate this property of antibody sequences by a comparative analysis of the network modularity implied by the AB model and general amino acid substitution models. We recommend using the new model for computational studies of antibody sequence maturation, including inference of alignments and phylogenetic trees describing antibody somatic hypermutation in

  12. Score distributions of gapped multiple sequence alignments down to the low-probability tail

    NASA Astrophysics Data System (ADS)

    Fieth, Pascal; Hartmann, Alexander K.

    2016-08-01

    Assessing the significance of alignment scores of optimally aligned DNA or amino acid sequences can be achieved via the knowledge of the score distribution of random sequences. But this requires obtaining the distribution in the biologically relevant high-scoring region, where the probabilities are exponentially small. For gapless local alignments of infinitely long sequences this distribution is known analytically to follow a Gumbel distribution. Distributions for gapped local alignments and global alignments of finite lengths can only be obtained numerically. To obtain results for the small-probability region, specific statistical mechanics-based rare-event algorithms can be applied. In previous studies, this was achieved for pairwise alignments. They showed that, contrary to results from previous simple sampling studies, strong deviations from the Gumbel distribution occur in case of finite sequence lengths. Here we extend the studies to multiple sequence alignments with gaps, which are much more relevant for practical applications in molecular biology. We study the distributions of scores over a large range of the support, reaching probabilities as small as 10^{-160}, for global and local (sum-of-pair scores) multiple alignments. We find that even after suitable rescaling, eliminating the sequence-length dependence, the distributions for multiple alignment differ from the pairwise alignment case. Furthermore, we also show that the previously discussed Gaussian correction to the Gumbel distribution needs to be refined, also for the case of pairwise alignments.

  13. Score distributions of gapped multiple sequence alignments down to the low-probability tail.

    PubMed

    Fieth, Pascal; Hartmann, Alexander K

    2016-08-01

    Assessing the significance of alignment scores of optimally aligned DNA or amino acid sequences can be achieved via the knowledge of the score distribution of random sequences. But this requires obtaining the distribution in the biologically relevant high-scoring region, where the probabilities are exponentially small. For gapless local alignments of infinitely long sequences this distribution is known analytically to follow a Gumbel distribution. Distributions for gapped local alignments and global alignments of finite lengths can only be obtained numerically. To obtain results for the small-probability region, specific statistical mechanics-based rare-event algorithms can be applied. In previous studies, this was achieved for pairwise alignments. They showed that, contrary to results from previous simple sampling studies, strong deviations from the Gumbel distribution occur in case of finite sequence lengths. Here we extend the studies to multiple sequence alignments with gaps, which are much more relevant for practical applications in molecular biology. We study the distributions of scores over a large range of the support, reaching probabilities as small as 10^{-160}, for global and local (sum-of-pair scores) multiple alignments. We find that even after suitable rescaling, eliminating the sequence-length dependence, the distributions for multiple alignment differ from the pairwise alignment case. Furthermore, we also show that the previously discussed Gaussian correction to the Gumbel distribution needs to be refined, also for the case of pairwise alignments. PMID:27627266

  14. B-MIC: An Ultrafast Three-Level Parallel Sequence Aligner Using MIC.

    PubMed

    Cui, Yingbo; Liao, Xiangke; Zhu, Xiaoqian; Wang, Bingqiang; Peng, Shaoliang

    2016-03-01

    Sequence alignment, the mapping of raw sequencing data to a reference genome, is the central process of sequence analysis. The large amount of data generated by NGS is far beyond the processing capabilities of existing alignment tools. Consequently, sequence alignment becomes the bottleneck of sequence analysis. Intensive computing power is required to address this challenge. Intel recently announced the MIC coprocessor, which can provide massive computing power. The Tianhe-2, now the world's fastest supercomputer, is equipped with three MIC coprocessors in each compute node. A key feature of sequence alignment is that different reads are independent. Considering this property, we proposed a MIC-oriented three-level parallelization strategy to speed up BWA, a widely used sequence alignment tool, and developed our ultrafast parallel sequence aligner: B-MIC. B-MIC contains three levels of parallelization: firstly, parallelization of data I/O and read alignment by a three-stage parallel pipeline; secondly, parallelization enabled by MIC coprocessor technology; thirdly, inter-node parallelization implemented by MPI. In this paper, we demonstrate that B-MIC outperforms BWA by a combination of those techniques, using an Inspur NF5280M server and the Tianhe-2 supercomputer. To the best of our knowledge, B-MIC is the first sequence alignment tool to run on Intel MIC, and it can achieve more than fivefold speedup over the original BWA while maintaining the alignment precision.

  15. Methods for making nucleotide probes for sequencing and synthesis

    SciTech Connect

    Church, George M; Zhang, Kun; Chou, Joseph

    2014-07-08

    Compositions and methods for making a plurality of probes for analyzing a plurality of nucleic acid samples are provided. Compositions and methods for analyzing a plurality of nucleic acid samples to obtain sequence information in each nucleic acid sample are also provided.

  16. The nucleotide sequence of Saccharomyces cerevisiae chromosome XII.

    PubMed

    Johnston, M; Hillier, L; Riles, L; Albermann, K; André, B; Ansorge, W; Benes, V; Brückner, M; Delius, H; Dubois, E; Düsterhöft, A; Entian, K D; Floeth, M; Goffeau, A; Hebling, U; Heumann, K; Heuss-Neitzel, D; Hilbert, H; Hilger, F; Kleine, K; Kötter, P; Louis, E J; Messenguy, F; Mewes, H W; Hoheisel, J D

    1997-05-29

    The yeast Saccharomyces cerevisiae is the pre-eminent organism for the study of basic functions of eukaryotic cells. All of the genes of this simple eukaryotic cell have recently been revealed by an international collaborative effort to determine the complete DNA sequence of its nuclear genome. Here we describe some of the features of chromosome XII.

  17. Nucleotide sequence of a human tRNA gene heterocluster

    SciTech Connect

    Chang, Y.N.; Pirtle, I.L.; Pirtle, R.M.

    1986-05-01

    Leucine tRNA from bovine liver was used as a hybridization probe to screen a human gene library harbored in Charon-4A of bacteriophage lambda. The human DNA inserts from plaque-pure clones were characterized by restriction endonuclease mapping and Southern hybridization techniques, using both (3'-³²P)-labeled bovine liver leucine tRNA and total tRNA as hybridization probes. An 8-kb Hind III fragment of one of these γ-clones was subcloned into the Hind III site of pBR322. Subsequent fine restriction mapping and DNA sequence analysis of this plasmid DNA indicated the presence of four tRNA genes within the 8-kb DNA fragment. A leucine tRNA gene with an anticodon of AAG and a proline tRNA gene with an anticodon of AGG are in a 1.6-kb subfragment. A threonine tRNA gene with an anticodon of UGU and an as yet unidentified tRNA gene are located in a 1.1-kb subfragment. These two different subfragments are separated by 2.8 kb. The coding regions of the three sequenced genes contain characteristic internal split promoter sequences and do not have intervening sequences. The 3'-flanking regions of these three genes have typical RNA polymerase III termination sites of at least four consecutive T residues.

  18. Nucleotide sequence conservation in paramyxoviruses; the concept of codon constellation.

    PubMed

    Rima, Bert K

    2015-05-01

    The stability and conservation of the sequences of RNA viruses in the field and the high error rates measured in vitro are paradoxical. The field stability indicates that there are very strong selective constraints on sequence diversity. The nature of these constraints is discussed. Apart from constraints on variation in cis-acting RNA and the amino acid sequences of viral proteins, there are other ones relating to the presence of specific dinucleotides such as CpG and UpA as well as the importance of RNA secondary structures and RNA degradation rates. Other constraints recently identified in other RNA viruses, such as effects of secondary RNA structure on protein folding or modification of cellular tRNA complements, are also discussed. Using the family Paramyxoviridae, I show that the codon usage pattern (CUP) is (i) specific for each virus species and (ii) markedly different from that of the host; it does not vary even in vaccine viruses that have been derived by passage in a number of inappropriate host cells. The CUP might thus be an additional constraint on variation, and I propose the concept of codon constellation to indicate the informational content of the sequences of RNA molecules relating not only to stability and structure but also to the efficiency of translation of a viral mRNA resulting from the CUP and the numbers and position of rare codons.
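
    To make the notion of a codon usage pattern concrete, here is a minimal sketch that tabulates relative codon frequencies for a single coding sequence. The function name is hypothetical, the sequence is assumed to start in frame at the first codon position, and this plain frequency table is only an illustration, not the specific CUP measure used in the study.

```python
from collections import Counter

def codon_usage(coding_seq):
    """Return {codon: relative frequency} for an in-frame coding sequence.

    Assumes the sequence starts at codon position 1; trailing bases that
    do not complete a codon are ignored.
    """
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3  # drop an incomplete final codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return {}
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}
```

    Comparing such tables between a virus and its host is one simple way to see whether the two patterns differ.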

  19. Water buffalo (Bubalus bubalis): complete nucleotide mitochondrial genome sequence.

    PubMed

    Parma, Pietro; Erra-Pujada, Marta; Feligini, Maria; Greppi, Gianfranco; Enne, Giuseppe

    2004-01-01

    In this work, we report the whole sequence of the water buffalo (Bubalus bubalis) mitochondrial genome. The water buffalo mt molecule is 16,355 base pairs long and shows a genome organization similar to those reported for other mitochondrial genomes. These new data provide a useful tool for many research areas, e.g. evolutionary studies and identification of food origin.

  20. Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost

    PubMed Central

    Yamada, Shinsuke; Gotoh, Osamu; Yamana, Hayato

    2006-01-01

    Background: Multiple sequence alignment (MSA) is a useful tool in bioinformatics. Although many MSA algorithms have been developed, there is still room for improvement in accuracy and speed. In the alignment of a family of protein sequences, global MSA algorithms perform better than local ones in many cases, while local ones perform better than global ones when some sequences have long insertions or deletions (indels) relative to others. Many recent leading MSA algorithms have incorporated pairwise alignment information obtained from a mixture of sources into their scoring system to improve accuracy of alignment containing long indels. Results: We propose a novel group-to-group sequence alignment algorithm that uses a piecewise linear gap cost. We developed a program called PRIME, which employs our proposed algorithm to optimize the well-defined sum-of-pairs score. PRIME stands for Profile-based Randomized Iteration MEthod. We evaluated PRIME and some recent MSA programs using BAliBASE version 3.0 and PREFAB version 4.0 benchmarks. The results of benchmark tests showed that PRIME can construct accurate alignments comparable to the most accurate programs currently available, including L-INS-i of MAFFT, ProbCons, and T-Coffee. Conclusion: PRIME enables users to construct accurate alignments without having to employ pairwise alignment information. PRIME is available at . PMID:17137519
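
    The idea of a piecewise linear gap cost can be sketched as follows. This is a two-segment example in which long gaps are extended more cheaply than short ones; the function name, the breakpoint, and all numeric defaults are illustrative assumptions, not the parameters or the exact functional form used by PRIME.

```python
def piecewise_linear_gap_cost(length, open_cost=10.0, ext_short=1.0,
                              ext_long=0.2, breakpoint=8):
    """Cost of a gap of `length` residues under a two-segment linear model.

    Up to `breakpoint` residues each position costs `ext_short`; every
    further position costs the smaller `ext_long`. All default values
    are hypothetical.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    short_part = min(length, breakpoint)     # residues on the steep segment
    long_part = max(length - breakpoint, 0)  # residues on the shallow segment
    return open_cost + ext_short * short_part + ext_long * long_part
```

    With these defaults a gap of 18 residues costs 20.0, whereas a single affine cost with the same opening and short-extension values would charge 28.0, so long indels are penalized less heavily.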

  1. Nucleotide sequence of the tobacco (Nicotiana tabacum) anionic peroxidase gene

    SciTech Connect

    Diaz-De-Leon, F.; Klotz, K.L.; Lagrimini, L.M.

    1993-03-01

    Peroxidases have been implicated in numerous physiological processes including lignification (Grisebach, 1981), wound-healing (Espelie et al., 1986), phenol oxidation (Lagrimini, 1991), pathogen defense (Ye et al., 1990), and the regulation of cell elongation through the formation of interchain covalent bonds between various cell wall polymers (Fry, 1986; Goldberg et al., 1986; Bradley et al., 1992). However, a complete description of peroxidase action in vivo is not available because of the vast number of potential substrates and the existence of multiple isoenzymes. The tobacco anionic peroxidase is one of the better-characterized isoenzymes. This enzyme has been shown to oxidize a number of significant plant secondary compounds in vitro including cinnamyl alcohols, phenolic acids, and indole-3-acetic acid (Maeder, 1980; Lagrimini, 1991). A cDNA encoding the enzyme has been obtained, and this enzyme was shown to be expressed at the highest levels in lignifying tissues (xylem and tracheary elements) and also in epidermal tissue (Lagrimini et al., 1987). It was shown at this time that there were four distinct copies of the anionic peroxidase gene in tobacco (Nicotiana tabacum). A tobacco genomic DNA library was constructed in the λ phage EMBL3, from which two unique peroxidase genes were sequenced. One of these clones, λPOD1, was designated as a pseudogene when the exonic sequences were found to differ from the cDNA sequences by 1%, and several frame shifts in the coding sequences indicated a dysfunctional gene (the authors' unpublished results). The other clone, λPOD3, described in this manuscript, was designated as the functional tobacco anionic peroxidase gene because of 100% homology with the cDNA. Significant structural elements include an AS-2 box indicated in shoot-specific expression (Lam and Chua, 1989), a TATA box, and two intervening sequences. 10 refs., 1 tab.

  2. Phylogenetic analysis of Brassiceae based on the nucleotide sequences of the S-locus related gene, SLR1.

    PubMed

    Inaba, Ryuichi; Nishio, Takeshi

    2002-12-01

    Nucleotide sequences of orthologs of the S-locus related gene, SLR1, in 20 species of Brassicaceae were determined and compared with the previously reported SLR1 sequences of six species. Identities of deduced amino-acid sequences with Brassica oleracea SLR1 ranged from 66.0% to 97.6%, and those with B. oleracea SRK and SLR2 were less than 62% and 55%, respectively. In multiple alignment of deduced amino-acid sequences, the 180-190th amino-acid residues from the initial methionine were highly variable, this variable region corresponding to hypervariable region I of SLG and SRK. A phylogenetic tree based on the deduced amino-acid sequences showed a close relationship of SLR1 orthologs of species in the Brassicinae and Raphaninae. Brassica nigra SLR1 was found to belong to the same clade as Sinapis arvensis and Diplotaxis siifolia, while the sequences of the other Brassica species belonged to another clade together with B. oleracea and Brassica rapa. The phylogenetic tree was similar to previously reported trees constructed using the data of electrophoretic band patterns of chloroplast DNA, though minor differences were found. Based on synonymous substitution rates in SLR1, the diversification time of SLR1 orthologs between species in the Brassicinae was estimated. The evolution and function of SLR1 and the phylogenetic relationship of Brassiceae plants are discussed.

  3. Statistical analysis of nucleotide runs in coding and noncoding DNA sequences.

    PubMed

    Sprizhitsky YuA; Nechipurenko YuD; Alexandrov, A A; Volkenstein, M V

    1988-10-01

    A statistical analysis of the occurrence of particular nucleotide runs in DNA sequences of different species has been carried out. There are considerable differences of run distributions in DNA sequences of procaryotes, invertebrates and vertebrates. There is an abundance of short runs (1-2 nucleotides long) in the coding sequences and there is a deficiency of such runs in the noncoding regions. However, some interesting exceptions from this rule exist for the run distribution of adenine in procaryotes and for the arrangement of purine-pyrimidine runs in eucaryotes. The similarity in the distributions of such runs in the coding and noncoding regions may be due to some structural features of the DNA molecule as a whole. Runs of guanine (or cytosine) of three to six nucleotides occur predominantly in noncoding DNA regions in eucaryotes, especially in vertebrates.
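
    Counting nucleotide runs of the kind analysed in this abstract can be sketched in a few lines. The function name is hypothetical and the example only tallies runs in one sequence; the statistical comparison between coding and noncoding regions described in the abstract is not reproduced here.

```python
from collections import Counter
from itertools import groupby

def run_length_counts(sequence):
    """Count maximal runs of identical nucleotides.

    Returns a Counter keyed by (nucleotide, run_length), so that
    counts[("A", 2)] is the number of maximal runs "AA".
    """
    counts = Counter()
    # groupby yields one group per maximal stretch of identical characters
    for base, group in groupby(sequence.upper()):
        counts[(base, sum(1 for _ in group))] += 1
    return counts
```

    For example, the sequence `AAGGGCAAT` contains two runs of A with length 2, one run of G with length 3, and single-nucleotide runs of C and T.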

  4. Enhanced spatio-temporal alignment of plantar pressure image sequences using B-splines.

    PubMed

    Oliveira, Francisco P M; Tavares, João Manuel R S

    2013-03-01

    This article presents an enhanced methodology to align plantar pressure image sequences simultaneously in time and space. The temporal alignment of the sequences is accomplished using B-splines in the time modeling, and the spatial alignment can be attained using several geometric transformation models. The methodology was tested on a dataset of 156 real plantar pressure image sequences (3 sequences for each foot of the 26 subjects) that was acquired using a common commercial plate during barefoot walking. In the alignment of image sequences that were synthetically deformed both in time and space, an outstanding accuracy was achieved with the cubic B-splines. This accuracy was significantly better (p < 0.001) than the one obtained using the best solution proposed in our previous work. When applied to align real image sequences with unknown transformation involved, the alignment based on cubic B-splines also achieved results superior to those of our previous methodology (p < 0.001). The consequences of the temporal alignment on the dynamic center of pressure (COP) displacement were also assessed by computing the intraclass correlation coefficients (ICC) before and after the temporal alignment of the three image sequence trials of each foot of the associated subject at six time instants. The results showed that, generally, the ICCs related to the medio-lateral COP displacement were greater when the sequences were temporally aligned than the ICCs of the original sequences. Based on the experimental findings, one can conclude that the cubic B-splines are a remarkable solution for the temporal alignment of plantar pressure image sequences. These findings also show that the temporal alignment can increase the consistency of the COP displacement on related acquired plantar pressure image sequences.

  5. Nucleotide sequence of equine caspase-1 cDNA.

    PubMed

    Wardlow, S; Penha-Goncalves, M N; Argyle, D J; Onions, D E; Nicolson, L

    1999-01-01

    Caspases are a family of cysteine proteases which have important roles in activation of cytokines and in apoptosis. Caspase-1, or interleukin-1 beta converting enzyme (ICE), promotes maturation of interleukin-1 beta (IL-1 beta) and interleukin-18 (IL-18) by proteolytic cleavage of precursor forms to generate biologically active peptides. We report the cloning and sequencing of equine caspase-1 cDNA. Equine caspase-1 is 405 amino acids in length and has 72% and 63% identity to human and mouse caspase-1, respectively, at the amino acid level. Sites of proteolytic cleavage and catalytic activity as identified in human caspase-1, are conserved. PMID:10376217

  6. The nucleotide sequence of spinach chloroplast tryptophan transfer RNA.

    PubMed Central

    Canaday, J; Guillemaut, P; Gloeckler, R; Weil, J H

    1981-01-01

    Spinach chloroplast tRNA(Trp), purified by column chromatography and two-dimensional gel electrophoresis, has been sequenced using in vitro labeling techniques. The sequence is: pG-C-G-C-U-C-U-U-A-G-U-U-C-A-G-U-U-C-Gm-G-D-A-G-A-A-C-m2G-psi-G-G-G-psi-C-U-C-A-A*-A-A-C-C-C-G-A-U-G-N-C-G-U-A-G-G-T-psi-C-A-A-G-U-C-C-U-A-C-A-G-A-G-C-G-U-G-C-C-A(OH). Like the E. coli suppressor tRNA psu+UGA which translates both the opal terminator codon U-G-A and the tryptophan codon U-G-G, spinach chloroplast tRNA(Trp) has C-C-A as an anticodon and contains an A-U pair in the D-stem. PMID:6907845

  7. Complete nucleotide sequence and transcriptional analysis of snakehead fish retrovirus.

    PubMed Central

    Hart, D; Frerichs, G N; Rambaut, A; Onions, D E

    1996-01-01

    The complete genome of the snakehead fish retrovirus has been cloned and sequenced, and its transcriptional profile in cell culture has been determined. The 11.2-kb provirus displays a complex expression pattern capable of encoding accessory proteins and is unique in the predicted location of the env initiation codon and signal peptide upstream of gag and the common splice donor site. The virus is distinguishable from all known retrovirus groups by the presence of an arginine tRNA primer binding site. The coding regions are highly divergent and show a number of unusual characteristics, including a large Gag coiled-coil region, a Pol domain of unknown function, and a long, lentiviral-like, Env cytoplasmic domain. Phylogenetic analysis of the Pol sequence emphasizes the divergent nature of the virus from the avian and mammalian retroviruses. The snakehead virus is also distinct from a previously characterized complex fish retrovirus, suggesting that discrete groups of these viruses have yet to be identified in the lower vertebrates. PMID:8648695

  8. Complete nucleotide sequence and transcriptional analysis of snakehead fish retrovirus.

    PubMed

    Hart, D; Frerichs, G N; Rambaut, A; Onions, D E

    1996-06-01

    The complete genome of the snakehead fish retrovirus has been cloned and sequenced, and its transcriptional profile in cell culture has been determined. The 11.2-kb provirus displays a complex expression pattern capable of encoding accessory proteins and is unique in the predicted location of the env initiation codon and signal peptide upstream of gag and the common splice donor site. The virus is distinguishable from all known retrovirus groups by the presence of an arginine tRNA primer binding site. The coding regions are highly divergent and show a number of unusual characteristics, including a large Gag coiled-coil region, a Pol domain of unknown function, and a long, lentiviral-like, Env cytoplasmic domain. Phylogenetic analysis of the Pol sequence emphasizes the divergent nature of the virus from the avian and mammalian retroviruses. The snakehead virus is also distinct from a previously characterized complex fish retrovirus, suggesting that discrete groups of these viruses have yet to be identified in the lower vertebrates.

  9. Nucleotide sequences of five IncF plasmid finP alleles.

    PubMed Central

    Finlay, B B; Frost, L S; Paranchych, W; Willetts, N S

    1986-01-01

    The nucleotide sequences of five finP alleles from various IncF plasmids (finP types I to V) as well as of three finP mutations were determined and compared. The finP gene specificity could be attributed to a variable, six-to-seven-nucleotide loop located between inverted repeats, and the sequence data were consistent with the product of finP being an RNA molecule rather than a protein. The finP mutations interrupted a proposed finP promoter or destabilized a predicted stem-and-loop structure in the finP RNA molecule. PMID:2426248

  10. Structure-Based Sequence Alignment of the Transmembrane Domains of All Human GPCRs: Phylogenetic, Structural and Functional Implications

    PubMed Central

    Cvicek, Vaclav; Goddard, William A.; Abrol, Ravinder

    2016-01-01

    The understanding of G-protein coupled receptors (GPCRs) is undergoing a revolution due to increased information about their signaling and the experimental determination of structures for more than 25 receptors. The availability of at least one receptor structure for each of the GPCR classes, well separated in sequence space, enables an integrated superfamily-wide analysis to identify signatures involving the role of conserved residues, conserved contacts, and downstream signaling in the context of receptor structures. In this study, we align the transmembrane (TM) domains of all experimental GPCR structures to maximize the conserved inter-helical contacts. The resulting superfamily-wide GpcR Sequence-Structure (GRoSS) alignment of the TM domains for all human GPCR sequences is sufficient to generate a phylogenetic tree that correctly distinguishes all different GPCR classes, suggesting that the class-level differences in the GPCR superfamily are encoded at least partly in the TM domains. The inter-helical contacts conserved across all GPCR classes describe the evolutionarily conserved GPCR structural fold. The corresponding structural alignment of the inactive and active conformations, available for a few GPCRs, identifies activation hot-spot residues in the TM domains that get rewired upon activation. Many GPCR mutations, known to alter receptor signaling and cause disease, are located at these conserved contact and activation hot-spot residue positions. The GRoSS alignment places the chemosensory receptor subfamilies for bitter taste (TAS2R) and pheromones (Vomeronasal, VN1R) in the rhodopsin family, known to contain the chemosensory olfactory receptor subfamily. The GRoSS alignment also enables the quantification of the structural variability in the TM regions of experimental structures, useful for homology modeling and structure prediction of receptors. Furthermore, this alignment identifies structurally and functionally important residues in all human GPCRs

  11. Nucleotide sequence of a small cryptic plasmid from Acidithiobacillus ferrooxidans strain A-6

    SciTech Connect

    F. Roberto

    2003-10-01

    A 2.1 kb cryptic plasmid from Acidithiobacillus ferrooxidans strain A-6 was isolated and cloned into the E. coli vector plasmid, pUC128. The cloned plasmid was mapped by restriction enzyme fragment analysis and subsequently sequenced. At this time over half the plasmid sequence has been determined and compared to sequences in the GenBank nucleotide and protein sequence databases. Much of the plasmid remains cryptic, but substantial nucleotide and protein sequence similarities have been observed to the putative replication protein, RepA, of the small cryptic plasmids pAYS and pAYL found in the ammonia-oxidizing Nitrosomonas sp. strain ENI-11. These results suggest an entirely new class of plasmid is maintained in at least one strain of Acidithiobacillus ferrooxidans and other acidophilic bacteria, and raise interesting questions about the origin of this plasmid in acidic environments.

  12. The complete nucleotide sequence and genomic characterization of tropical soda apple mosaic virus.

    PubMed

    Fillmer, Kornelia; Adkins, Scott; Pongam, Patchara; D'Elia, Tom

    2016-08-01

    We report the first complete genome sequence of tropical soda apple mosaic virus (TSAMV), a tobamovirus originally isolated from tropical soda apple (Solanum viarum) collected in Okeechobee, Florida. The complete genome of TSAMV is 6,350 nucleotides long and contains four open reading frames encoding the following proteins: i) 126-kDa methyltransferase/helicase (3354 nt), ii) 183-kDa polymerase (4839 nt), iii) movement protein (771 nt) and iv) coat protein (483 nt). The complete genome sequence of TSAMV shares 80.4 % nucleotide sequence identity with pepper mild mottle virus (PMMoV) and 71.2-74.2 % identity with other tobamoviruses naturally infecting members of the Solanaceae plant family. Phylogenetic analysis of the deduced amino acid sequences of the 126-kDa and 183-kDa proteins and the complete genome sequence place TSAMV in a subcluster with PMMoV within the Solanaceae-infecting subgroup of tobamoviruses.

  13. Relationships amongst bluetongue viruses revealed by comparisons of capsid and outer coat protein nucleotide sequences.

    PubMed

    Gould, A R; Pritchard, L I

    1990-08-01

    Sequence data from the gene segments coding for the capsid protein, VP3, of all eight Australian bluetongue virus serotypes were compared. The high degree of nucleotide sequence homology for VP3 genes amongst BTV isolates from the same geographic region supported previous studies (Gould, 1987; 1988b, c; Gould et al., 1988b) and was proposed as a basis for "topotyping" a bluetongue virus isolate (Gould et al., 1989). The complete nucleotide sequences which coded for the VP2 outer coat proteins of South African BTV serotypes 1 and 3 (vaccine strains) were determined and compared to cognate gene sequences from North American and Australian BTVs. These VP2 comparisons demonstrated that BTVs of the same serotype, but from different geographical regions, were closely related at the nucleotide and amino acid levels. However, close inter-relationships were also demonstrated amongst other BTVs irrespective of serotype or geographic origin. These data enabled phylogenetic relationships of the BTV serotypes to be analysed using VP2 nucleotide sequences as a determinant.

  14. Characterization, nucleotide sequence and genome organization of leek white stripe virus, a putative new species of the genus Necrovirus.

    PubMed

    Lot, H; Rubino, L; Delecolle, B; Jacquemond, M; Turturo, C; Russo, M

    1996-01-01

    White stripe is a disease affecting leek in France with which an isometric virus c. 30 nm in diameter is associated. The most evident symptom is the presence of white stripes on the leaves extending to the stem. Attempts to demonstrate transmission through the soil by sowing or transplanting leek in contaminated soil were unsuccessful. The virus was transmitted by sap inoculation to a narrow range of herbaceous hosts, all of which were infected only locally. Virus purification was from infected leek tissues, where it accumulated in large amounts, as demonstrated by ultrastructural observations. RNA was extracted from purified virus preparations and cDNA clones were prepared. The complete nucleotide sequence of the viral RNA was determined: the genome is 3,662 nucleotides long and contains five open reading frames (ORFs). The first (ORF 1) encodes a putative translation product of M(r) 23,803 (p24), and readthrough of its amber stop codon results in a protein of M(r) 82,625 (p83) (ORF 2). ORF 3 and ORF 4 encode two small polypeptides of M(r) 11,280 (p11) and M(r) 6,261 (p6), respectively. ORF 5 encodes the capsid protein of M(r) 27,460 (p27). The genome organization and sequence alignments with the corresponding products of necroviruses suggest that the virus isolated from leek is a new species in the genus Necrovirus, for which the name of leek white stripe virus (LWSV) is proposed.

  15. Nucleotide composition of CO1 sequences in Chelicerata (Arthropoda): detecting new mitogenomic rearrangements.

    PubMed

    Arabi, Juliette; Judson, Mark L I; Deharveng, Louis; Lourenço, Wilson R; Cruaud, Corinne; Hassanin, Alexandre

    2012-02-01

    Here we study the evolution of nucleotide composition in third codon-positions of CO1 sequences of Chelicerata, using a phylogenetic framework, based on 180 taxa and three markers (CO1, 18S, and 28S rRNA; 5,218 nt). The analyses of nucleotide composition were also extended to all CO1 sequences of Chelicerata found in GenBank (1,701 taxa). The results show that most species of Chelicerata have a positive strand bias in CO1, i.e., in favor of C nucleotides, including all Amblypygi, Palpigradi, Ricinulei, Solifugae, Uropygi, and Xiphosura. However, several taxa show a negative strand bias, i.e., in favor of G nucleotides: all Scorpiones, Opisthothelae spiders and several taxa within Acari, Opiliones, Pseudoscorpiones, and Pycnogonida. Several reversals of strand-specific bias can be attributed to either a rearrangement of the control region or an inversion of a fragment containing the CO1 gene. Key taxa for which sequencing of complete mitochondrial genomes will be necessary to determine the origin and nature of mtDNA rearrangements involved in the reversals are identified. Acari, Opiliones, Pseudoscorpiones, and Pycnogonida were found to show a strong variability in nucleotide composition. In addition, both mitochondrial and nuclear genomes have been affected by higher substitution rates in Acari and Pseudoscorpiones. The results therefore indicate that these two orders are more liable to fix mutations of all types, including base substitutions, indels, and genomic rearrangements.
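
    A simple way to look at C versus G bias at third codon positions is sketched below. The function name and the skew formula (C - G) / (C + G) are illustrative assumptions; the abstract does not state the exact statistic used, and the input is assumed to be an in-frame coding sequence starting at the first codon position.

```python
from collections import Counter

def third_position_cg_skew(coding_seq):
    """Return (C - G) / (C + G) over third codon positions.

    Positive values indicate an excess of C, negative values an excess
    of G. Returns 0.0 if neither base occurs at a third position.
    """
    # With 0-based indexing, third codon positions are 2, 5, 8, ...
    third_positions = coding_seq.upper()[2::3]
    counts = Counter(third_positions)
    c, g = counts["C"], counts["G"]
    if c + g == 0:
        return 0.0
    return (c - g) / (c + g)
```

    For the toy sequence `ATCGGCTTCAAG` the third positions are C, C, C and G, giving a skew of 0.5, which would correspond to a bias in favor of C in the sense used in the abstract.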

  16. [Research on the recombinant plasmid pDJH2 of L. interrogans serovar lai: sequencing and alignment with other known bacterial Omp sequence].

    PubMed

    Jiang, N; Dai, B; Yan, Z; Yang, W; Li, S; Fang, Z; Zhao, H; Wu, W; Ye, D; Yan, R; Liu, J; Song, S; Yang, Y; Zhang, Y; Liu, F; Tu, Y; Yang, H; Huang, Z; Liang, L; Hu, L; Zhao, M

    1996-12-01

    The Leptospira whole cell vaccine (LWCV) currently used in China is safe and effective, but the immunity following vaccination with two doses of the fluid medium vaccine is of a low order. The duration of immunity conferred by this vaccine is rather short: six months, or at most one year. Therefore, it is necessary to develop new-generation vaccines against leptospirosis for the developing world. In this paper we report the sequencing of the insert fragment of pDJH2 from genomic DNA of L. interrogans serovar lai strain 017 and its alignment with other bacterial omp sequences. A genomic library of Leptospira interrogans serovar lai strain 017 was constructed with the plasmid vector pUC18. A recombinant plasmid designated pDJH2 was screened from the genomic library. The inserted fragment of pDJH2 is 1.9 kb by gel electrophoresis. Immunization/protection was studied in a BALB/c mouse model. The results showed a highly significant difference between pDJH2 and pUC18 (control). DNA sequencing of the inserted fragment of pDJH2 was performed by Dr Yan Zhengxin (Max-Planck-Institut for Biology, Tübingen, Germany). The insert fragment was cloned into pBluescript II KS- (Stratagene) and sequenced using ABI (Applied Biosystems, Model 373A). Two open reading frames of 565 and 662 nucleotides were identified. There were identifiable initiation codons, terminators, Shine-Dalgarno ribosome binding sites, Pribnow boxes and Sextama boxes within the two sequenced regions. Nucleotide sequences were analysed using Gene Work, a suite of computer programs developed by the Department of Biochemistry, St. Jude Children's Research Hospital, Memphis, U.S.A. The results of formatted alignment showed that the predicted nucleotide sequence of ORF1 of serovar lai had significant similarity with ORF2 (49.36%), L. kirschneri ompL1 (49.26%), Borrelia burgdorferi omp (48.97%), Treponema phagedenis omp (47.3%), Salmonella typhimurium ompC (46.87%), Yersinia enterocolitica ompH (46.7%), Leptospira borgpetersenii pfap (46.3%), and

  17. Tcoffee@igs: A web server for computing, evaluating and combining multiple sequence alignments.

    PubMed

    Poirot, Olivier; O'Toole, Eamonn; Notredame, Cedric

    2003-07-01

    This paper presents Tcoffee@igs, a new server provided to the community by Hewlett Packard computers and the Centre National de la Recherche Scientifique. This server is a web-based tool dedicated to the computation, the evaluation and the combination of multiple sequence alignments. It uses the latest version of the T-Coffee package. Given a set of unaligned sequences, the server returns an evaluated multiple sequence alignment and the associated phylogenetic tree. This server also makes it possible to evaluate the local reliability of an existing alignment and to combine several alternative multiple alignments into a single new one. Tcoffee@igs can be used for aligning protein, RNA or DNA sequences. Datasets of up to 100 sequences (2000 residues long) can be processed. The server and its documentation are available from: http://igs-server.cnrs-mrs.fr/Tcoffee/.

  18. Single nucleotide polymorphism mining and nucleotide sequence analysis of Mx1 gene in exonic regions of Japanese quail

    PubMed Central

    Niraj, Diwesh Kumar; Kumar, Pushpendra; Mishra, Chinmoy; Narayan, Raj; Bhattacharya, Tarun Kumar; Shrivastava, Kush; Bhushan, Bharat; Tiwari, Ashok Kumar; Saxena, Vishesh; Sahoo, Nihar Ranjan; Sharma, Deepak

    2015-01-01

    Aim: An attempt has been made to study the Myxovirus resistant (Mx1) gene polymorphism in Japanese quail. Materials and Methods: In the present investigation, four fragments, viz. Fragment I of 185 bp (Exon 3 region), Fragment II of 148 bp (Exon 5 region), Fragment III of 161 bp (Exon 7 region), and Fragment IV of 176 bp (Exon 13 region) of the Mx1 gene, were amplified and screened for polymorphism by the polymerase chain reaction-single-strand conformation polymorphism technique in 170 Japanese quail birds. Results: Out of the four fragments, one fragment (Fragment II) was found to be polymorphic. The remaining three fragments (Fragments I, III, and IV) were found to be monomorphic, which was confirmed by custom sequencing. Overall nucleotide sequence analysis of the Mx1 gene of Japanese quail showed 100% homology with common quail and more than 80% homology with the reported sequences of chicken breeds. Conclusion: The Mx1 gene is mostly conserved in Japanese quail. There is an urgent need for comprehensive analysis of other regions of the Mx1 gene, along with its possible association with traits of economic importance in Japanese quail. PMID:27047057

  19. Complete nucleotide sequence of the chlorarachniophyte nucleomorph: nature's smallest nucleus.

    PubMed

    Gilson, Paul R; Su, Vanessa; Slamovits, Claudio H; Reith, Michael E; Keeling, Patrick J; McFadden, Geoffrey I

    2006-06-20

    The introduction of plastids into different heterotrophic protists created lineages of algae that diversified explosively, proliferated in marine and freshwater environments, and radically altered the biosphere. The origins of these secondary plastids are usually inferred from the presence of additional plastid membranes. However, two examples provide unique snapshots of secondary-endosymbiosis-in-action, because they retain a vestige of the endosymbiont nucleus known as the nucleomorph. These are chlorarachniophytes and cryptomonads, which acquired their plastids from a green and red alga respectively. To allow comparisons between them, we have sequenced the nucleomorph genome from the chlorarachniophyte Bigelowiella natans: at a mere 373,000 bp and with only 331 genes, the smallest nuclear genome known and a model for extreme reduction. The genome is eukaryotic in nature, with three linear chromosomes containing densely packed genes with numerous overlaps. The genome is replete with 852 introns, but these are the smallest introns known, being only 18, 19, 20, or 21 nt in length. These pygmy introns are shown to be miniaturized versions of normal-sized introns present in the endosymbiont at the time of capture. Seventeen nucleomorph genes encode proteins that function in the plastid. The other nucleomorph genes are housekeeping entities, presumably underpinning maintenance and expression of these plastid proteins. Chlorarachniophyte plastids are thus serviced by three different genomes (plastid, nucleomorph, and host nucleus) requiring remarkable coordination and targeting. Although originating by two independent endosymbioses, chlorarachniophyte and cryptomonad nucleomorph genomes have converged upon remarkably similar architectures but differ in many molecular details that reflect two distinct trajectories to hypercompaction and reduction.

  20. On the feasibility of using the intrinsic fluorescence of nucleotides for DNA sequencing.

    SciTech Connect

    Chowdhury, M. H.; Ray, K.; Johnson, R. L.; Gray, S. K.; Pond, J.; Lakowicz, J. R.; Univ. of Maryland; Univ. of Virginia; Lumerical Solutions, Inc.

    2010-04-29

    There is presently a worldwide effort to increase the speed and decrease the cost of DNA sequencing, as exemplified by the goal of the National Human Genome Research Institute (NHGRI) to sequence a human genome for under $1000. Several high throughput technologies are under development. Among these, single strand sequencing using exonuclease appears very promising. However, this approach requires complete labeling of at least two bases at a time, with extrinsic high quantum yield probes. This is necessary because nucleotides absorb in the deep ultraviolet (UV) and emit with extremely low quantum yields. Hence intrinsic emission from DNA and nucleotides is not being exploited for DNA sequencing. In the present paper we consider the possibility of identifying single nucleotides using their intrinsic emission. We used the finite-difference time-domain (FDTD) method to calculate the effects of aluminum nanoparticles on nearby fluorophores that emit in the UV. We find that the radiated power of UV fluorophores is significantly increased when they are in close proximity to aluminum nanostructures. We show that there will be increased localized excitation near aluminum particles at wavelengths used to excite intrinsic nucleotide emission. Using FDTD simulation we show that a typical DNA base, when coupled to appropriate aluminum nanostructures, leads to highly directional emission. Additionally we present experimental results showing that a thin film of nucleotides shows enhanced emission when in close proximity to aluminum nanostructures. Finally we provide Monte Carlo simulations that predict high levels of base calling accuracy for an assumed number of photons that is derived from the emission spectra of the intrinsic fluorescence of the bases. Our results suggest that single nucleotides can be detected and identified using aluminum nanostructures that enhance their intrinsic emission. This capability would be valuable for the ongoing efforts toward the $1000 genome.

  1. Secretory pancreatic stone protein messenger RNA. Nucleotide sequence and expression in chronic calcifying pancreatitis.

    PubMed Central

    Giorgi, D; Bernard, J P; Rouquier, S; Iovanna, J; Sarles, H; Dagorn, J C

    1989-01-01

    The pancreatic stone protein and its secretory form (PSP-S) are inhibitors of CaCO3 crystal growth, possibly involved in the stabilization of pancreatic juice. We have established the structure of PSP-S mRNA and monitored its expression in chronic calcifying pancreatitis (CCP). A cDNA encoding pre-PSP-S has been cloned from a human pancreatic cDNA library. Its nucleotide sequence revealed that it comprised all but the 5' end of PSP-S mRNA, which was obtained by sequencing the first exon of the PSP-S gene. The complete mRNA sequence is 775 nucleotides long, including 5'- and 3'- noncoding regions of 80 and 197 nucleotides, respectively, attached to a poly(A) tail of approximately 125 nucleotides. It encodes a preprotein of 166 amino acids, including a prepeptide of 22 amino acids. No overall sequence homology was found between PSP-S and other pancreatic proteins. Some homology with several serine proteases was observed in the COOH-terminal region, however. The mRNA levels of PSP-S, trypsinogen, chymotrypsinogen, and colipase in CCP and control pancreas were compared. PSP-S mRNA was three times lower in CCP than in control, whereas the others were not altered. It was concluded that PSP-S gene expression is specifically reduced in CCP patients. PMID:2525567
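
    The lengths quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below only restates the reported figures; the variable names are illustrative, and treating the stop codon as part of the 3'-noncoding region and the prepeptide as simply cleaved off are assumptions, not statements from the record.

```python
# Figures quoted in the abstract above.
mrna_length = 775      # complete mRNA, not counting the poly(A) tail
utr5, utr3 = 80, 197   # 5'- and 3'-noncoding regions
preprotein_aa = 166    # amino acids in the preprotein
prepeptide_aa = 22     # amino acids in the prepeptide

# Nucleotides left for the coding region once both noncoding regions are removed.
coding_nt = mrna_length - utr5 - utr3
print(coding_nt)                       # 498
print(coding_nt == 3 * preprotein_aa)  # True: 166 codons of 3 nucleotides each

# Residues remaining if the prepeptide is removed (assumption: simple cleavage).
remaining_aa = preprotein_aa - prepeptide_aa
print(remaining_aa)                    # 144
```

    The coding region therefore accounts for exactly 166 codons, consistent with the stated preprotein length.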

  2. A Novel Method for Alignment-free DNA Sequence Similarity Analysis Based on the Characterization of Complex Networks

    PubMed Central

    Zhou, Jie; Zhong, Pianyu; Zhang, Tinghui

    2016-01-01

    Determination of sequence similarity is one of the major steps in computational phylogenetic studies. One of the major tasks of computational biologists is to develop novel mathematical descriptors for similarity analysis. DNA clustering is an important technology that automatically identifies inherent relationships among large-scale DNA sequences. The comparison between the DNA sequences of different species helps determine phylogenetic relationships among species. Alignment-free approaches have continuously gained interest in various sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, particularly for large-scale sequence datasets. Here, we construct a novel and simple mathematical descriptor based on the characterization of cis sequence complex DNA networks. This new approach is based on a code of three cis nucleotides in a gene that could code for an amino acid. In particular, for each DNA sequence, we will set up a cis sequence complex network that will be used to develop a characterization vector for the analysis of mitochondrial DNA sequence phylogenetic relationships among nine species. The resulting phylogenetic relationships among the nine species were determined to be in agreement with the actual situation. PMID:27746676
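
    The abstract does not spell out how its network descriptor is built, so the sketch below does not attempt to reproduce it. As a stand-in, it illustrates the general alignment-free idea with a much simpler technique: each sequence is summarised as a trinucleotide frequency vector, and two vectors are compared by cosine distance. The function names and the choice of cosine distance are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 64 trinucleotides in a fixed order, so every vector has the same layout.
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]

def trinucleotide_vector(seq):
    """Relative frequencies of overlapping 3-mers (3-mers with other symbols are ignored)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts[k] for k in KMERS)
    return [counts[k] / total if total else 0.0 for k in KMERS]

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

a = trinucleotide_vector("ATGGCGTACGATCGATCGTAGC")
b = trinucleotide_vector("ATGGCGTACGTTCGATCGTAGC")
print(cosine_distance(a, b))
```

    A matrix of such pairwise distances could then be passed to any standard clustering or tree-building routine; no sequence alignment is needed at any step.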

  3. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB.

    PubMed

    Pruesse, Elmar; Quast, Christian; Knittel, Katrin; Fuchs, Bernhard M; Ludwig, Wolfgang; Peplies, Jörg; Glöckner, Frank Oliver

    2007-01-01

    Sequencing ribosomal RNA (rRNA) genes is currently the method of choice for phylogenetic reconstruction, nucleic acid based detection and quantification of microbial diversity. The ARB software suite with its corresponding rRNA datasets has been accepted by researchers worldwide as a standard tool for large scale rRNA analysis. However, the rapid increase of publicly available rRNA sequence data has recently hampered the maintenance of comprehensive and curated rRNA knowledge databases. A new system, SILVA (from Latin silva, forest), was implemented to provide a central comprehensive web resource for up to date, quality controlled databases of aligned rRNA sequences from the Bacteria, Archaea and Eukarya domains. All sequences are checked for anomalies, carry a rich set of sequence associated contextual information, have multiple taxonomic classifications, and the latest validly described nomenclature. Furthermore, two precompiled sequence datasets compatible with ARB are offered for download on the SILVA website: (i) the reference (Ref) datasets, comprising only high quality, nearly full length sequences suitable for in-depth phylogenetic analysis and probe design and (ii) the comprehensive Parc datasets with all publicly available rRNA sequences longer than 300 nucleotides suitable for biodiversity analyses. The latest publicly available database release 91 (August 2007) hosts 547 521 sequences split into 461 823 small subunit and 85 689 large subunit rRNAs.

  4. Nucleotide sequence of the alpha-amylase-pullulanase gene from Clostridium thermohydrosulfuricum.

    PubMed

    Melasniemi, H; Paloheimo, M; Hemiö, L

    1990-03-01

    The nucleotide sequence of the gene (apu) encoding the thermostable alpha-amylase-pullulanase of Clostridium thermohydrosulfuricum was determined. An open reading frame of 4425 bp was present. The deduced polypeptide (Mr 165,600), including a 31 amino acid putative signal sequence, comprised 1475 amino acids, with no cysteine residues. The structural gene was preceded by the consensus promoter sequence TTGACA TATAAT, a putative regulatory sequence and a putative ribosome-binding sequence AAAGGGGG. The codon usage resembled that of Bacillus genes. The deduced sequence of the mature apu product showed similarities to various amylolytic enzymes, especially the neopullulanase of Bacillus stearothermophilus, whereas the signal sequence showed similarity to those of the alpha-amylases of B. stearothermophilus and B. subtilis. Three regions thought to be highly conserved in the primary structure of alpha-amylases could also be distinguished in the apu product, two being partly 'duplicated' in this alpha-1,4/alpha-1,6-active enzyme.

  5. BarraCUDA - a fast short read sequence aligner using graphics processing units

    PubMed Central

    2012-01-01

    Background: With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General purpose computing on graphics processing units (GPGPU) extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy-efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software that is based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings: Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computationally intensive alignment component of BWA to GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers an order-of-magnitude performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput. Conclusions: BarraCUDA is designed to take advantage of the parallelism of GPU to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline such that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net PMID:22244497

  6. Nucleotide sequence of a cloned woodchuck hepatitis virus genome: evolutional relationship between hepadnaviruses.

    PubMed Central

    Kodama, K; Ogasawara, N; Yoshikawa, H; Murakami, S

    1985-01-01

    We have determined the complete nucleotide sequence of a cloned DNA of woodchuck hepatitis virus (WHV), the most oncogenic virus among hepadnaviruses. The genome, designated WHV2, is 3,320 base pairs long and contains four major open reading frames (ORFs) coded on the same strand of nucleotide sequence as in the human hepatitis B virus (HBV) genome. Comparison of the nucleotide sequence and amino acid sequences deduced from it among the genomes of various hepadnaviruses demonstrates that each protein shows an intrinsic property in conserving its amino acid sequence. A parameter, the ratio of the number of triplets with one-letter change but no amino acid substitution to the total number of triplets in which one-letter change occurred, was introduced to measure the intrinsic properties quantitatively. For each ORF, the parameter gave characteristic values in all combinations. Therefore, the relative evolutional distance between these hepadnaviruses can be measured by the amino acid substitution rate of any ORF. These comparisons suggest that (i) the difference between two WHV clones, WHV1 and WHV2, corresponds to that among clones of a HBV subtype, HBVadr, and (ii) WHV and ground squirrel hepatitis virus can be categorized in a way similar to the subgroups of HBV. PMID:3855246
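
    The parameter described in this abstract (triplets with a one-letter change but no amino acid substitution, divided by all triplets with a one-letter change) can be sketched as follows. This is a minimal illustration, assuming two in-frame, equal-frame coding sequences and the standard genetic code; the function name and the handling of gaps or ambiguous codons are choices made here, not details from the record.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def silent_change_ratio(seq_a, seq_b):
    """Among codon pairs differing at exactly one base, the fraction that are synonymous."""
    one_change = silent = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        ca, cb = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue  # skip codons containing gaps or ambiguity codes
        if sum(x != y for x, y in zip(ca, cb)) == 1:
            one_change += 1
            if CODON_TABLE[ca] == CODON_TABLE[cb]:
                silent += 1
    return silent / one_change if one_change else float("nan")

# GCT -> GCC keeps alanine (silent); AAA -> AGA changes lysine to arginine.
print(silent_change_ratio("ATGGCTAAA", "ATGGCCAGA"))  # 0.5
```

    A value near 1 would indicate that most single-base changes in that reading frame leave the protein unchanged, which is the sense in which the abstract speaks of a protein conserving its amino acid sequence.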

  7. Drosophila melanogaster mitochondrial DNA: completion of the nucleotide sequence and evolutionary comparisons.

    PubMed

    Lewis, D L; Farr, C L; Kaguni, L S

    1995-11-01

    The nucleotide sequence of the regions flanking the A+T region of Drosophila melanogaster mitochondrial DNA (mtDNA) has been determined. Included are the genes encoding the transfer RNAs for valine, isoleucine, glutamine and methionine, the small ribosomal RNA and the 5'-coding sequences of the large ribosomal RNA and NADH dehydrogenase subunit II. This completes the nucleotide sequence of the D. melanogaster mitochondrial genome. The circular mtDNA of D. melanogaster varies in size among different populations largely due to length differences in the control region (Fauron & Wolstenholme, 1976; Fauron & Wolstenholme, 1980a, b); the mtDNA region we have sequenced, combined with those sequenced by others, yields a composite genome that is 19,517 bp in length as compared to 16,019 bp for the mtDNA of D. yakuba. D. melanogaster mtDNA exhibits an extreme bias in base composition; it comprises 82.2% deoxyadenylate and thymidylate residues as compared to 78.6% in D. yakuba mtDNA. All genes encoded in the mtDNA of both species are in identical locations and orientations. Nucleotide substitution analysis reveals that tRNA and rRNA genes evolve at less than half the rate of protein coding genes.
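
    Base-composition figures such as the A+T percentages quoted here are straightforward to compute from a sequence. The sketch below is a generic illustration on a made-up toy sequence (not data from the record), and ignoring non-ACGT symbols is an assumption of this example.

```python
def at_percent(seq):
    """Percentage of A and T among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(b in "AT" for b in bases) / len(bases)

# Toy sequence: 6 of its 8 bases are A or T.
print(at_percent("ATTAGCAT"))  # 75.0
```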

  8. The human myelin oligodendrocyte glycoprotein (MOG) gene: Complete nucleotide sequence and structural characterization

    SciTech Connect

    Paule Roth, M.; Malfroy, L.; Offer, C.; Sevin, J.; Enault, G.; Borot, N.; Pontarotti, P.; Coppin, H.

    1995-07-20

    Human myelin oligodendrocyte glycoprotein (MOG), a myelin component of the central nervous system, is a candidate target antigen for autoimmune-mediated demyelination. We have isolated and sequenced part of a cosmid clone that contains the entire human MOG gene. The primary nuclear transcript, extending from the putative start of transcription to the site of poly(A) addition, is 15,561 nucleotides in length. The human MOG gene contains 8 exons, separated by 7 introns; canonical intron/exon boundary sites are observed at each junction. The introns vary in size from 242 to 6484 bp and contain numerous repetitive DNA elements, including 14 Alu sequences within 3 introns. Another Alu element is located in the 3′-untranslated region of the gene. Alu sequences were classified with respect to subfamily assignment. Seven hundred sixty-three nucleotides 5′ of the transcription start and 1214 nucleotides 3′ of the poly(A) addition sites were also sequenced. The 5′-flanking region revealed the presence of several consensus sequences that could be relevant in the transcription of the MOG gene, in particular binding sites in common with other myelin gene promoters. Two polymorphic intragenic dinucleotide (CA)n and tetranucleotide (TAAA)n repeats were identified and may provide genetic marker tools for association and linkage studies. 50 refs., 3 figs., 3 tabs.

  9. Nucleotide sequence and genome organization of atractylodes mottle virus, a new member of the genus Carlavirus.

    PubMed

    Zhao, Fumei; Igori, Davaajargal; Lim, Seungmo; Yoo, Ran Hee; Lee, Su-Heon; Moon, Jae Sun

    2015-11-01

    The complete genome sequence of a member of a distinct species of the genus Carlavirus in the family Betaflexiviridae, tentatively named atractylodes mottle virus (AtrMoV), has been determined. Analysis of its genomic organization indicates that it has a single-stranded, positive-sense genomic RNA of 8866 nucleotides, excluding the poly(A) tail, and consists of six open reading frames typical of members of the genus Carlavirus. The individual open reading frames of AtrMoV show moderately low sequence similarity to those of other carlaviruses at the nucleotide and amino acid sequence levels. Pairwise comparison and phylogenetic analysis suggest that AtrMoV is most closely related to chrysanthemum virus B. PMID:26264403

  10. Analysis of the complete nucleotide sequence of the Agrobacterium tumefaciens virB operon.

    PubMed

    Thompson, D V; Melchers, L S; Idler, K B; Schilperoort, R A; Hooykaas, P J

    1988-05-25

    The complete nucleotide sequence of the virB locus, from the octopine Ti plasmid of Agrobacterium tumefaciens strain 15955, has been determined. In the large virB-operon (9600 nucleotides) we have identified eleven open reading frames, designated virB1 to virB11. From DNA sequence analysis it is proposed that nearly all VirB products, i.e. VirB1 to VirB9, are secreted or membrane associated proteins. Interestingly, both a membrane protein (VirB4) and a potential cytoplasmic protein (VirB11) contain the consensus amino acid sequence of ATP-binding proteins. In view of the conjugative T-DNA transfer model, the VirB proteins are suggested to act at the bacterial surface and there play an important role in directing T-DNA transfer to plant cells. PMID:2837739

  11. A sequence of seventy-three nucleotides from the coliphage R17 genome

    PubMed Central

    Rensing, Ulrich F. E.

    1973-01-01

    1. A sequence of 73 nucleotides of the RNA genome from coliphage R17 was determined. It can be read through in only one translational frame. The fragment is not part of the coat protein cistron (Min Jou et al., 1972), nor does it come from the untranslated sequences described previously (Steitz, 1969; Nichols, 1970; Cory et al., 1970; de Wachter et al., 1971; Contreras et al., 1971; Cory et al., 1972). It contains two sequences of 23 and 24 nucleotides, 22 of which are identical. This kind of reiteration is the first one found in bacteriophage nucleic acid. 2. Improved conditions were found and tested for blocking oligonucleotides with carbodi-imide and cleaving by ribonuclease A at cytidylate residues. 3. A synthetic medium is described which allows labelling in vivo with 32P to give specific radioactivities higher than those obtained in the procedures used previously. PMID:4352721

  12. Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties

    PubMed Central

    Neuwald, Andrew F.; Altschul, Stephen F.

    2016-01-01

    We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a “top-down” strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins’ structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO’s superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/. PMID

  13. Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties.

    PubMed

    Neuwald, Andrew F; Altschul, Stephen F

    2016-05-01

    We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/. PMID:27192614
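
    For readers unfamiliar with the sum-of-the-pairs scoring system that the abstract contrasts with GISMO's Bayesian measure, a minimal sketch follows. It is not GISMO's method. The match, mismatch and gap values are arbitrary illustrative assumptions; practical programs typically use substitution matrices and affine gap penalties.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: every pair of rows is scored in every alignment column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue          # gap against gap contributes nothing
            if a == "-" or b == "-":
                total += gap      # residue against gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total

msa = ["AC-T",
       "ACGT",
       "A-GT"]
print(sum_of_pairs(msa))  # 0
```

    Because the score is a plain sum over columns, an optimiser can always find some arrangement of unrelated sequences that maximises it, which is the behaviour the abstract criticises when it notes that this scoring system will align random sequences.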

  14. Complete nucleotide sequence of a subviral DNA molecule of porcine circovirus type 2.

    PubMed

    Wen, Han

    2016-07-01

    Porcine circovirus type 2 (PCV2) is a member of the genus Circovirus in the family Circoviridae. Most subgenomic molecules of PCV2 have been mapped. Here, the first full-length sequence of a subviral molecule of PCV2 (CH-IVT12) containing a reverse complement sequence of the PCV2 genome was determined by sequencing DNA extracted from PK15 cells infected with PCV2. The circular CH-IVT12 DNA consists of 1136 nucleotides and contains one major open reading frame. PMID:27084550

  16. Sequence selective naked-eye detection of DNA harnessing extension of oligonucleotide-modified nucleotides.

    PubMed

    Verga, Daniela; Welter, Moritz; Marx, Andreas

    2016-02-01

    DNA polymerases can efficiently and sequence selectively incorporate oligonucleotide (ODN)-modified nucleotides and the incorporated oligonucleotide strand can be employed as primer in rolling circle amplification (RCA). The effective amplification of the DNA primer by Φ29 DNA polymerase allows the sequence-selective hybridisation of the amplified strand with a G-quadruplex DNA sequence that has horse radish peroxidase-like activity. Based on these findings we develop a system that allows DNA detection with single-base resolution by naked eye. PMID:26774580

  17. The nucleotide sequence at the termini of adenovirus type 5 DNA.

    PubMed Central

    Steenbergh, P H; Maat, J; van Ormondt, H; Sussenbach, J S

    1977-01-01

    The sequences of the first 194 base pairs at both termini of adenovirus type 5 (Ad5) DNA have been determined, using the chemical degradation technique developed by Maxam and Gilbert (Proc. Nat. Acad. Sci. USA 74 (1977), pp. 560-564). The nucleotide sequences 1-75 were confirmed by analysis of labeled RNA transcribed from the terminal HhaI fragments in vitro. The sequence data show that Ad5 DNA has a perfect inverted terminal repetition 103 base pairs long. PMID:600799
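
    An inverted terminal repetition means that one end of a strand, read 5' to 3', equals the reverse complement of the other end. The sketch below measures the length of a perfect repeat of this kind on a made-up toy sequence; the function names are illustrative and the example is not Ad5 data.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def inverted_terminal_repeat_length(seq):
    """Length of the perfect inverted terminal repeat shared by both ends of seq."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    n = 0
    # Position n of rc is the complement of the n-th base from the 3' end,
    # so a matching prefix of seq and rc is exactly an inverted terminal repeat.
    while n < len(seq) // 2 and seq[n] == rc[n]:
        n += 1
    return n

# Toy genome: a 7-base terminus, a spacer, then the reverse complement of the terminus.
toy = "GATTACA" + "CCCCC" + reverse_complement("GATTACA")
print(inverted_terminal_repeat_length(toy))  # 7
```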

  18. Complete nucleotide sequence analysis of a Dengue-1 virus isolated on Easter Island, Chile.

    PubMed

    Cáceres, C; Yung, V; Araya, P; Tognarelli, J; Villagra, E; Vera, L; Fernández, J

    2008-01-01

    Dengue-1 viruses responsible for the dengue fever outbreak in Easter Island in 2002 were isolated from acute-phase sera of dengue fever patients. In order to analyze the complete genome sequence, we designed primers to amplify contiguous segments across the entire sequence of the viral genome. RT-PCR products obtained were cloned, and complete nucleotide and deduced amino acid sequences were determined. This report constitutes the first complete genetic characterization of a DENV-1 isolate from Chile. Phylogenetic analysis shows that an Easter Island isolate is most closely related to Pacific DENV-1 genotype IV viruses.

  19. Upcoming challenges for multiple sequence alignment methods in the high-throughput era

    PubMed Central

    Kemena, Carsten; Notredame, Cedric

    2009-01-01

    This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches. Contact: cedric.notredame@crg.es PMID:19648142

  20. Nucleotide sequence discrepancies within the GA strain of Marek's disease virus

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Comparative genomics between 9 gallid herpesvirus type 2 strains have singled out the virulent (v) prototype strain GA as phylogenetically distant from other v pathotypes. Multiple amino acid alignments of otherwise highly conserved unique long (UL) genes have indicated sequence discrepancies within...

  1. RNA-Pareto: interactive analysis of Pareto-optimal RNA sequence-structure alignments.

    PubMed

    Schnattinger, Thomas; Schöning, Uwe; Marchfelder, Anita; Kestler, Hans A

    2013-12-01

    Incorporating secondary structure information into the alignment process improves the quality of RNA sequence alignments. Instead of using fixed weighting parameters, sequence and structure components can be treated as different objectives and optimized simultaneously. The result is not a single, but a Pareto-set of equally optimal solutions, which all represent different possible weighting parameters. We now provide the interactive graphical software tool RNA-Pareto, which allows a direct inspection of all feasible results to the pairwise RNA sequence-structure alignment problem and greatly facilitates the exploration of the optimal solution set.
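
    The idea of a Pareto set of alignments can be illustrated with a minimal sketch. It assumes each candidate alignment has already been reduced to a pair of scores (sequence, structure) where higher is better; the function names and the score values are hypothetical and are not taken from RNA-Pareto itself.

```python
from typing import List, Tuple

Score = Tuple[float, float]  # (sequence score, structure score); higher is better


def dominates(a: Score, b: Score) -> bool:
    """True if a is at least as good as b in both objectives and strictly better in one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(candidates: List[Score]) -> List[Score]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical (sequence, structure) scores for five candidate alignments.
scores = [(10.0, 2.0), (8.0, 5.0), (6.0, 6.0), (7.0, 4.0), (9.0, 1.0)]
front = pareto_front(scores)
```

    Each surviving pair corresponds to a different trade-off between the two objectives, which is why no single weighting has to be fixed in advance.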

  2. PFAAT version 2.0: A tool for editing, annotating, and analyzing multiple sequence alignments

    PubMed Central

    Caffrey, Daniel R; Dana, Paul H; Mathur, Vidhya; Ocano, Marco; Hong, Eun-Jong; Wang, Yaoyu E; Somaroo, Shyamal; Caffrey, Brian E; Potluri, Shobha; Huang, Enoch S

    2007-01-01

    Background: By virtue of their shared ancestry, homologous sequences are similar in their structure and function. Consequently, multiple sequence alignments are routinely used to identify trends that relate to function. This type of analysis is particularly productive when it is combined with structural and phylogenetic analysis. Results: Here we describe the release of PFAAT version 2.0, a tool for editing, analyzing, and annotating multiple sequence alignments. Support for multiple annotations is a key component of this release as it provides a framework for most of the new functionalities. The sequence annotations are accessible from the alignment and tree, where they are typically used to label sequences or hyperlink them to related databases. Sequence annotations can be created manually or extracted automatically from UniProt entries. Once a multiple sequence alignment is populated with sequence annotations, sequences can be easily selected and sorted through a sophisticated search dialog. The selected sequences can be further analyzed using statistical methods that explicitly model relationships between the sequence annotations and residue properties. Residue annotations are accessible from the alignment viewer and are typically used to designate binding sites or properties for a particular residue. Residue annotations are also searchable, and allow one to quickly select alignment columns for further sequence analysis, e.g. computing percent identities. Other features include: novel algorithms to compute sequence conservation, mapping conservation scores to a 3D structure in Jmol, displaying secondary structure elements, and sorting sequences by residue composition. Conclusion: PFAAT provides a framework whereby end-users can specify knowledge for a protein family in the form of annotation. The annotations can be combined with sophisticated analysis to test hypotheses that relate to sequence, structure and function. PMID:17931421

  3. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    PubMed Central

    Schmollinger, Martin; Nieselt, Kay; Kaufmann, Michael; Morgenstern, Burkhard

    2004-01-01

    Background: Parallel computing is frequently used to speed up computationally expensive tasks in bioinformatics. Results: Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) Pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic that splits sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. Conclusions: By distributing sub-routines to multiple processors, the running time of DIALIGN can be substantially reduced. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope. PMID:15357879
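
    Point (a) above, that pairwise comparisons are independent and can therefore be farmed out to separate workers, can be sketched as follows. This is an illustration only and is not DIALIGN code: `toy_pair_score` is a hypothetical stand-in for a real pairwise alignment, and a thread pool is used purely so the sketch runs anywhere.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import combinations
from typing import Dict, List, Tuple


def toy_pair_score(pair: Tuple[str, str]) -> int:
    """Placeholder for a real pairwise alignment: counts matching positions."""
    a, b = pair
    return sum(x == y for x, y in zip(a, b))


def score_all_pairs(seqs: List[str], executor: Executor) -> Dict[Tuple[int, int], int]:
    """Score every sequence pair; each pair is an independent task."""
    index_pairs = list(combinations(range(len(seqs)), 2))
    seq_pairs = [(seqs[i], seqs[j]) for i, j in index_pairs]
    # executor.map preserves input order, so results line up with index_pairs.
    return dict(zip(index_pairs, executor.map(toy_pair_score, seq_pairs)))


sequences = ["ACGTAC", "ACGTTC", "TCGTAC"]
with ThreadPoolExecutor(max_workers=2) as pool:
    pair_scores = score_all_pairs(sequences, pool)
```

    Because no task depends on another, the result is the same whatever the number of workers. For CPU-bound work in CPython, a process pool (`concurrent.futures.ProcessPoolExecutor`) would be the more likely choice than threads.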

  4. Gapped alignment of protein sequence motifs through Monte Carlo optimization of a hidden Markov model

    PubMed Central

    Neuwald, Andrew F; Liu, Jun S

    2004-01-01

    Background: Certain protein families are highly conserved across distantly related organisms and belong to large and functionally diverse superfamilies. The patterns of conservation present in these protein sequences presumably are due to selective constraints maintaining important but unknown structural mechanisms with some constraints specific to each family and others shared by a larger subset or by the entire superfamily. To exploit these patterns as a source of functional information, we recently devised a statistically based approach called contrast hierarchical alignment and interaction network (CHAIN) analysis, which infers the strengths of various categories of selective constraints from co-conserved patterns in a multiple alignment. The power of this approach strongly depends on the quality of the multiple alignments, which thus motivated development of theoretical concepts and strategies to improve alignment of conserved motifs within large sets of distantly related sequences. Results: Here we describe a hidden Markov model (HMM), an algebraic system, and Markov chain Monte Carlo (MCMC) sampling strategies for alignment of multiple sequence motifs. The MCMC sampling strategies are useful both for alignment optimization and for adjusting position specific background amino acid frequencies for alignment uncertainties. Associated statistical formulations provide an objective measure of alignment quality as well as automatic gap penalty optimization. Improved alignments obtained in this way are compared with PSI-BLAST based alignments within the context of CHAIN analysis of three protein families: Giα subunits, prolyl oligopeptidases, and transitional endoplasmic reticulum (p97) AAA+ ATPases. Conclusion: While not entirely replacing PSI-BLAST based alignments, which likewise may be optimized for CHAIN analysis using this approach, these motif-based methods often more accurately align very distantly related sequences and thus can provide a better measure of

  5. Cloning and nucleotide sequence of the alpha-galactosidase cDNA from Cyamopsis tetragonoloba (guar).

    PubMed

    Overbeeke, N; Fellinger, A J; Toonen, M Y; van Wassenaar, D; Verrips, C T

    1989-11-01

    Polyadenylated mRNA was purified from the aleurone cells of Cyamopsis tetragonoloba (guar) seeds germinated for 18 h and used for the construction of a cDNA library. Clones with the alpha-galactosidase encoding gene were identified using oligo-nucleotide mixed probes based on the NH2 terminal amino acid sequence and on the sequence of an internal peptide. The nucleotide sequence of the cDNA clone showed that the enzyme is synthesized as a precursor with a 47 amino acid NH2 terminal extension. This pre-sequence most likely functions to target the protein outside the aleurone cells into the endosperm. Based upon structural features, it is proposed to divide the precursor into a pre-(signal sequence) part and a glycosylated pro-part comparable with those of the yeast mat A/alpha factor and killer factor. A comparison of the derived amino acid sequence of this alpha-galactosidase from plant origin revealed significant stretches of homology with respect to the amino acid sequences of the enzymes from Saccharomyces cerevisiae and from human origin but only to a minor extent compared with the alpha-galactosidase from Escherichia coli.

  6. Complete nucleotide sequence of wound tumor virus genomic segments encoding nonstructural polypeptides.

    PubMed

    Anzola, J V; Dall, D J; Xu, Z K; Nuss, D L

    1989-07-01

    Sequence analysis of the genomic segments which encode the five wound tumor virus nonstructural polypeptides has been completed. The complete nucleotide sequences of segments S4 (2565 bp), S6 (1700 bp), S9 (1182 bp), and S10 (1172 bp) are presented in this report, while the sequence of segment S12 (851 bp) has been described previously (T. Asamizu, D. Summers, M. B. Motika, J. V. Anzola, and D. L. Nuss, 1985, Virology 144, 398-409). Comparison of the only published sequence for another member of the genus Phytoreovirus, that of rice dwarf virus segment S10, with the combined available wound tumor virus sequence data revealed similarity with WTV segment S10: 54.9 and 30.6% at the nucleotide and amino acid level, respectively. Although wound tumor virus and rice dwarf virus differ in plant host range, tissue specificity, vector range, and disease symptom expression, the level of sequence similarity shared by the two segments suggests a common origin for these viruses. The potential use of a phytoreovirus sequence database for predicting functions of viral encoded gene products is considered.

  7. Nucleotide binding database NBDB – a collection of sequence motifs with specific protein-ligand interactions

    PubMed Central

    Zheng, Zejun; Goncearenco, Alexander; Berezovsky, Igor N.

    2016-01-01

    NBDB database describes protein motifs, elementary functional loops (EFLs) that are involved in binding of nucleotide-containing ligands and other biologically relevant cofactors/coenzymes, including ATP, AMP, ATP, GMP, GDP, GTP, CTP, PAP, PPS, FMN, FAD(H), NAD(H), NADP, cAMP, cGMP, c-di-AMP and c-di-GMP, ThPP, THD, F-420, ACO, CoA, PLP and SAM. The database is freely available online at http://nbdb.bii.a-star.edu.sg. In total, NBDB contains data on 249 motifs that work in interactions with 24 ligands. Sequence profiles of EFL motifs were derived de novo from nonredundant Uniprot proteome sequences. Conserved amino acid residues in the profiles interact specifically with distinct chemical parts of nucleotide-containing ligands, such as nitrogenous bases, phosphate groups, ribose, nicotinamide, and flavin moieties. Each EFL profile in the database is characterized by a pattern of corresponding ligand–protein interactions found in crystallized ligand–protein complexes. NBDB database helps to explore the determinants of nucleotide and cofactor binding in different protein folds and families. NBDB can also detect fragments that match to profiles of particular EFLs in the protein sequence provided by user. Comprehensive information on sequence, structures, and interactions of EFLs with ligands provides a foundation for experimental and computational efforts on design of required protein functions. PMID:26507856

  11. Linking the human cytogenetic map with nucleotide sequence: the CCAP clone set.

    PubMed

    Jang, Wonhee; Yonescu, Raluca; Knutsen, Turid; Brown, Theresa; Reppert, Tricia; Sirotkin, Karl; Schuler, Gregory D; Ried, Thomas; Kirsch, Ilan R

    2006-07-15

    We present the completed dataset and clone repository of the Cancer Chromosome Aberration Project (CCAP), an initiative developed and funded through the intramural program of the U.S. National Cancer Institute, to provide seamless linkage of human cytogenetic markers with the primary nucleotide sequence of the human genome. Spaced at 1-2 Mb intervals across the human genome, 1,339 bacterial artificial chromosome (BAC) clones have been localized to chromosomal bands through high-resolution fluorescence in situ hybridization (FISH) mapping. Of these clones, 99.8% can be positioned on the primary human genome sequence and 95% are placed at or close to their precise nucleotide starts and stops. This dataset can be studied and manipulated within generally available public Web sites. The clones are available from a commercial repository. The CCAP BAC clone set provides anchors for the interrogation of gene and sequence involvement in oncogenic and developmental disorders when the starting point is the recognition of a structural, numerical, or interstitial chromosomal aberration. This dataset also provides a current view of the quality and coherence of the available genome sequence and insight into the nucleotide and three-dimensional structures that manifest as Giemsa light and dark chromosomal banding patterns. PMID:16843097

  12. A rank-based sequence aligner with applications in phylogenetic analysis.

    PubMed

    Dinu, Liviu P; Ionescu, Radu Tudor; Tomescu, Alexandru I

    2014-01-01

    Recent tools for aligning short DNA reads have been designed to optimize the trade-off between correctness and speed. This paper introduces a method for assigning a set of short DNA reads to a reference genome, under Local Rank Distance (LRD). The rank-based aligner proposed in this work aims to improve correctness over speed. However, some indexing strategies to speed up the aligner are also investigated. The LRD aligner is improved in terms of speed by storing k-mer positions in a hash table for each read. Another improvement, which produces an approximate LRD aligner, is to consider only the positions in the reference that are likely to represent a good positional match of the read. The proposed aligner is evaluated and compared to other state-of-the-art alignment tools in several experiments. One set of experiments is conducted to determine the precision and the recall of the proposed aligner in the presence of contaminated reads. In another set of experiments, the proposed aligner is used to find the order, the family, or the species of a new (or unknown) organism, given only a set of short Next-Generation Sequencing DNA reads. The empirical results show that the aligner proposed in this work is highly accurate from a biological point of view. Compared to the other evaluated tools, the LRD aligner has the important advantage of being very accurate even for a very low base coverage. Thus, the LRD aligner can be considered a good alternative to standard alignment tools, especially when the accuracy of the aligner is of high importance. Source code and UNIX binaries of the aligner are freely available for future development and use at http://lrd.herokuapp.com/aligners. The software is implemented in C++ and Java, and is supported on UNIX and MS Windows.
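
    The indexing idea mentioned above (a hash table of k-mer positions per read, used to restrict attention to promising reference positions) can be sketched roughly as below. This is a generic seed-voting illustration under assumed names (`kmer_positions`, `candidate_offsets`) and toy sequences; it does not compute Local Rank Distance and is not the authors' implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def kmer_positions(read: str, k: int) -> Dict[str, List[int]]:
    """Map each k-mer of the read to the positions where it starts."""
    table: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(read) - k + 1):
        table[read[i:i + k]].append(i)
    return table


def candidate_offsets(reference: str, read: str, k: int) -> Counter:
    """Count, for each reference offset, how many read k-mers support it."""
    table = kmer_positions(read, k)
    votes: Counter = Counter()
    for ref_pos in range(len(reference) - k + 1):
        for read_pos in table.get(reference[ref_pos:ref_pos + k], []):
            offset = ref_pos - read_pos
            if offset >= 0:
                votes[offset] += 1
    return votes


# Toy data: the read occurs exactly once in the reference.
reference = "TTACGTACGGA"
read = "ACGGA"
votes = candidate_offsets(reference, read, k=3)
best_offset, support = votes.most_common(1)[0]
```

    Only the offsets with many supporting k-mers would then be passed on to the more expensive distance computation, which is the sense in which such an index trades a little exactness for speed.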

  13. IP-MSA: Independent order of progressive multiple sequence alignments using different substitution matrices

    NASA Astrophysics Data System (ADS)

    Boraik, Aziz Nasser; Abdullah, Rosni; Venkat, Ibrahim

    2014-12-01

    Multiple sequence alignment (MSA) is an essential process for many biological sequence analyses. Many algorithms have been developed to solve MSA, but an efficient computation method with very high accuracy is still a challenge. Progressive alignment is the most widely used approach to compute the final MSA. In this paper, we present a simple and effective progressive approach. Based on the independent-order progressive alignment proposed in QOMA, the method has been modified to align whole sequences so as to maximize the MSA score. Moreover, in order to further improve the accuracy of the method, we estimate the similarity of any pair of input sequences by using their percent identity, and based on this measure, we choose different substitution matrices during the progressive alignment. In addition, we have incorporated horizontal information into the alignment by adjusting the weights of amino acid residues based on their neighboring residues. The approach was tested on two popular benchmarks, BAliBASE 3.0 (global protein sequences) and IRMBASE 2.0 (local protein sequences). The proposed approach outperforms the original method in QOMA in terms of sum-of-pair score and column score by up to 14% and 7%, respectively.
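
    The step of choosing a substitution matrix from the percent identity of a sequence pair can be sketched as follows. The identity definition used here (matches over columns without gaps) is one common convention, and the cut-offs and matrix names are illustrative assumptions rather than the values used in IP-MSA.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


def choose_matrix(identity: float) -> str:
    """Pick a substitution matrix name; the cut-offs are illustrative assumptions."""
    if identity >= 80.0:
        return "BLOSUM80"
    if identity >= 50.0:
        return "BLOSUM62"
    return "BLOSUM45"


# Toy aligned pair: 8 gap-free columns, 7 of them identical.
pid = percent_identity("MKT-AYIAK", "MKTQAYLAK")
matrix = choose_matrix(pid)
```

    The general rationale is that closely related pairs are better scored with a matrix built from similar sequences, and divergent pairs with one built from more distant sequences.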

  14. Cloning and nucleotide sequence of the anaerobically regulated pepT gene of Salmonella typhimurium.

    PubMed Central

    Miller, C G; Miller, J L; Bagga, D A

    1991-01-01

    The anaerobically regulated pepT gene of Salmonella typhimurium has been cloned in pBR328. Strains carrying the pepT plasmid, pJG17, overproduce peptidase T by approximately 70-fold. The nucleotide sequence of a 2.5-kb region including pepT has been determined. The sequence codes for a protein of 44,855 Da, consistent with a molecular weight of approximately 46,000 for peptidase T (as determined by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and gel filtration). The N-terminal amino acid sequence of peptidase T purified from a pJG17-containing strain matches that predicted by the nucleotide sequence. A plasmid carrying an anaerobically regulated pepT::lacZ transcriptional fusion contains only 165 bp 5' to the start of translation. This region contains a sequence highly homologous to that identified in Escherichia coli as the site of action of the FNR protein, a positive regulator of anaerobic gene expression. A region of the deduced amino acid sequence of peptidase T is similar to segments of Pseudomonas carboxypeptidase G2, the E. coli peptidase encoded by the iap gene, and E. coli peptidase D. PMID:1904438

  15. Manipulating multiple sequence alignments via MaM and WebMaM

    PubMed Central

    Alkan, Can; Tüzün, Eray; Buard, Jerome; Lethiec, Franck; Eichler, Evan E.; Bailey, Jeffrey A.; Sahinalp, S. Cenk

    2005-01-01

    MaM is a software tool that processes and manipulates multiple alignments of genomic sequence. MaM computes the exact location of common repeat elements, exons and unique regions within aligned genomic sequences using a variety of user-identified programs, databases and/or tables. The program can extract subalignments corresponding to these various regions of DNA, to be analyzed independently or in conjunction with other elements of genomic DNA. Graphical displays further allow an assessment of sequence variation throughout these different regions of the aligned sequence, providing separate displays for the repeat, non-repeat and coding portions of genomic DNA. The program should facilitate the phylogenetic analysis and processing of different portions of genomic sequence as part of large-scale sequencing efforts. MaM source code is freely available for non-commercial use at ; and the web interface WebMaM is hosted at . PMID:15980474

  16. Predicting and improving the protein sequence alignment quality by support vector regression

    PubMed Central

    Lee, Minho; Jeong, Chan-seok; Kim, Dongsup

    2007-01-01

    Background: For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It is known that the alignment accuracy can vary significantly depending on the choice of alignment parameters such as gap opening penalty and gap extension penalty. Because the accuracy of sequence alignment is typically measured by comparing it with its corresponding structure alignment, there is no good way of evaluating alignment accuracy without knowing the structure of a query protein, which is obviously not available at the time of structure prediction. Moreover, there is no universal alignment parameter option that would always yield the optimal alignment. Results: In this work, we develop a method to predict the quality of the alignment between a query and a template. We train support vector regression (SVR) models to predict the MaxSub scores as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into an (n + 1)-dimensional feature vector, which is then used as an input to predict the alignment quality by the trained SVR model. Performance of our work is evaluated by various measures including the Pearson correlation coefficient between the observed and predicted MaxSub scores. The results show a high correlation coefficient of 0.945. For a pair of query and template, 48 alignments are generated by changing alignment options. Trained SVR models are then applied to predict the MaxSub scores of those and to select the best alignment option, which is chosen specifically for the query-template pair. This adaptive selection procedure results in a 7.4% improvement of MaxSub scores, compared to those obtained when the single best parameter option is used for all query-template pairs. Conclusion: The present work demonstrates that the alignment quality can be

  17. Nucleotide sequence of the hypervariable region of the human C2 gene

    SciTech Connect

    Zhu, Z.B.; Volanakis, J.V.

    1991-03-15

    It has been previously suggested that the multiallelic Bam H1/Sst I RFLPs of the human C2 gene arose through deletion/insertion of a tandemly-repeated minisatellite region. In this study the authors subcloned and sequenced the Sst I polymorphic fragment of the b haplotype of the C2 gene. This restriction fragment is 2,450 bp long and maps 1,550 bp 3' of exon 3. Its nucleotide sequence is characterized by the presence of at least 4 different repeated regions varying in size from 18 to 58 bp. One of these regions starting at position 1,413 is 48 bp long and is repeated five times. The first 3 repeats are in tandem and are separated by 72 bp from two additional tandem repeats. Sequence homology among the 5 repeats ranges between 93 and 98%. Eighty-three percent of the nucleotides of the repeated region are G or C. It seems likely that this nucleotide repeat resulted in the multiallelic RFLPs through a mechanism of unequal recombination or replication slippage.

  18. Complete nucleotide sequence and coding strategy of rice hoja blanca virus RNA4.

    PubMed

    Ramirez, B C; Lozano, I; Constantino, L M; Haenni, A L; Calvert, L A

    1993-11-01

    The complete sequence of rice hoja blanca virus (RHBV) RNA4 has been determined, based on the sequence of the corresponding cDNA clones. RNA4 consists of 1991 nucleotides with two open reading frames (ORFs). One putative ORF is located in the 5'-proximal region of the viral RNA4; it encodes a protein of predicted M(r) 20076 which corresponds to the major non-structural protein that accumulates in RHBV-infected rice plants, and which bears limited sequence identity with the helper component of tobacco vein mottling potyvirus. The other ORF is located in the 5'-proximal region of the viral complementary RNA4 and encodes a protein of predicted M(r) 32,469. Between the two ORFs is an intergenic region of 524 nucleotides, part of which can theoretically adopt a stable stem-loop structure; the 5' and 3' ends can potentially base-pair over 16 nucleotides, producing a pan-handle configuration. These characteristics are in favour of an ambisense coding strategy for RHBV RNA4. PMID:8245863

  19. Complete Nucleotide Sequence of a French Isolate of Maize rough dwarf virus, a Fijivirus Member in the Family Reoviridae

    PubMed Central

    Svanella-Dumas, L.; Marais, A.; Faure, C.; Theil, S.; Thibord, J. B.

    2016-01-01

    The complete nucleotide sequence of a French isolate of Maize rough dwarf virus (MRDV) was determined by next-generation sequencing and compared with the single available complete sequence and with the partial sequences of two additional isolates available in online databases. PMID:27445367

  20. Predicting the accuracy of multiple sequence alignment algorithms by using computational intelligent techniques

    PubMed Central

    Ortuño, Francisco M.; Valenzuela, Olga; Pomares, Hector; Rojas, Fernando; Florido, Javier P.; Urquiza, Jose M.

    2013-01-01

    Multiple sequence alignments (MSAs) have become one of the most studied approaches in bioinformatics to perform other outstanding tasks such as structure prediction, biological function analysis or next-generation sequencing. However, current MSA algorithms do not always provide consistent solutions, since alignments become increasingly difficult when dealing with low similarity sequences. As widely known, these algorithms directly depend on specific features of the sequences, causing relevant influence on the alignment accuracy. Many MSA tools have been recently designed but it is not possible to know in advance which one is the most suitable for a particular set of sequences. In this work, we analyze some of the most used algorithms presented in the bibliography and their dependences on several features. A novel intelligent algorithm based on least square support vector machine is then developed to predict how accurate each alignment could be, depending on its analyzed features. This algorithm is performed with a dataset of 2180 MSAs. The proposed system first estimates the accuracy of possible alignments. The most promising methodologies are then selected in order to align each set of sequences. Since only one selected algorithm is run, the computational time is not excessively increased. PMID:23066102

  1. Nucleotide sequence of a hop stunt viroid variant isolated from citrus growing in Taiwan.

    PubMed

    Hsu, Y H; Chen, W; Owens, R A

    1995-01-01

    The 303 nucleotide sequence of HSVd-citrus(T), a hop stunt viroid (HSVd) variant present in Etrog citron growing in Taiwan, was determined from cDNAs amplified by the polymerase chain reaction. HSVd-citrus(T) is very similar to several HSVd isolates previously recovered from citrus or cucumber, and exhibits microsequence heterogeneity at positions 154 and 181. Phylogenetic analysis using maximum parsimony grouped HSVd-citrus(T) with seven other isolates from citrus and cucumber in a large cluster of "citrus-type" isolates. A similar analysis revealed marked differences in both the extent and distribution of sequence variation among naturally occurring isolates of potato spindle tuber viroid.

  2. Nucleotide sequencing and characterization of the genes encoding benzene oxidation enzymes of Pseudomonas putida

    SciTech Connect

    Irie, S.; Doi, S.; Yorifuji, T.; Takagi, M.; Yano, K.

    1987-11-01

    The nucleotide sequence of the genes from Pseudomonas putida encoding oxidation of benzene to catechol was determined. Five open reading frames were found in the sequence. Four corresponding protein molecules were detected by a DNA-directed in vitro translation system. Escherichia coli cells containing the fragment with the four open reading frames transformed benzene to cis-benzene glycol, which is an intermediate of the oxidation of benzene to catechol. The relation between the product of each cistron and the components of the benzene oxidation enzyme system is discussed.

  3. SPARSE: quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics

    PubMed Central

    Will, Sebastian; Otto, Christina; Miladi, Milad; Möhl, Mathias; Backofen, Rolf

    2015-01-01

    Motivation: RNA-Seq experiments have revealed a multitude of novel ncRNAs. The gold standard for their analysis based on simultaneous alignment and folding suffers from extreme time complexity of O(n^6). Subsequently, numerous faster ‘Sankoff-style’ approaches have been suggested. Commonly, the performance of such methods relies on sequence-based heuristics that restrict the search space to optimal or near-optimal sequence alignments; however, the accuracy of sequence-based methods breaks down for RNAs with sequence identities below 60%. Alignment approaches like LocARNA that do not require sequence-based heuristics have been limited to high complexity (≥ quartic time). Results: Breaking this barrier, we introduce the novel Sankoff-style algorithm ‘sparsified prediction and alignment of RNAs based on their structure ensembles (SPARSE)’, which runs in quadratic time without sequence-based heuristics. To achieve this low complexity, on par with sequence alignment algorithms, SPARSE features strong sparsification based on structural properties of the RNA ensembles. Following PMcomp, SPARSE gains further speed-up from lightweight energy computation. Although all existing lightweight Sankoff-style methods restrict Sankoff’s original model by disallowing loop deletions and insertions, SPARSE transfers the Sankoff algorithm to the lightweight energy model completely for the first time. Compared with LocARNA, SPARSE achieves similar alignment and better folding quality in significantly less time (speedup: 3.7). At similar run-time, it aligns low sequence identity instances substantially more accurately than RAF, which uses sequence-based heuristics. Availability and implementation: SPARSE is freely available at http://www.bioinf.uni-freiburg.de/Software/SPARSE. Contact: backofen@informatik.uni-freiburg.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25838465

  4. Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment.

    PubMed

    Iantorno, Stefano; Gori, Kevin; Goldman, Nick; Gil, Manuel; Dessimoz, Christophe

    2014-01-01

    Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies (based on simulation, consistency, protein structure, and phylogeny) and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should base their choice of benchmark on the context of application, with a keen awareness of the assumptions underlying each benchmarking strategy.

  5. A statistical physics perspective on alignment-independent protein sequence comparison

    PubMed Central

    Chattopadhyay, Amit K.; Nasiev, Diar; Flower, Darren R.

    2015-01-01

    Motivation: Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-based similarity has performed poorly. Results: Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from ‘first passage probability distribution’ to summarize statistics of ensemble averaged amino acid propensity values. In this article, we introduce and elaborate this approach. Contact: d.r.flower@aston.ac.uk PMID:25810434

  6. Remote access to ACNUC nucleotide and protein sequence databases at PBIL.

    PubMed

    Gouy, Manolo; Delmotte, Stéphane

    2008-04-01

    The ACNUC biological sequence database system provides powerful and fast query and extraction capabilities to a variety of nucleotide and protein sequence databases. The collection of ACNUC databases served by the Pôle Bio-Informatique Lyonnais includes the EMBL, GenBank, RefSeq and UniProt nucleotide and protein sequence databases and a series of other sequence databases that support comparative genomics analyses: HOVERGEN and HOGENOM containing families of homologous protein-coding genes from vertebrate and prokaryotic genomes, respectively; Ensembl and Genome Reviews for analyses of prokaryotic and of selected eukaryotic genomes. This report describes the main features of the ACNUC system and the access to ACNUC databases from any internet-connected computer. Such access was made possible by the definition of a remote ACNUC access protocol and the implementation of Application Programming Interfaces between the C, Python and R languages and this communication protocol. Two retrieval programs for ACNUC databases, Query_win, with a graphical user interface and raa_query, with a command line interface, are also described. Altogether, these bioinformatics tools provide users with either ready-to-use means of querying remote sequence databases through a variety of selection criteria, or a simple way to endow application programs with an extensive access to these databases. Remote access to ACNUC databases is open to all and fully documented (http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html).

  7. Nucleotide sequences of immunoglobulin eta genes of chimpanzee and orangutan: DNA molecular clock and hominoid evolution

    SciTech Connect

    Sakoyama, Y.; Hong, K.J.; Byun, S.M.; Hisajima, H.; Ueda, S.; Yaoita, Y.; Hayashida, H.; Miyata, T.; Honjo, T.

    1987-02-01

    To determine the phylogenetic relationships among hominoids and the dates of their divergence, the complete nucleotide sequences of the constant region of the immunoglobulin eta-chain (C_eta1) genes from chimpanzee and orangutan have been determined. These sequences were compared with the human eta-chain constant-region sequence. A molecular clock (silent molecular clock), measured by the degree of sequence divergence at the synonymous (silent) positions of protein-encoding regions, was introduced for the present study. From the comparison of nucleotide sequences of α1-antitrypsin and β- and delta-globulin genes between humans and Old World monkeys, the silent molecular clock was calibrated: the mean evolutionary rate of silent substitution was determined to be 1.56 × 10^-9 substitutions per site per year. Using the silent molecular clock, the mean divergence dates of chimpanzee and orangutan from the human lineage were estimated as 6.4 ± 2.6 million years and 17.3 ± 4.5 million years, respectively. It was also shown that the evolutionary rate of primate genes is considerably slower than those of other mammalian genes.
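
    The clock calculation described above can be sketched in a few lines of Python. This is only an illustration under an assumption the abstract does not state: that divergence time follows the usual strict-clock relation T = K / (2r), where K is the pairwise silent divergence and r is the calibrated rate. The function names are hypothetical, and the input K is back-computed from the reported 6.4-million-year estimate rather than taken from the paper.

```python
# Sketch of a strict molecular-clock estimate.
# Assumption (not from the abstract): T = K / (2 * r), with K the number of silent
# substitutions per site separating two lineages and r the rate per site per year.

SILENT_RATE = 1.56e-9  # substitutions per site per year, as reported in the abstract


def divergence_time_years(k_silent: float, rate: float = SILENT_RATE) -> float:
    """Estimated divergence time in years for a pairwise silent distance k_silent."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return k_silent / (2.0 * rate)


def expected_divergence(time_years: float, rate: float = SILENT_RATE) -> float:
    """Inverse relation: pairwise silent divergence expected after time_years."""
    return 2.0 * rate * time_years


if __name__ == "__main__":
    # Hypothetical input: the K implied by the 6.4-million-year estimate.
    k = expected_divergence(6.4e6)
    print(f"K = {k:.4f}, T = {divergence_time_years(k) / 1e6:.1f} million years")
```

    The factor of 2 reflects substitutions accumulating along both lineages since their common ancestor; a different convention would change the numbers accordingly.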

  8. Nucleotide sequence of the p53 cDNA of beluga whale (Delphinapterus leucas).

    PubMed

    Xu, Ning; Shiraki, Takashi; Yamada, Tadasu; Nakajima, Masayuki; Gauthier, Julie M; Pfeiffer, Carl J; Sato, Shigeaki

    2002-04-17

    The cDNA (DNA complementary to RNA) of the p53 gene of the beluga whale (Delphinapterus leucas) was sequenced by the method of 5'- and 3'-rapid amplification of cDNA ends (RACE) with the cDNA made for the RNA obtained from fresh peripheral blood leukocytes isolated from two animals. Primers for the RACE method were synthesized based on the sequence of the DNA of beluga whale corresponding to exon 5 of the human p53 gene, which was determined after amplification of the DNA isolated from the liver from a beluga whale by using a pair of primers for the human sequence. The sequenced cDNA had a 2150-nucleotide length and contained the whole region corresponding to human exons 1 through 11. The reading frame was 1164 bp (base pair) long and began in exon 2 and ended in exon 11, coding for a 387-amino acid protein. The nucleotide sequence of the reading frame showed high similarity over 85% with pig, sheep, cow, and human genes. The similarities with the former two animals at the amino acid level were also more than 85%. Lower similarity of the beluga whale p53 gene was also found with those of lower tetrapods, fish and invertebrates.
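
    The reported reading-frame and protein lengths are consistent with each other if the 1164 bp count includes the stop codon, which is an assumption here rather than something the abstract says. A minimal Python check (the function name is hypothetical):

```python
def protein_length(orf_bp: int, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by an open reading frame of orf_bp base pairs.

    Assumes the frame is a whole number of codons; when includes_stop is True,
    the final codon is treated as a stop codon and does not encode a residue.
    """
    if orf_bp <= 0 or orf_bp % 3 != 0:
        raise ValueError("reading frame length must be a positive multiple of 3")
    codons = orf_bp // 3
    return codons - 1 if includes_stop else codons


if __name__ == "__main__":
    # 1164 bp is the reading-frame length reported for beluga whale p53.
    print(protein_length(1164))
```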

  9. Comparison of Sequencing Platforms for Single Nucleotide Variant Calls in a Human Sample

    PubMed Central

    Miller, Webb; Guillory, Joseph; Stinson, Jeremy; Seshagiri, Somasekar

    2013-01-01

    Next-generation sequencing platforms coupled with advanced bioinformatic tools enable re-sequencing of the human genome at high speed and with large cost savings. We compare sequencing platforms from Roche/454 (GS FLX), Illumina/HiSeq (HiSeq 2000), and Life Technologies/SOLiD (SOLiD 3 ECC) for their ability to identify single nucleotide substitutions in whole genome sequences from the same human sample. We report on significant GC-related bias observed in the data sequenced on Illumina and SOLiD platforms. The differences in the variant calls were investigated with regard to coverage and sequencing error. Some of the variants called by only one or two of the platforms were experimentally tested using mass spectrometry, a method that is independent of DNA sequencing. We establish several causes why variants remained unreported, specific to each platform. We report the indels called using the three sequencing technologies, and from the obtained results we conclude that sequencing human genomes with more than a single platform and multiple libraries is beneficial when a high level of accuracy is required. PMID:23405114

  10. Comparison of sequencing platforms for single nucleotide variant calls in a human sample.

    PubMed

    Ratan, Aakrosh; Miller, Webb; Guillory, Joseph; Stinson, Jeremy; Seshagiri, Somasekar; Schuster, Stephan C

    2013-01-01

    Next-generation sequencing platforms coupled with advanced bioinformatic tools enable re-sequencing of the human genome at high speed and with large cost savings. We compare sequencing platforms from Roche/454 (GS FLX), Illumina/HiSeq (HiSeq 2000), and Life Technologies/SOLiD (SOLiD 3 ECC) for their ability to identify single nucleotide substitutions in whole genome sequences from the same human sample. We report on significant GC-related bias observed in the data sequenced on Illumina and SOLiD platforms. The differences in the variant calls were investigated with regard to coverage and sequencing error. Some of the variants called by only one or two of the platforms were experimentally tested using mass spectrometry, a method that is independent of DNA sequencing. We establish several causes why variants remained unreported, specific to each platform. We report the indels called using the three sequencing technologies, and from the obtained results we conclude that sequencing human genomes with more than a single platform and multiple libraries is beneficial when a high level of accuracy is required.

  11. Nucleotide sequence of the SrRNA gene and phylogenetic analysis of Trichomonas tenax.

    PubMed

    Fukura, K; Yamamoto, A; Hashimoto, T; Goto, N

    1996-01-01

    The small subunit ribosomal RNA (SrRNA) gene of Trichomonas tenax ATCC30207 was amplified by PCR and the 1.55-kb product was cloned into plasmid vector pUC18. Four clones were isolated and sequenced. The insert DNAs were 1,552 bp long and their G+C contents were 48.1%; three of them had exactly the same DNA sequences and one had only one nucleotide change. A representative SrRNA sequence was analyzed and a phylogenetic tree was estimated by the neighbor-joining (NJ) method. Among the protists examined, T. tenax was placed as the closest relative of Tritrichomonas foetus, as expected from the traditional taxonomy. The total homology between the two SrRNA sequences was 89.2%.

  12. Nucleotide sequence of Crithidia fasciculata cytosol 5S ribosomal ribonucleic acid.

    PubMed

    MacKay, R M; Gray, M W; Doolittle, W F

    1980-11-11

    The complete nucleotide sequence of the cytosol 5S ribosomal ribonucleic acid of the trypanosomatid protozoan Crithidia fasciculata has been determined by a combination of T1-oligonucleotide catalog and gel sequencing techniques. The sequence is: GAGUACGACCAUACUUGAGUGAAAACACCAUAUCCCGUCCGAUUUGUGAAGUUAAGCACC CACAGGCUUAGUUAGUACUGAGGUCAGUGAUGACUCGGGAACCCUGAGUGCCGUACUCCCOH. This 5S ribosomal RNA is unique in having GAUU in place of the GAAC or GAUC found in all other prokaryotic and eukaryotic 5S RNAs, and thought to be involved in interactions with tRNAs. Comparisons to other eukaryotic cytosol 5S ribosomal RNA sequences indicate that the four major eukaryotic kingdoms (animals, plants, fungi, and protists) are about equally remote from each other, and that the latter kingdom may be the most internally diverse.

  13. Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference

    PubMed Central

    Tan, Ge; Muffato, Matthieu; Ledergerber, Christian; Herrero, Javier; Goldman, Nick; Gil, Manuel; Dessimoz, Christophe

    2015-01-01

    Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms. PMID:26031838
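
    To make the idea of alignment filtering concrete, here is a minimal Python sketch that drops columns whose gap fraction exceeds a threshold and reports the fraction of positions removed. It is an illustrative stand-in, not one of the filtering methods evaluated in the study; the function name, the gap symbol and the 0.5 threshold are assumptions.

```python
from typing import List, Tuple


def filter_gappy_columns(
    alignment: List[str], max_gap_fraction: float = 0.5
) -> Tuple[List[str], float]:
    """Remove alignment columns whose fraction of '-' exceeds max_gap_fraction.

    Returns the filtered rows and the fraction of columns that were removed.
    """
    if not alignment:
        return [], 0.0
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if length == 0:
        return list(alignment), 0.0

    n_rows = len(alignment)
    # Indices of columns that are kept because they are not too gappy.
    keep = [
        i
        for i in range(length)
        if sum(row[i] == "-" for row in alignment) / n_rows <= max_gap_fraction
    ]
    filtered = ["".join(row[i] for i in keep) for row in alignment]
    removed_fraction = 1.0 - len(keep) / length
    return filtered, removed_fraction


if __name__ == "__main__":
    rows, removed = filter_gappy_columns(["ACG-T", "A-G-T", "ACGTT"])
    print(rows, f"{removed:.0%} of columns removed")
```

    Tracking the removed fraction makes it easy to see whether a given run stays within the "light filtering" range of up to 20% of positions that the abstract mentions.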

  14. Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference.

    PubMed

    Tan, Ge; Muffato, Matthieu; Ledergerber, Christian; Herrero, Javier; Goldman, Nick; Gil, Manuel; Dessimoz, Christophe

    2015-09-01

    Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms. PMID:26031838

  15. Nucleotide sequence variation of chitin synthase genes among ectomycorrhizal fungi and its potential use in taxonomy.

    PubMed Central

    Mehmann, B; Brunner, I; Braus, G H

    1994-01-01

    DNA sequences of single-copy genes coding for chitin synthases (UDP-N-acetyl-D-glucosamine:chitin 4-beta-N-acetylglucosaminyltransferase; EC 2.4.1.16) were used to characterize ectomycorrhizal fungi. Degenerate primers deduced from short, completely conserved amino acid stretches flanking a region of about 200 amino acids of zymogenic chitin synthases allowed the amplification of DNA fragments of several members of this gene family. Different DNA band patterns were obtained from basidiomycetes because of variation in the number and length of amplified fragments. Cloning and sequencing of the most prominent DNA fragments revealed that these differences were due to various introns at conserved positions. The presence of introns in basidiomycetous fungi therefore has a potential use in identification of genera by analyzing PCR-generated DNA fragment patterns. Analyses of the nucleotide sequences of cloned fragments revealed variations in nucleotide sequences from 4 to 45%. By comparison of the deduced amino acid sequences, the majority of the DNA fragments were identified as members of genes for chitin synthase class II. The deduced amino acid sequences from species of the same genus differed only in one amino acid residue, whereas identity between the amino acid sequences of ascomycetous and basidiomycetous fungi within the same taxonomic class was found to be approximately 43 to 66%. Phylogenetic analysis of the amino acid sequence of class II chitin synthase-encoding gene fragments by using parsimony confirmed the current taxonomic groupings. In addition, our data revealed a fourth class of putative zymogenic chitin synthases. PMID:7944356

  16. The mouse collagen X gene: complete nucleotide sequence, exon structure and expression pattern.

    PubMed Central

    Elima, K; Eerola, I; Rosati, R; Metsäranta, M; Garofalo, S; Perälä, M; De Crombrugghe, B; Vuorio, E

    1993-01-01

    Overlapping genomic clones covering the 7.2 kb mouse alpha 1(X) collagen gene, 0.86 kb of promoter and 1.25 kb of 3'-flanking sequences were isolated from two genomic libraries and characterized by nucleotide sequencing. Typical features of the gene include a unique three-exon structure, similar to that in the chick gene, with the entire triple-helical domain of 463 amino acids coded by a single large exon. The highest degree of amino acid and nucleotide sequence conservation was seen in the coding region for the collagenous and C-terminal non-collagenous domains between the mouse and known chick, bovine and human collagen type X sequences. More divergence between the sequences occurred in the N-terminal non-collagenous domain. Similarity between the mammalian collagen X sequences extended into the 3'-untranslated sequence, particularly near the polyadenylation site. The promoter of the mouse collagen X gene was found to contain two TATAA boxes 159 bp apart; primer extension analyses of the transcription start site revealed that both were functional. The promoter has an unusual structure with a very low G + C content of 28% between positions -220 and -1 of the upstream transcription start site. Northern and in situ hybridization analyses confirmed that the expression of the alpha 1(X) collagen gene is restricted to hypertrophic chondrocytes in tissues undergoing endochondral calcification. The detailed sequence information of the gene is useful for studies on the promoter activity of the gene and for generation of transgenic mice. PMID:8424763

  17. SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data.

    PubMed

    Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge

    2016-01-01

    Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license. PMID:27182962

  18. BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations.

    PubMed

    Bahr, A; Thompson, J D; Thierry, J C; Poch, O

    2001-01-01

    BAliBASE is specifically designed to serve as an evaluation resource to address all the problems encountered when aligning complete sequences. The database contains high quality, manually constructed multiple sequence alignments together with detailed annotations. The alignments are all based on three-dimensional structural superpositions, with the exception of the transmembrane sequences. The first release provided sets of reference alignments dealing with the problems of high variability, unequal repartition and large N/C-terminal extensions and internal insertions. Here we describe version 2.0 of the database, which incorporates three new reference sets of alignments containing structural repeats, transmembrane sequences and circular permutations to evaluate the accuracy of detection/prediction and alignment of these complex sequences. BAliBASE can be viewed at the web site http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2/index.html or can be downloaded from ftp://ftp-igbmc.u-strasbg.fr/pub/BAliBASE2/.

  19. SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data

    PubMed Central

    2016-01-01

    Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license. PMID:27182962

  20. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark.

    PubMed

    Thompson, Julie D; Koehl, Patrice; Ripp, Raymond; Poch, Olivier

    2005-10-01

    Multiple sequence alignment is one of the cornerstones of modern molecular biology. It is used to identify conserved motifs, to determine protein domains, in 2D/3D structure prediction by homology and in evolutionary studies. Recently, high-throughput technologies such as genome sequencing and structural proteomics have led to an explosion in the amount of sequence and structure information available. In response, several new multiple alignment methods have been developed that improve both the efficiency and the quality of protein alignments. Consequently, the benchmarks used to evaluate and compare these methods must also evolve. We present here the latest release of the most widely used multiple alignment benchmark, BAliBASE, which provides high quality, manually refined, reference alignments based on 3D structural superpositions. Version 3.0 of BAliBASE includes new, more challenging test cases, representing the real problems encountered when aligning large sets of complex sequences. Using a novel, semiautomatic update protocol, the number of protein families in the benchmark has been increased and representative test cases are now available that cover most of the protein fold space. The total number of proteins in BAliBASE has also been significantly increased from 1444 to 6255 sequences. In addition, full-length sequences are now provided for all test cases, which represent difficult cases for both global and local alignment programs. Finally, the BAliBASE Web site (http://www-bio3d-igbmc.u-strasbg.fr/balibase) has been completely redesigned to provide a more user-friendly, interactive interface for the visualization of the BAliBASE reference alignments and the associated annotations.

  1. PEG-Labeled Nucleotides and Nanopore Detection for Single Molecule DNA Sequencing by Synthesis

    PubMed Central

    Kumar, Shiv; Tao, Chuanjuan; Chien, Minchen; Hellner, Brittney; Balijepalli, Arvind; Robertson, Joseph W. F.; Li, Zengmin; Russo, James J.; Reiner, Joseph E.; Kasianowicz, John J.; Ju, Jingyue

    2012-01-01

    We describe a novel single molecule nanopore-based sequencing by synthesis (Nano-SBS) strategy that can accurately distinguish four bases by detecting 4 different sized tags released from 5′-phosphate-modified nucleotides. The basic principle is as follows. As each nucleotide is incorporated into the growing DNA strand during the polymerase reaction, its tag is released and enters a nanopore in release order. This produces a unique ionic current blockade signature due to the tag's distinct chemical structure, thereby determining DNA sequence electronically at single molecule level with single base resolution. As proof of principle, we attached four different length PEG-coumarin tags to the terminal phosphate of 2′-deoxyguanosine-5′-tetraphosphate. We demonstrate efficient, accurate incorporation of the nucleotide analogs during the polymerase reaction, and excellent discrimination among the four tags based on nanopore ionic currents. This approach coupled with polymerase attached to the nanopores in an array format should yield a single-molecule electronic Nano-SBS platform. PMID:23002425

  2. Mouse Mammary Tumor Virus-Like Nucleotide Sequences in Canine and Feline Mammary Tumors

    PubMed Central

    Hsu, Wei-Li; Lin, Hsing-Yi; Chiou, Shyan-Song; Chang, Chao-Chin; Wang, Szu-Pong; Lin, Kuan-Hsun; Chulakasian, Songkhla; Wong, Min-Liang; Chang, Shih-Chieh

    2010-01-01

    Mouse mammary tumor virus (MMTV) has been speculated to be involved in human breast cancer. Companion animals such as dogs and cats, which have intimate contact with humans, may contribute to the transmission of MMTV between mouse and human. The aim of this study was to detect MMTV-like nucleotide sequences in canine and feline mammary tumors by nested PCR. Results showed that MMTV-like env and LTR sequences were present in 3.49% (3/86) and 18.60% (16/86) of canine malignant mammary tumors, respectively. For feline malignant mammary tumors, both env and LTR sequences were found in 22.22% (2/9). Nevertheless, the MMTV-like LTR and env sequences were also detected in normal mammary glands of dogs and cats. In comparisons of the MMTV-like DNA sequences of our findings to those of NIH 3T3 (MMTV-positive murine cell line) and human breast cancer cells, the sequence similarities ranged from 94 to 98%. Phylogenetic analysis revealed intermixing among sequences identified from tissues of different hosts (mouse, dog, cat, and human), indicating that MMTV-like DNA exists in these hosts. Moreover, the env transcript was detected in 1 of the 19 MMTV-positive samples by reverse transcription-PCR. Taken together, our study provides evidence for the existence and expression of MMTV-like sequences in neoplastic and normal mammary glands of dogs and cats. PMID:20881168

  3. Cloning, nucleotide sequence, and expression of the Pasteurella haemolytica A1 glycoprotease gene.

    PubMed Central

    Abdullah, K M; Lo, R Y; Mellors, A

    1991-01-01

    Pasteurella haemolytica serotype A1 secretes a glycoprotease which is specific for O-sialoglycoproteins such as glycophorin A. The gene encoding the glycoprotease enzyme has been cloned in the recombinant plasmid pPH1, and its nucleotide sequence has been determined. The gene (designated gcp) codes for a protein of 35.2 kDa, and an active enzyme protein of this molecular mass can be observed in Escherichia coli clones carrying pPH1. In vivo labeling of plasmid-encoded proteins in E. coli maxicells demonstrated the expression of a 35-kDa protein from pPH1. The amino-terminal sequence of the heterologously expressed protein corresponds to that predicted from the nucleotide sequence. The glycoprotease is a neutral metalloprotease, and the predicted amino acid sequence of the glycoprotease contains a putative zinc-binding site. The gene shows no significant homology with the genes for other proteases of procaryotic or eucaryotic origin. However, there is substantial homology between gcp and an E. coli gene, orfX, whose product is believed to function in the regulation of macromolecule biosynthesis. PMID:1885539

  4. Cloning and nucleotide sequence of the Salmonella typhimurium LT2 metF gene and its homology with the corresponding sequence of Escherichia coli.

    PubMed

    Stauffer, G V; Stauffer, L T

    1988-05-01

    The Salmonella typhimurium LT2 metF gene, encoding 5,10-methylenetetrahydrofolate reductase, has been cloned. Strains with multicopy plasmids carrying the metF gene overproduce the enzyme 44-fold. The nucleotide sequence of the metF gene was determined, and an open reading frame of 888 nucleotides was identified. The polypeptide deduced from the DNA sequence contains 296 amino acids and has a molecular weight of 33,135 daltons. Mung bean nuclease mapping experiments located the transcription start point and possible transcription termination region for the gene. There is a 25 bp nucleotide sequence between the translation termination site and the possible transcription termination region. This region possesses a GC-rich sequence that could form a stable stem and loop structure once transcribed (delta G = -9 kcal/mol), followed by an AT-rich sequence, both of which are characteristic of rho-independent transcription terminators. The nucleotide and deduced amino acid sequences of the S. typhimurium metF gene are compared with the corresponding sequences of the Escherichia coli metF gene. The nucleotide sequences show 85% homology. Most of the nucleotide differences found do not alter the amino acid sequences, which show 95% homology. The results also show that a change has occurred in the metF region of the S. typhimurium chromosome as compared to the E. coli chromosome.

  5. SINA: Accurate high-throughput multiple sequence alignment of ribosomal RNA genes

    PubMed Central

    Pruesse, Elmar; Peplies, Jörg; Glöckner, Frank Oliver

    2012-01-01

    Motivation: In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements. Results: In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands. SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1% accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks. Availability: Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license. Contact: epruesse@mpi-bremen.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22556368
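
    The k-mer searching step mentioned above can be pictured as ranking reference sequences by how many k-mers they share with the query before any alignment is attempted. The Python sketch below illustrates that general idea only; it does not reproduce SINA's actual search or its partial order alignment, and the function names, k = 4 and the toy sequences are all assumptions.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_references(
    query: str, references: Dict[str, str], k: int = 4, top: int = 2
) -> List[str]:
    """Names of the `top` references sharing the most distinct k-mers with query."""
    query_kmers = kmers(query, k)
    scored = [
        (len(query_kmers & kmers(ref_seq, k)), name)
        for name, ref_seq in references.items()
    ]
    # Highest shared-k-mer count first; ties broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top]]


if __name__ == "__main__":
    refs = {"a": "ACGUACGUAC", "b": "GGGGGGGGGG", "c": "ACGUAAAAAA"}
    print(rank_references("ACGUACGU", refs))
```

    In a pipeline of this kind, only the selected references would then be passed to the more expensive alignment stage.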

  6. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments

    DOE PAGES

    Daily, Jeffrey A.

    2016-02-10

    Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar’s ‘striped’ approach. When using only a single thread, parasail was 1.7 times faster than Rognes’s SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
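
    To make the "cell updates" unit concrete, the following is a minimal scalar sketch of Smith-Waterman local alignment scoring. It is not parasail's API and uses no SIMD; the function name, the linear gap penalty and the scoring values are assumptions chosen only for illustration.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, number of DP cells updated).

    Illustrative scalar version with a linear gap penalty (an assumption);
    it only computes the score, not the alignment itself.
    """
    best = 0
    cells = 0
    prev = [0] * (len(target) + 1)  # previous row of the DP table
    for i in range(1, len(query) + 1):
        curr = [0]  # first column is always 0 for local alignment
        for j in range(1, len(target) + 1):
            diag = prev[j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
            cells += 1
        prev = curr
    return best, cells


if __name__ == "__main__":
    score, cells = smith_waterman_score("TTACGTT", "GGACGGG")
    print(score, cells)
```

    Dividing the returned cell count by the elapsed time in seconds and by 1e9 gives a throughput in GCUPS, the unit quoted in the abstract.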

  7. Nucleotide sequence and revised map location of the arn gene from bacteriophage T4.

    PubMed

    Kim, B C; Kim, K; Park, E H; Lim, C J

    1997-10-31

    Non-glucosylated (Glu-) T-even phage DNAs are restricted by Escherichia coli RglA and RglB endonucleases with different specificities. RglB endonuclease activity is strongly inhibited by anti-restriction endonuclease (Arn) encoded by the bacteriophage T4 genome. The nucleotide sequence of the arn gene encoding Arn was determined. The product of the cloned arn gene was overexpressed by the T7 RNA polymerase/promoter system, and its molecular size is consistent with that predicted from the open reading frame of the arn gene. The arn gene is located between the asiA gene and motA gene in the region of 161,300-161,578 nucleotides.

  8. Nucleotide sequence of a cloned duck hepatitis B virus genome: comparison with woodchuck and human hepatitis B virus sequences.

    PubMed Central

    Mandart, E; Kay, A; Galibert, F

    1984-01-01

    The nucleotide sequence of an EcoRI duck hepatitis B virus (DHBV) clone was elucidated by using the Maxam and Gilbert method. This sequence, which is 3,021 nucleotides long, was compared with the two previously analyzed hepatitis B-like viruses (human and woodchuck). From this comparison, it was shown that DHBV is derived from an ancestor common to the two others but has a slightly different genomic organization. There was no intergenic region between genes 5 and 8, which were fused into a single open reading frame in DHBV. Genes for the surface and core proteins were assigned to open reading frames 7 and 5/8. Amino acid comparisons showed some structural relationship between gene 6 product and avian reverse transcriptase, suggesting either evolution from a common ancestor or convergence to some particular structure to fulfill a specific function. This should be correlated with the synthesis of an RNA intermediate during DNA replication. This is also taken as an argument in favor of the hypothesis that gene 6 codes for the DNA polymerase that is found within the virion. DNA sequence comparison also showed that the two mammalian hepatitis B viruses are more homologous to each other than they are to DHBV, indicating that DHBV started to evolve on its own earlier than the two other viruses, as did birds compared with mammals. From this it is proposed that the viruses evolved in a fashion parallel to the species they infect. PMID:6699938

  9. Support for linguistic macrofamilies from weighted sequence alignment.

    PubMed

    Jäger, Gerhard

    2015-10-13

    Computational phylogenetics is in the process of revolutionizing historical linguistics. Recent applications have shed new light on controversial issues, such as the location and time depth of language families and the dynamics of their spread. So far, these approaches have been limited to single-language families because they rely on a large body of expert cognacy judgments or grammatical classifications, which is currently unavailable for most language families. The present study pursues a different approach. Starting from raw phonetic transcription of core vocabulary items from very diverse languages, it applies weighted string alignment to track both phonetic and lexical change. Applied to a collection of ∼1,000 Eurasian languages and dialects, this method, combined with phylogenetic inference, leads to a classification in excellent agreement with established findings of historical linguistics. Furthermore, it provides strong statistical support for several putative macrofamilies contested in current historical linguistics. In particular, there is a solid signal for the Nostratic/Eurasiatic macrofamily.

  10. Support for linguistic macrofamilies from weighted sequence alignment

    PubMed Central

    Jäger, Gerhard

    2015-01-01

    Computational phylogenetics is in the process of revolutionizing historical linguistics. Recent applications have shed new light on controversial issues, such as the location and time depth of language families and the dynamics of their spread. So far, these approaches have been limited to single-language families because they rely on a large body of expert cognacy judgments or grammatical classifications, which is currently unavailable for most language families. The present study pursues a different approach. Starting from raw phonetic transcription of core vocabulary items from very diverse languages, it applies weighted string alignment to track both phonetic and lexical change. Applied to a collection of ∼1,000 Eurasian languages and dialects, this method, combined with phylogenetic inference, leads to a classification in excellent agreement with established findings of historical linguistics. Furthermore, it provides strong statistical support for several putative macrofamilies contested in current historical linguistics. In particular, there is a solid signal for the Nostratic/Eurasiatic macrofamily. PMID:26403857

  11. SARA-Coffee web server, a tool for the computation of RNA sequence and structure multiple alignments

    PubMed Central

    Di Tommaso, Paolo; Bussotti, Giovanni; Kemena, Carsten; Capriotti, Emidio; Chatzou, Maria; Prieto, Pablo; Notredame, Cedric

    2014-01-01

    This article introduces the SARA-Coffee web server; a service allowing the online computation of 3D structure based multiple RNA sequence alignments. The server makes it possible to combine sequences with and without known 3D structures. Given a set of sequences SARA-Coffee outputs a multiple sequence alignment along with a reliability index for every sequence, column and aligned residue. SARA-Coffee combines SARA, a pairwise structural RNA aligner with the R-Coffee multiple RNA aligner in a way that has been shown to improve alignment accuracy over most sequence aligners when enough structural data is available. The server can be accessed from http://tcoffee.crg.cat/apps/tcoffee/do:saracoffee. PMID:24972831

  12. SARA-Coffee web server, a tool for the computation of RNA sequence and structure multiple alignments.

    PubMed

    Di Tommaso, Paolo; Bussotti, Giovanni; Kemena, Carsten; Capriotti, Emidio; Chatzou, Maria; Prieto, Pablo; Notredame, Cedric

    2014-07-01

    This article introduces the SARA-Coffee web server; a service allowing the online computation of 3D structure based multiple RNA sequence alignments. The server makes it possible to combine sequences with and without known 3D structures. Given a set of sequences SARA-Coffee outputs a multiple sequence alignment along with a reliability index for every sequence, column and aligned residue. SARA-Coffee combines SARA, a pairwise structural RNA aligner with the R-Coffee multiple RNA aligner in a way that has been shown to improve alignment accuracy over most sequence aligners when enough structural data is available. The server can be accessed from http://tcoffee.crg.cat/apps/tcoffee/do:saracoffee.

  13. Single nucleotide polymorphisms from Theobroma cacao expressed sequence tags associated with witches' broom disease in cacao.

    PubMed

    Lima, L S; Gramacho, K P; Carels, N; Novais, R; Gaiotto, F A; Lopes, U V; Gesteira, A S; Zaidan, H A; Cascardo, J C M; Pires, J L; Micheli, F

    2009-07-14

    In order to increase the efficiency of cacao tree resistance to witches' broom disease, which is caused by Moniliophthora perniciosa (Tricholomataceae), we looked for molecular markers that could help in the selection of resistant cacao genotypes. Among the different markers useful for developing marker-assisted selection, single nucleotide polymorphisms (SNPs) constitute the most common type of sequence difference between alleles and can be easily detected by in silico analysis from expressed sequence tag libraries. We report the first detection and analysis of SNPs from cacao-M. perniciosa interaction expressed sequence tags, using bioinformatics. Selection based on analysis of these SNPs should be useful for developing cacao varieties resistant to this devastating disease.

  14. Conservation of nucleotide sequences for molecular diagnosis of Middle East respiratory syndrome coronavirus, 2015.

    PubMed

    Furuse, Yuki; Okamoto, Michiko; Oshitani, Hitoshi

    2015-11-01

    Infection due to the Middle East respiratory syndrome coronavirus (MERS-CoV) is widespread. The present study was performed to assess the protocols used for the molecular diagnosis of MERS-CoV by analyzing the nucleotide sequences of viruses detected between 2012 and 2015, including sequences from the large outbreak in eastern Asia in 2015. Although the diagnostic protocols were established only 2 years ago, mismatches between the sequences of primers/probes and viruses were found for several of the assays. Such mismatches could lead to a lower sensitivity of the assay, thereby leading to false-negative diagnosis. A slight modification in the primer design is suggested. Protocols for the molecular diagnosis of viral infections should be reviewed regularly after they are established, particularly for viruses that pose a great threat to public health such as MERS-CoV.

  15. Nucleotide-sequence of a canine oral papillomavirus containing a long noncoding region.

    PubMed

    Isegawa, N; Ohta, M; Shirasawa, H; Tokita, H; Yamaura, A; Simizu, B

    1995-07-01

    The DNA genome of a canine oral papillomavirus (COPV) was completely sequenced and found to consist of 8607 base pairs, which were the longest of all known papillomaviruses (PVs). Its organization was similar to that of other PVs except that it lacked early gene 5 (E5) and possessed a unique long noncoding region (L-NCR) between the end of the early genes and the beginning of the late genes. COPV also possessed a short noncoding region (S-NCR) which contained a putative upper regulatory region (URR), which is commonly found in PVs. The L-NCR did not show any similarity to known PV DNAs nor other DNA sequences in the GenBank database. Nucleotide sequence analysis of COPV showed that it was closely related to human papillomavirus type 1 (HPV 1) and animal PVs associated with cutaneous lesions in rabbit, European elk, deer and cow as we reported previously. PMID:21552821

  16. Comparisons of the Distribution of Nucleotides and Common Sequences in Deoxyribonucleic Acid from Selected Bacteriophages

    PubMed Central

    Skalka, A.; Hanson, P.

    1972-01-01

    Results from comparisons of deoxyribonucleic acid (DNA) from several classes of bacteriophages suggest that most phage chromosomes contain either a homogeneous distribution of nucleotides or are made up of a few, rather large segments of different guanine plus cytosine (G + C) contents which are internally homogeneous. Among those temperate phages tested, most contained segmented DNA. Comparisons of sequence similarities among segments from lambdoid phage DNA species revealed the following order in relatedness to λ: 82 (and 434) > 21 > 424 > φ80. Most common sequences are found in the highest G + C segments, which in λ contain head and tail genes. Hybridization tests with λ and 186 or P2 DNA species verified that the lambdoids and 186 and P2 belong to two distinct groups. There are fewer homologous sequences between the DNA species of coliphages λ and P2 or 186 than there are between the DNA species of coliphage λ and salmonella phage P22. PMID:4553679

  17. Nucleotide sequence of a satellite RNA associated with carrot motley dwarf in parsley and carrot.

    PubMed

    Menzel, Wulf; Maiss, Edgar; Vetten, H Josef

    2009-02-01

    Carrot motley dwarf (CMD) is known to result from a mixed infection by two viruses, the polerovirus Carrot red leaf virus and one of the umbraviruses Carrot mottle mimic virus or Carrot mottle virus. Some umbraviruses have been shown to be associated with small satellite (sat) RNAs, but none have been reported for the latter two. A CMD-affected parsley plant was used for sap transmission to test plants, that were used for dsRNA isolation. The presence of a 0.8-kbp dsRNA indicated the occurrence of a hitherto unrecognized satRNA associated with CMD. The satRNAs of the CMD isolate from parsley and an isolate from carrot have been sequenced and showed 94% sequence identity. Nucleotide sequences and putative translation products had no significant similarities to GenBank entries. To our knowledge, this is the first report of satRNAs associated with CMD.

  18. Skeleton-based human action recognition using multiple sequence alignment

    NASA Astrophysics Data System (ADS)

    Ding, Wenwen; Liu, Kai; Cheng, Fei; Zhang, Jin; Li, YunSong

    2015-05-01

    Human action recognition and analysis has been an active research topic in computer vision for many years. This paper presents a method to represent human actions based on trajectories consisting of 3D joint positions. The method first decomposes an action into a sequence of meaningful atomic actions (actionlets), and then labels the actionlets with letters of the English alphabet according to the Davies-Bouldin index value. An action can therefore be represented as a sequence of actionlet symbols, which preserves the temporal order of occurrence of the actionlets. Finally, we employ sequence comparison to classify multiple actions using a string matching algorithm (Needleman-Wunsch). The effectiveness of the proposed method is evaluated on datasets captured by commodity depth cameras. Experiments with the proposed method on three challenging 3D action datasets show promising results.
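
    The final classification step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the scoring values (match +1, mismatch -1, gap -1), the template strings and the nearest-template rule are all assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of symbol strings a and b."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def classify(query, templates):
    """Return the label of the template string that aligns best with query.

    templates maps a hypothetical action label to its actionlet string.
    """
    return max(templates, key=lambda label: needleman_wunsch_score(query, templates[label]))


if __name__ == "__main__":
    templates = {"wave": "ABCD", "kick": "DCBA"}  # made-up actionlet strings
    print(classify("ABCCD", templates))
```

    Because the alignment is global, the whole observed sequence is compared with the whole template, so the order of the actionlets contributes to the score.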

  19. Nucleotide sequence analysis of the L gene of Newcastle disease virus: homologies with Sendai and vesicular stomatitis viruses.

    PubMed Central

    Yusoff, K; Millar, N S; Chambers, P; Emmerson, P T

    1987-01-01

    The nucleotide sequence of the L gene of the Beaudette C strain of Newcastle disease virus (NDV) has been determined. The L gene is 6704 nucleotides long and encodes a protein of 2204 amino acids with a calculated molecular weight of 248822. Mung bean nuclease mapping of the 5' terminus of the L gene mRNA indicates that the transcription of the L gene is initiated 11 nucleotides upstream of the translational start site. Comparison with the amino acid sequences of the L genes of Sendai virus and vesicular stomatitis virus (VSV) suggests that there are several regions of homology between the sequences. These data provide further evidence for an evolutionary relationship between the Paramyxoviridae and the Rhabdoviridae. A non-coding sequence of 46 nucleotides downstream of the presumed polyadenylation site of the L gene may be part of a negative strand leader RNA. PMID:3035486

  20. Developing Single Nucleotide Polymorphism (SNP) markers from transcriptome sequences for the identification of longan (Dimocarpus longan) germplasm

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Longan (Dimocarpus longan Lour.) is an important tropical fruit tree crop. Accurate varietal identification is essential for germplasm management and breeding. Using longan transcriptome sequences from public databases, we developed single nucleotide polymorphism (SNP) markers; validated 60 SNPs in...

  1. Nucleotide sequences of the 3' terminal region of onion yellow dwarf virus isolates from Allium plants in Japan.

    PubMed

    Tsuneyoshi, T; Ikeda, Y; Sumi, S

    1997-01-01

    The 2032-nucleotide sequence of the 3' terminal region of onion yellow dwarf virus (OYDV) isolated from Allium wakegi, bearing the genes for viral coat protein (CP) and a truncated RNA-dependent RNA polymerase, has been determined. Respective homologies of the nucleotide sequence in the corresponding region and the deduced amino acid sequence of CP with the equivalents of leek yellow stripe virus (LYSV) from garlic were 68.0 and 59.3%. Variation in the nucleotide sequence is concentrated in the boundary region between the putative RNA-dependent RNA polymerase gene and the CP gene as well as in the 3' noncoding region. These sequence divergences, including the deletion of 79 nucleotides, resulted both in alterations to the amino acid sequence and the absence of 28 amino acid residues in the amino terminal region of OYDV CP in comparison with LYSV CP. In addition, the length of the 3' noncoding sequence of OYDV was one-third that of LYSV. Comparison of the 3' terminal 1197-nucleotide sequence of OYDV with sequences of the respective cDNAs cloned by RT-PCR directly from the total RNA of infected Allium plants that included two varieties of A. fistulosum, "Wakenegi" and "Shimonita-negi", and A. chinense, showed 90.7% overall identities, even though they have long been cultivated in locally restricted areas in Japan. These findings appear to suggest that a single strain of OYDV invaded Japanese Allium plants long ago and spread throughout them. PMID:9354273

  2. PyMod: sequence similarity searches, multiple sequence-structure alignments, and homology modeling within PyMOL

    PubMed Central

    2012-01-01

    Background: In recent years, an exponentially growing number of tools for protein sequence analysis, editing and modeling tasks have been put at the disposal of the scientific community. Although the vast majority of these tools have been released as open source software, their steep learning curves often discourage even the most experienced users. Results: A simple and intuitive interface, PyMod, between the popular molecular graphics system PyMOL and several other tools (i.e., [PSI-]BLAST, ClustalW, MUSCLE, CEalign and MODELLER) has been developed, to show how the integration of the individual steps required for homology modeling and sequence/structure analysis within the PyMOL framework can hugely simplify these tasks. Sequence similarity searches, generation and editing of multiple sequence and structural alignments, and even the possibility to merge sequence and structure alignments have been implemented in PyMod, with the aim of creating a simple, yet powerful tool for sequence and structure analysis and building of homology models. Conclusions: PyMod represents a new tool for the analysis and the manipulation of protein sequences and structures. The ease of use and the integration with many sequence retrieval and alignment tools and with PyMOL, one of the most used molecular visualization systems, are the key features of this tool. Source code, installation instructions, video tutorials and a user's guide are freely available at the URL http://schubert.bio.uniroma1.it/pymod/index.html PMID:22536966

  3. Nucleotide sequences derived from pheasant DNA in the genome of recombinant avian leukosis viruses with subgroup F specificity.

    PubMed

    Keshet, E; Temin, H M

    1977-11-01

    Recombination between viral and cellular genes can give rise to new strains of retroviruses. For example, Rous-associated virus 61 (RAV-61) is a recombinant between the Bryan high-titer strain of Rous sarcoma virus (RSV) and normal pheasant DNA. Nucleic acid hybridization techniques were used to study the genome of RAV-61 and another RAV with subgroup F specificity (RAV-F) obtained by passage of RSV-RAV-0 in cells from a ring-necked pheasant embryo. The nucleotide sequences acquired by these two independent isolates of RAV-F that were not shared with the parental virus comprised 20 to 25% of the RAV-F genomes and were indistinguishable by nucleic acid hybridization. (In addition, RAV-F genomes had another set of nucleotide sequences that were homologous to some pheasant nucleotide sequences and also were present in the parental viruses.) A specific complementary DNA, containing only nucleotide sequences complementary to those acquired by RAV-61 through recombination, was prepared. These nucleotide sequences were pheasant derived and were not present in the genomes of reticuloendotheliosis viruses, pheasant viruses, and avian leukosis-sarcoma viruses of subgroups A, B, C, D, and E. They were partially endogenous, however, to avian DNA other than pheasant. The fraction of these nucleotide sequences present in other avian DNAs generally paralleled the genetic relatedness of these avian species to pheasants. However, there was a high degree of homology between these pheasant nucleotide sequences and related nucleotide sequences in the DNA of normal chickens as indicated by the identical melting profiles of the respective hybrids.

  4. Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.

    PubMed

    Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias

    2011-01-01

    The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E. coli, Shigella and S. pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time. PMID:22254462
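
    As a rough illustration of the sorted k-mer list data structure named in this abstract, the sketch below builds such a list for a sequence and finds the k-mers shared by two sequences with a single merge pass. It is a single-process toy, not the progressiveMauve or Blue Gene/P implementation; the function names and the choice to report distinct shared k-mers are assumptions.

```python
def sorted_kmer_list(seq, k):
    """Return a sorted list of (k-mer, start position) pairs for seq."""
    return sorted((seq[i:i + k], i) for i in range(len(seq) - k + 1))


def shared_kmers(list_a, list_b):
    """Merge-walk two sorted k-mer lists and return the distinct shared k-mers."""
    i = j = 0
    shared = []
    while i < len(list_a) and j < len(list_b):
        kmer_a, kmer_b = list_a[i][0], list_b[j][0]
        if kmer_a < kmer_b:
            i += 1
        elif kmer_a > kmer_b:
            j += 1
        else:
            # Record each shared k-mer once, even if it occurs several times.
            if not shared or shared[-1] != kmer_a:
                shared.append(kmer_a)
            i += 1
            j += 1
    return shared


if __name__ == "__main__":
    a = sorted_kmer_list("ACGAC", 2)
    b = sorted_kmer_list("TTACGT", 2)
    print(shared_kmers(a, b))
```

    Keeping the lists sorted means shared k-mers, which can seed candidate anchors between genomes, are found in one linear pass instead of an all-against-all comparison.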

  5. Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.

    PubMed

    Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias

    2011-01-01

    The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E. coli, Shigella and S. pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.

  6. Nucleotide sequence of the Shiga-like toxin genes of Escherichia coli.

    PubMed Central

    Calderwood, S B; Auclair, F; Donohue-Rolfe, A; Keusch, G T; Mekalanos, J J

    1987-01-01

    We have determined the nucleotide sequence of the sltA and sltB genes that encode the Shiga-like toxin (SLT) produced by Escherichia coli phage H19B. The amino acid composition of the A and B subunits of SLT is very similar to that previously established for Shiga toxin from Shigella dysenteriae 1, and the deduced amino acid sequence of the B subunit of SLT is identical with that reported for the B subunit of Shiga toxin. The genes for the A and B subunits of SLT apparently constitute an operon, with only 12 nucleotides separating the coding regions. There is a 21-base-pair region of dyad symmetry overlapping the proposed promoter of the slt operon that may be involved in regulation of SLT production by iron. The peptide sequence of the A subunit of SLT is homologous to the A subunit of the plant toxin ricin, providing evidence for the hypothesis that certain prokaryotic toxins may be evolutionarily related to eukaryotic enzymes. PMID:3299365

  7. Genomic DNA enrichment using sequence capture microarrays: a novel approach to discover sequence nucleotide polymorphisms (SNP) in Brassica napus L.

    PubMed

    Clarke, Wayne E; Parkin, Isobel A; Gajardo, Humberto A; Gerhardt, Daniel J; Higgins, Erin; Sidebottom, Christine; Sharpe, Andrew G; Snowdon, Rod J; Federico, Maria L; Iniguez-Luy, Federico L

    2013-01-01

    Targeted genomic selection methodologies, or sequence capture, allow for DNA enrichment and large-scale resequencing and characterization of natural genetic variation in species with complex genomes, such as rapeseed canola (Brassica napus L., AACC, 2n=38). The main goal of this project was to combine sequence capture with next generation sequencing (NGS) to discover single nucleotide polymorphisms (SNPs) in specific areas of the B. napus genome historically associated (via quantitative trait loci -QTL- analysis) to traits of agronomical and nutritional importance. A 2.1 million feature sequence capture platform was designed to interrogate DNA sequence variation across 47 specific genomic regions, representing 51.2 Mb of the Brassica A and C genomes, in ten diverse rapeseed genotypes. All ten genotypes were sequenced using the 454 Life Sciences chemistry and to assess the effect of increased sequence depth, two genotypes were also sequenced using Illumina HiSeq chemistry. As a result, 589,367 potentially useful SNPs were identified. Analysis of sequence coverage indicated a four-fold increased representation of target regions, with 57% of the filtered SNPs falling within these regions. Sixty percent of discovered SNPs corresponded to transitions while 40% were transversions. Interestingly, fifty eight percent of the SNPs were found in genic regions while 42% were found in intergenic regions. Further, a high percentage of genic SNPs was found in exons (65% and 64% for the A and C genomes, respectively). Two different genotyping assays were used to validate the discovered SNPs. Validation rates ranged from 61.5% to 84% of tested SNPs, underpinning the effectiveness of this SNP discovery approach. Most importantly, the discovered SNPs were associated with agronomically important regions of the B. napus genome generating a novel data resource for research and breeding this crop species.
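
    The transition/transversion split reported above rests on a simple rule: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any purine-pyrimidine exchange is a transversion. The sketch below applies that rule to a list of (reference, alternate) allele pairs; the function names and the example alleles are assumptions and do not come from the study's data.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_snp(ref, alt):
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a valid single-base substitution: {ref}>{alt}")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine) is a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def ts_tv_fractions(snps):
    """Return the fraction of transitions and transversions among (ref, alt) pairs."""
    counts = Counter(classify_snp(ref, alt) for ref, alt in snps)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no SNPs given")
    return {kind: counts[kind] / total for kind in ("transition", "transversion")}


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "T"), ("C", "G")]  # made-up alleles
    print(ts_tv_fractions(example))
```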

  8. Nucleotide Sequences and Modifications That Determine RIG-I/RNA Binding and Signaling Activities

    PubMed Central

    Uzri, Dina; Gehrke, Lee

    2009-01-01

    Cytoplasmic viral RNAs with 5′ triphosphates (5′ppp) are detected by the RNA helicase RIG-I, initiating downstream signaling and alpha/beta interferon (IFN-α/β) expression that establish an antiviral state. We demonstrate here that the hepatitis C virus (HCV) 3′ untranslated region (UTR) RNA has greater activity as an immune stimulator than several flavivirus UTR RNAs. We confirmed that the HCV 3′-UTR poly(U/UC) region is the determinant for robust activation of RIG-I-mediated innate immune signaling and that its antisense sequence, poly(AG/A), is an equivalent RIG-I activator. The poly(U/UC) region of the fulminant HCV JFH-1 strain was a relatively weak activator, while the antisense JFH-1 strain poly(AG/A) RNA was very potent. Poly(U/UC) activity does not require primary nucleotide sequence adjacency to the 5′ppp, suggesting that RIG-I recognizes two independent RNA domains. Whereas poly(U) 50-nt or poly(A) 50-nt sequences were minimally active, inserting a single C or G nucleotide, respectively, into these RNAs increased IFN-β expression. Poly(U/UC) RNAs transcribed in vitro using modified uridine 2′ fluoro or pseudouridine ribonucleotides lacked signaling activity while functioning as competitive inhibitors of RIG-I binding and IFN-β expression. Nucleotide base and ribose modifications that convert activator RNAs into competitive inhibitors of RIG-I signaling may be useful as modulators of RIG-I-mediated innate immune responses and as tools to dissect the RNA binding and conformational events associated with signaling. PMID:19224987

  9. The nucleotide sequences of some large ribonuclease T1 products from bacteriophage R17 ribonucleic acid

    PubMed Central

    Jeppesen, Peter G. N.

    1971-01-01

    A method of `fingerprinting' high-molecular-weight 32P-labelled RNA species, using a two-dimensional thin-layer-chromatographic separation of ribonuclease T1 digestion products, has been applied to RNA from the Escherichia coli bacteriophage R17. The `fingerprinting' technique, besides giving a unique pattern that can be used as a characterization of the RNA, has made it possible to isolate a number of the larger oligonucleotides and to determine their nucleotide sequences. PMID:5158505

  10. The Complete Nucleotide Sequence of the Mitochondrial Genome of Bactrocera minax (Diptera: Tephritidae)

    PubMed Central

    Zhang, Bin; Nardi, Francesco; Hull-Sanders, Helen; Wan, Xuanwu; Liu, Yinghong

    2014-01-01

    The complete 16,043 bp mitochondrial genome (mitogenome) of Bactrocera minax (Diptera: Tephritidae) has been sequenced. The genome encodes 37 genes usually found in insect mitogenomes. The mitogenome information for B. minax was compared to the homologous sequences of Bactrocera oleae, Bactrocera tryoni, Bactrocera philippinensis, Bactrocera carambolae, Bactrocera papayae, Bactrocera dorsalis, Bactrocera correcta, Bactrocera cucurbitae and Ceratitis capitata. The analysis indicated the structure and organization are typical of, and similar to, the nine closely related species mentioned above, although it contains the lowest genome-wide A+T content (67.3%). Four short intergenic spacers with a high degree of conservation among the nine tephritid species mentioned above and B. minax were observed, which also have clear counterparts in the control regions (CRs). Correlation analysis among these ten tephritid species revealed close positive correlations among the A+T content of zero-fold degenerate sites (P0FD), the ratio of nucleotide substitution frequency at P0FD sites to all degenerate sites (zero-fold degenerate sites, two-fold degenerate sites and four-fold degenerate sites) and amino acid sequence distance (ASD). Further, significant positive correlation was observed between the A+T content of four-fold degenerate sites (P4FD) and the ratio of nucleotide substitution frequency at P4FD sites to all degenerate sites; however, we found significant negative correlation between ASD and the A+T content of P4FD, and the ratio of nucleotide substitution frequency at P4FD sites to all degenerate sites. A higher nucleotide substitution frequency at non-synonymous sites compared to synonymous sites was observed in nad4, the first time this has been observed in an insect mitogenome. A poly(T) stretch at the 5′ end of the CR followed by a [TA(A)]n-like stretch was also found. In addition, a highly conserved G+A-rich sequence block was observed in front of the

  11. The nucleotide sequence of glutamate tRNA4 of Drosophila melanogaster.

    PubMed Central

    Altwegg, M; Kubli, E

    1980-01-01

    The nucleotide sequence of Drosophila melanogaster glutamate tRNA4 was determined to be: pU-C-C-C-A-U-A-U-G-G-U-C-psi-A-G-D-G-G-C-D-A-G-G-A-U-A-U-C-U-G-G-C(m)-U-U-U-C-A-C-C-A-G-A-A-G-G-C-C-C-G-G-G-T-psi-U-C-G-A-U-U-C-C-C-G-G-U-A-U-G-G-G-A-A-C-C-AOH. A partially modified C is found at position 32 in the anticodon loop. PMID:6775307

  12. Nucleotide sequence and organization of copper resistance genes from Pseudomonas syringae pv. tomato

    SciTech Connect

    Mellano, M.A.; Cooksey, D.A.

    1988-06-01

    The nucleotide sequence of a 4.5-kilobase copper resistance determinant from Pseudomonas syringae pv. tomato revealed four open reading frames (ORFs) in the same orientation. Deletion and site-specific mutational analyses indicated that the first two ORFs were essential for copper resistance; the last two ORFs were required for full resistance, but low-level resistance could be conferred in their absence. Five highly conserved, direct 24-base repeats were found near the beginning of the second ORF, and a similar, but less conserved, repeated region was found in the middle of the first ORF.

  14. CATO: The Clone Alignment Tool.

    PubMed

    Henstock, Peter V; LaPan, Peter

    2016-01-01

    High-throughput cloning efforts produce large numbers of sequences that need to be aligned, edited, compared with reference sequences, and organized as files and selected clones. Different pieces of software are typically required to perform each of these tasks. We have designed a single piece of software, CATO, the Clone Alignment Tool, that allows a user to align, evaluate, edit, and select clone sequences based on comparisons to reference sequences. The input and output are designed to be compatible with standard data formats, and thus suitable for integration into a clone processing pipeline. CATO provides both sequence alignment and visualizations to facilitate the analysis of cloning experiments. The alignment algorithm matches each of the relevant candidate sequences against each reference sequence. The visualization portion displays three levels of matching: 1) a top-level summary of the top candidate sequences aligned to each reference sequence, 2) a focused alignment view with the nucleotides of matched sequences displayed against one reference sequence, and 3) a pair-wise alignment of a single reference and candidate sequence pair. Users can select the minimum matching criteria for valid clones, edit or swap reference sequences, and export the results to a summary file as part of the high-throughput cloning workflow. PMID:27459605

  15. CATO: The Clone Alignment Tool

    PubMed Central

    Henstock, Peter V.; LaPan, Peter

    2016-01-01

    High-throughput cloning efforts produce large numbers of sequences that need to be aligned, edited, compared with reference sequences, and organized as files and selected clones. Different pieces of software are typically required to perform each of these tasks. We have designed a single piece of software, CATO, the Clone Alignment Tool, that allows a user to align, evaluate, edit, and select clone sequences based on comparisons to reference sequences. The input and output are designed to be compatible with standard data formats, and thus suitable for integration into a clone processing pipeline. CATO provides both sequence alignment and visualizations to facilitate the analysis of cloning experiments. The alignment algorithm matches each of the relevant candidate sequences against each reference sequence. The visualization portion displays three levels of matching: 1) a top-level summary of the top candidate sequences aligned to each reference sequence, 2) a focused alignment view with the nucleotides of matched sequences displayed against one reference sequence, and 3) a pair-wise alignment of a single reference and candidate sequence pair. Users can select the minimum matching criteria for valid clones, edit or swap reference sequences, and export the results to a summary file as part of the high-throughput cloning workflow. PMID:27459605

  16. Nucleotide deletion and P addition in V(D)J recombination: a determinant role of the coding-end sequence.

    PubMed Central

    Nadel, B; Feeney, A J

    1997-01-01

    During V(D)J recombination, the coding ends to be joined are extensively modified. Those modifications, termed coding-end processing, consist of removal and addition of various numbers of nucleotides. We previously showed in vivo that coding-end processing is specific for each coding end, suggesting that specific motifs in a coding-end sequence influence nucleotide deletion and P-region formation. In this study, we created a panel of recombination substrates containing actual immunoglobulin and T-cell receptor coding-end sequences and dissected the role of each motif by comparing its processing pattern with those of variants containing minimal nucleotide changes from the original sequence. Our results demonstrate the determinant role of specific sequence motifs on coding-end processing and also the importance of the context in which they are found. We show that minimal nucleotide changes in key positions of a coding-end sequence can result in dramatic changes in the processing pattern. We propose that each coding-end sequence dictates a unique hairpin structure, the result of a particular energy conformation between nucleotides organizing the loop and the stem, and that the interplay between this structure and specific sequence motifs influences the frequency and location of nicks which open the coding-end hairpin. These findings indicate that the sequences of the coding ends determine their own processing and have a profound impact on the development of the primary B- and T-cell repertoires. PMID:9199310

  17. Nucleotide sequence of the gene for the b subunit of human factor XIII

    SciTech Connect

    Bottenus, R.E.; Ichinose, A.; Davie, E.W.

    1990-12-01

    Factor XIII (Mr 320,000) is a blood coagulation factor that stabilizes and strengthens the fibrin clot. It circulates in blood as a tetramer composed of two a subunits (Mr 75,000 each) and two b subunits (Mr 80,000 each). The b subunit consists of 641 amino acids and includes 10 tandem repeats of 60 amino acids known as GP-I structures, short consensus repeats (SCR), or sushi domains. In the present study, the human gene for the b subunit has been isolated from three different genomic libraries prepared in λ phage. Fifteen independent phage with inserts coding for the entire gene were isolated and characterized by restriction mapping, Southern blotting, and DNA sequencing. The gene was found to be 28 kilobases in length and consisted of 12 exons (I-XII) separated by 11 intervening sequences. The leader sequence was encoded by exon I, while the carboxyl-terminal region of the protein was encoded by exon XII. Exons II-XI each coded for a single sushi domain, suggesting that the gene evolved through exon shuffling and duplication. The 12 exons in the gene ranged in size from 64 to 222 base pairs, while the introns ranged in size from 87 to 9970 nucleotides and made up 92% of the gene. One nucleotide change was found in the coding region of the gene when its sequence was compared to that of the cDNA. This difference, however, did not result in a change in the amino acid sequence of the protein.

  18. Complete Nucleotide Sequence of a South African Isolate of Grapevine Fanleaf Virus and Its Associated Satellite RNA

    PubMed Central

    Lamprecht, Renate L.; Spaltman, Monique; Stephan, Dirk; Wetzel, Thierry; Burger, Johan T.

    2013-01-01

    The complete sequences of RNA1, RNA2 and satellite RNA have been determined for a South African isolate of Grapevine fanleaf virus (GFLV-SACH44). The two RNAs of GFLV-SACH44 are 7,341 nucleotides (nt) and 3,816 nt in length, respectively, and its satellite RNA (satRNA) is 1,104 nt in length, all excluding the poly(A) tail. Multiple sequence alignment of these sequences showed that GFLV-SACH44 RNA1 and RNA2 were the closest to the South African isolate, GFLV-SAPCS3 (98.2% and 98.6% nt identity, respectively), followed by the French isolate, GFLV-F13 (87.3% and 90.1% nt identity, respectively). Interestingly, the GFLV-SACH44 satRNA is more similar to three Arabis mosaic virus satRNAs (85%–87.4% nt identity) than to the satRNA of GFLV-F13 (81.8% nt identity) and was most distantly related to the satRNA of GFLV-R2 (71.0% nt identity). Full-length infectious clones of GFLV-SACH44 satRNA were constructed. The infectivity of the clones was tested with three nepovirus isolates, GFLV-NW, Arabis mosaic virus (ArMV)-NW and GFLV-SAPCS3. The clones were mechanically inoculated in Chenopodium quinoa and were infectious when co-inoculated with the two GFLV helper viruses, but not when co-inoculated with ArMV-NW. PMID:23867805
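
    The percent nucleotide identities quoted above are computed from aligned sequence pairs. The following is a minimal sketch of one common convention (identity over alignment columns where neither sequence has a gap); the abstract does not say which convention or software the authors used, so the function name and the gap handling here are assumptions made for illustration only.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns in which neither sequence has a gap.

    Assumption: both strings are rows of the same alignment (equal length).
    Other conventions (e.g. counting gap columns as mismatches) give other values.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip columns containing a gap in either row
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned pair (not real GFLV data): 7 ungapped columns, 6 of them identical.
toy_identity = percent_identity("AUGGC-UA", "AUGCCAUA")
print(f"{toy_identity:.1f}% nt identity")
```

    Because the result depends on how gap columns are treated, identities from different tools are only comparable when the same convention is used.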

  19. Simultaneous Detection of Both Single Nucleotide Variations and Copy Number Alterations by Next-Generation Sequencing in Gorlin Syndrome.

    PubMed

    Morita, Kei-ichi; Naruto, Takuya; Tanimoto, Kousuke; Yasukawa, Chisato; Oikawa, Yu; Masuda, Kiyoshi; Imoto, Issei; Inazawa, Johji; Omura, Ken; Harada, Hiroyuki

    2015-01-01

    Gorlin syndrome (GS) is an autosomal dominant disorder that predisposes affected individuals to developmental defects and tumorigenesis, and is caused mainly by heterozygous germline PTCH1 mutations. Despite exhaustive analysis, PTCH1 mutations are unidentifiable in some patients; the failure to detect mutations is presumably because the mutations occur in other causative genes or outside the analyzed regions of PTCH1, or because of copy number alterations (CNAs). In this study, we subjected a cohort of GS-affected individuals from six unrelated families to next-generation sequencing (NGS) analysis for the combined screening of causative alterations in Hedgehog signaling pathway-related genes. Specific single nucleotide variations (SNVs) of PTCH1 causing inferred amino acid changes were identified in four families (seven affected individuals), whereas CNAs within or around PTCH1 were found in two families in whom possible causative SNVs were not detected. Through a targeted resequencing of all coding exons, as well as simultaneous evaluation of copy number status using the alignment map files obtained via NGS, we found that GS phenotypes could be explained by PTCH1 mutations or deletions in all affected patients. Because it is advisable to evaluate CNAs of candidate causative genes in point mutation-negative cases, NGS methodology appears to be useful for improving molecular diagnosis through the simultaneous detection of both SNVs and CNAs in the targeted genes/regions. PMID:26544948

  20. Molecular cloning of the Clostridium botulinum structural gene encoding the type B neurotoxin and determination of its entire nucleotide sequence.

    PubMed Central

    Whelan, S M; Elmore, M J; Bodsworth, N J; Brehm, J K; Atkinson, T; Minton, N P

    1992-01-01

    DNA fragments derived from the Clostridium botulinum type A neurotoxin (BoNT/A) gene (botA) were used in DNA-DNA hybridization reactions to derive a restriction map of the region of the C. botulinum type B strain Danish chromosome encoding botB. As the one probe encoded part of the BoNT/A heavy (H) chain and the other encoded part of the light (L) chain, the position and orientation of botB relative to this map were established. The temperature at which hybridization occurred indicated that a higher degree of DNA homology occurred between the two genes in the H-chain-encoding region. By using the derived restriction map data, a 2.1-kb BglII-XbaI fragment encoding the entire BoNT/B L chain and 108 amino acids of the H chain was cloned and characterized by nucleotide sequencing. A contiguous 1.8-kb XbaI fragment encoding a further 623 amino acids of the H chain was also cloned. The 3' end of the gene was obtained by cloning a 1.6-kb fragment amplified from genomic DNA by inverse polymerase chain reaction. Translation of the nucleotide sequence derived from all three clones demonstrated that BoNT/B was composed of 1,291 amino acids. Comparative alignment of its sequence with all currently characterized BoNTs (A, C, D, and E) and tetanus toxin (TeTx) showed that a wide variation in percent homology occurred dependent on which component of the dichain was compared. Thus, the L chain of BoNT/B exhibits the greatest degree of homology (50% identity) with the TeTx L chain, whereas its H chain is most homologous (48% identity) with the BoNT/A H chain. Overall, the six neurotoxins were shown to be composed of highly conserved amino acid domains interceded with amino acid tracts exhibiting little overall similarity. In total, 68 amino acids of an average of 442 are absolutely conserved between L chains and 110 of 845 amino acids are conserved between H chains. Conservation of Trp residues (one in the L chain and nine in the H chain) was particularly striking. The most

  1. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment

    SciTech Connect

    Lawrence, C.E.; Altschul, S.F.; Boguski, M.S.; Neuwald, A.F.; Wootton, J.C.; Liu, J.S.

    1993-10-08

    A wealth of protein and DNA sequence data is being generated by genome projects and other sequencing efforts. A crucial barrier to deciphering these sequences and understanding the relations among them is the difficulty of detecting subtle local residue patterns common to multiple sequences. Such patterns frequently reflect similar molecular structures and biological properties. A mathematical definition of this "local multiple alignment" problem suitable for full computer automation has been used to develop a new and sensitive algorithm, based on the statistical method of iterative sampling. This algorithm finds an optimized local alignment model for N sequences in N-linear time, requiring only seconds on current workstations, and allows the simultaneous detection and optimization of multiple patterns and pattern repeats. The method is illustrated as applied to helix-turn-helix proteins, lipocalins, and prenyltransferases.
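
    The iterative sampling idea can be illustrated with a small site sampler: hold one sequence out, build a position-specific profile from the current motif positions in the others, then resample the held-out position in proportion to how well each window fits that profile. This is a simplified sketch and not the published algorithm; it assumes one motif occurrence per sequence, a fixed motif width, a uniform background (the original weights pattern probabilities against background probabilities), and a DNA alphabet. All names and parameters are hypothetical.

```python
import random
from collections import Counter


def gibbs_motif_sampler(seqs, width, iterations=200, alphabet="ACGT",
                        pseudocount=1.0, seed=0):
    """Return one sampled motif start per sequence (simplified site sampler).

    Assumptions: at least two sequences, every sequence is at least `width`
    long, and every character is in `alphabet`.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    rng = random.Random(seed)
    # Start from a random motif position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    total = (len(seqs) - 1) + pseudocount * len(alphabet)
    for _ in range(iterations):
        for held_out in range(len(seqs)):
            # Column-wise residue frequencies from all sequences but one.
            profile = []
            for col in range(width):
                counts = Counter(
                    seqs[n][starts[n] + col]
                    for n in range(len(seqs)) if n != held_out
                )
                profile.append(
                    {a: (counts[a] + pseudocount) / total for a in alphabet}
                )
            # Weight every candidate window of the held-out sequence.
            s = seqs[held_out]
            weights = []
            for pos in range(len(s) - width + 1):
                w = 1.0
                for col in range(width):
                    w *= profile[col][s[pos + col]]
                weights.append(w)
            # Sample (not maximise) the new start, proportional to the weights.
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


toy_seqs = ["ACGTTGACCA", "TTGACGGGTA", "CCATTGACAT"]  # toy data, shared "TTGAC"
print(gibbs_motif_sampler(toy_seqs, width=5, iterations=50, seed=1))
```

    Each sweep touches every sequence once, which is where the linear dependence on the number of sequences comes from; sampling rather than taking the best window is what lets the procedure escape poor starting positions.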

  2. Infectious hepatitis B virus from cloned DNA of known nucleotide sequence.

    PubMed Central

    Will, H; Cattaneo, R; Darai, G; Deinhardt, F; Schellekens, H; Schaller, H

    1985-01-01

    The infectivity of cloned hepatitis B viral DNA (HBV) has been tested in chimpanzees to identify a fully functional HBV genome and to assess the risk associated with its handling. Only one of two HBV DNA sequence variants tested was shown to be infectious. "Clone purified" virus of predicted nucleotide sequence was produced from the infectious HBV DNA, and the cloned viral genome was identical in structure with naturally occurring HBV. Infection could be initiated independent of whether circular monomeric or plasmid integrated dimeric forms of the viral genome were inoculated, but the infectivity of the DNA depended on liver cell transfection or intrahepatic injection. Intravenous injection of high doses of infectious HBV DNA did not induce hepatitis, suggesting that there is virtually no risk associated with routine laboratory handling of cloned HBV DNA. PMID:2983320

  3. The uteroglobin gene region: hormonal regulation, repetitive elements and complete nucleotide sequence of the gene.

    PubMed Central

    Suske, G; Wenz, M; Cato, A C; Beato, M

    1983-01-01

    Differential uteroglobin induction represents an appropriate model for the molecular analysis of the mechanism by which steroid hormones control gene expression in mammals. We have analyzed the structure and hormonal regulation of a 35 Kb region of genomic DNA in which the uteroglobin gene is located. The complete sequence of 3,700 nucleotides including the uteroglobin gene and its flanking regions has been determined, and the limits of the gene established by S1 nuclease mapping. Several regions containing repeated sequences were mapped by blot hybridization, one of which is located within the large intron in the uteroglobin gene. Analysis of the RNAs extracted from endometrium, lung and liver, after treatment with estrogen and/or progesterone shows that within the 35 Kb region, the uteroglobin gene is the only DNA segment whose transcription into stable RNA is induced by progesterone. PMID:6304644

  4. Using mitochondrial nucleotide sequences to investigate diversity and genealogical relationships within common carp (Cyprinus carpio L.).

    PubMed

    Thai, B T; Burridge, C P; Pham, T A; Austin, C M

    2005-02-01

    Direct sequencing of mitochondrial DNA (mtDNA) D-loop (745 bp) and MTATPase6/MTATPase8 (857 bp) regions was used to investigate genetic variation within common carp and develop a global genealogy of common carp strains. The D-loop region was more variable than the MTATPase6/MTATPase8 region, but given the wide distribution of carp the overall levels of sequence divergence were low. Levels of haplotype diversity varied widely among countries with Chinese, Indonesian and Vietnamese carp showing the greatest diversity whereas Japanese Koi and European carp had undetectable nucleotide variation. A genealogical analysis supports a close relationship between Vietnamese, Koi and Chinese Color carp strains and to a lesser extent, European carp. Chinese and Indonesian carp strains were the most divergent, and their relationships do not support the evolution of independent Asian and European lineages and current taxonomic treatments.

  5. Nucleotide sequence of nifD from Frankia alni strain ArI3: phylogenetic inferences.

    PubMed

    Normand, P; Gouy, M; Cournoyer, B; Simonet, P

    1992-05-01

    The complete nucleotide sequence of the nifD gene encoding the alpha subunit of component I of nitrogenase from Frankia alni strain ArI3 was determined. The coding region is 1,458 bp in length and encodes a polypeptide of 486 residues with a predicted molecular weight of 53,500. Phylogenetic inferences with 12 complete published nifD sequences were drawn using a variety of approaches. Frankia nifD clusters with proteobacteria rather than with Clostridium pasteurianum, the other Gram-positive bacterium studied. Extant eubacterial nif genes seem to have at least three distinct evolutionary origins as a result of ancient gene duplications. Within the Gram-positive bacterial phylum, functional nif genes descend from different duplicates. PMID:1584016
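
    The length figures in this abstract can be cross-checked with simple codon arithmetic. The short sketch below only restates the numbers given above; whether the 1,458 bp count includes a stop codon is not stated, so that point is left open, and the variable names are illustrative.

```python
# Figures taken from the abstract above.
CODING_BP = 1458       # length of the nifD coding region in base pairs
RESIDUES = 486         # length of the encoded polypeptide
PREDICTED_MW = 53_500  # predicted molecular weight

# Three nucleotides per codon: the coding length should divide evenly by 3.
codons, remainder = divmod(CODING_BP, 3)

# Average mass per residue implied by the reported molecular weight.
mean_residue_mass = PREDICTED_MW / RESIDUES

print(codons, remainder)            # codon count and leftover bases
print(round(mean_residue_mass, 1))  # average mass per residue
```

    The coding length is exactly three times the residue count, and the implied average of roughly 110 per residue is in the range commonly used as a rule of thumb for proteins, so the reported values are mutually consistent.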

  6. The complete nucleotide sequence and genome organization of pea streak virus (genus Carlavirus).

    PubMed

    Su, Li; Li, Zhengnan; Bernardy, Mike; Wiersma, Paul A; Cheng, Zhihui; Xiang, Yu

    2015-10-01

    Pea streak virus (PeSV) is a member of the genus Carlavirus in the family Betaflexiviridae. Here, the first complete genome sequence of PeSV was determined by deep sequencing of a cDNA library constructed from dsRNA extracted from a PeSV-infected sample and Rapid Amplification of cDNA Ends (RACE) PCR. The PeSV genome consists of 8041 nucleotides excluding the poly(A) tail and contains six open reading frames (ORFs). The putative peptide encoded by the PeSV ORF6 has an estimated molecular mass of 6.6 kDa and shows no similarity to any known proteins. This differs from typical carlaviruses, whose ORF6 encodes a 12- to 18-kDa cysteine-rich nucleic-acid-binding protein.

  7. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.

    PubMed

    Horwege, Sebastian; Lindner, Sebastian; Boden, Marcus; Hatje, Klas; Kollmar, Martin; Leimeister, Chris-André; Morgenstern, Burkhard

    2014-07-01

    In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing 'don't care' or 'wildcard' symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating for each position in the first sequence the length of the longest substring at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at 'Göttingen Bioinformatics Compute Server (GOBICS)': http://spaced.gobics.de http://kmacs.gobics.de and the source codes can be downloaded.
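
    Both ideas described above can be sketched in a few lines. The first function counts spaced words for a binary pattern in which '1' marks a match position and '0' a don't-care position; the second computes, by brute force, the longest substring starting at a given position of one sequence that occurs in the other with at most k mismatches. These are illustrative sketches only: the function names and the pattern encoding are assumptions, the brute-force search is not how the published programs work internally, and the step that turns these quantities into distances is omitted because the abstract does not give the formulas.

```python
from collections import Counter


def spaced_word_counts(seq: str, pattern: str) -> Counter:
    """Count spaced words; '1' = match position, '0' = don't-care position."""
    width = len(pattern)
    counts = Counter()
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        # Mask don't-care positions so they do not influence the word identity.
        word = "".join(c if p == "1" else "*" for c, p in zip(window, pattern))
        counts[word] += 1
    return counts


def longest_match_with_mismatches(s1: str, i: int, s2: str, k: int) -> int:
    """Longest substring of s1 starting at i found in s2 with <= k mismatches."""
    best = 0
    for j in range(len(s2)):
        mismatches = 0
        length = 0
        while i + length < len(s1) and j + length < len(s2):
            if s1[i + length] != s2[j + length]:
                if mismatches == k:
                    break  # a further mismatch would exceed the budget
                mismatches += 1
            length += 1
        best = max(best, length)
    return best


def average_match_length(s1: str, s2: str, k: int) -> float:
    """Mean of the k-mismatch match lengths over all positions of s1."""
    if not s1:
        raise ValueError("s1 must be non-empty")
    return sum(
        longest_match_with_mismatches(s1, i, s2, k) for i in range(len(s1))
    ) / len(s1)


print(spaced_word_counts("ACGTACGT", "101"))
print(longest_match_with_mismatches("ACGTACGT", 0, "ACGAACGT", 0))  # 4
print(longest_match_with_mismatches("ACGTACGT", 0, "ACGAACGT", 1))  # 8
```

    Longer average match lengths indicate more similar sequences, which is the intuition behind deriving a pairwise distance from them; allowing k mismatches keeps a single substitution from cutting a long shared region in two.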

  8. Aptaligner: automated software for aligning pseudorandom DNA X-aptamers from next-generation sequencing data.

    PubMed

    Lu, Emily; Elizondo-Riojas, Miguel-Angel; Chang, Jeffrey T; Volk, David E

    2014-06-10

    Next-generation sequencing results from bead-based aptamer libraries have demonstrated that traditional DNA/RNA alignment software is insufficient. This is particularly true for X-aptamers containing specialty bases (W, X, Y, Z, ...) that are identified by special encoding. Thus, we sought an automated program that uses the inherent design scheme of bead-based X-aptamers to create a hypothetical reference library and Markov modeling techniques to provide improved alignments. Aptaligner provides this feature as well as length error and noise level cutoff features, is parallelized to run on multiple central processing units (cores), and sorts sequences from a single chip into projects and subprojects.

  9. Unique nucleotide sequence-guided assembly of repetitive DNA parts for synthetic biology applications

    SciTech Connect

    Torella, JP; Lienert, F; Boehm, CR; Chen, JH; Way, JC; Silver, PA

    2014-08-07

    Recombination-based DNA construction methods, such as Gibson assembly, have made it possible to easily and simultaneously assemble multiple DNA parts, and they hold promise for the development and optimization of metabolic pathways and functional genetic circuits. Over time, however, these pathways and circuits have become more complex, and the increasing need for standardization and insulation of genetic parts has resulted in sequence redundancies (for example, repeated terminator and insulator sequences) that complicate recombination-based assembly. We and others have recently developed DNA assembly methods, which we refer to collectively as unique nucleotide sequence (UNS)-guided assembly, in which individual DNA parts are flanked with UNSs to facilitate the ordered, recombination-based assembly of repetitive sequences. Here we present a detailed protocol for UNS-guided assembly that enables researchers to convert multiple DNA parts into sequenced, correctly assembled constructs, or into high-quality combinatorial libraries in only 2-3 d. If the DNA parts must be generated from scratch, an additional 2-5 d are necessary. This protocol requires no specialized equipment and can easily be implemented by a student with experience in basic cloning techniques.

  10. Unique nucleotide sequence (UNS)-guided assembly of repetitive DNA parts for synthetic biology applications

    PubMed Central

    Torella, Joseph P.; Lienert, Florian; Boehm, Christian R.; Chen, Jan-Hung; Way, Jeffrey C.; Silver, Pamela A.

    2016-01-01

    Recombination-based DNA construction methods, such as Gibson assembly, have made it possible to easily and simultaneously assemble multiple DNA parts and hold promise for the development and optimization of metabolic pathways and functional genetic circuits. Over time, however, these pathways and circuits have become more complex, and the increasing need for standardization and insulation of genetic parts has resulted in sequence redundancies — for example repeated terminator and insulator sequences — that complicate recombination-based assembly. We and others have recently developed DNA assembly methods that we refer to collectively as unique nucleotide sequence (UNS)-guided assembly, in which individual DNA parts are flanked with UNSs to facilitate the ordered, recombination-based assembly of repetitive sequences. Here we present a detailed protocol for UNS-guided assembly that enables researchers to convert multiple DNA parts into sequenced, correctly-assembled constructs, or into high-quality combinatorial libraries in only 2–3 days. If the DNA parts must be generated from scratch, an additional 2–5 days are necessary. This protocol requires no specialized equipment and can easily be implemented by a student with experience in basic cloning techniques. PMID:25101822

  11. Nucleotide sequence of the DNA packaging and capsid synthesis genes of bacteriophage P2.

    PubMed Central

    Linderoth, N A; Ziermann, R; Haggård-Ljungquist, E; Christie, G E; Calendar, R

    1991-01-01

    Overlapping DNA fragments containing the DNA packaging and capsid synthesis gene region of bacteriophage P2 were cloned and sequenced. In this report we present the complete nucleotide sequence of this 6550 bp region. Each of six open reading frames found in the interval was assigned to one of the essential genes (Q, P, O, N, M and L) by correlating genetic, physical and mutational data with DNA and protein sequence information. Polypeptides predicted were: a capsid completion protein, gpL; the major capsid precursor, gpN; the presumed capsid scaffolding protein, gpO; the ATPase and proposed endonuclease subunits of terminase, gpP and gpM, respectively; and a candidate for the portal protein, gpQ. These gene and protein sequences exhibited no homology to analogous genes or proteins of other bacteriophages. Expression of gene Q in E. coli from a plasmid caused production of a Mr 39,000 Da protein that restored Qam34 growth. This sequence analysis found only genes previously known from analysis of conditional-lethal mutations. No new capsid genes were found. PMID:1837355

  12. Complete nucleotide sequence and genome organization of Pelargonium flower break virus.

    PubMed

    Rico, P; Hernández, C

    2004-03-01

    The complete nucleotide sequence of Pelargonium flower break virus (PFBV) has been determined. The genomic RNA is 3923 nucleotides (nt) long and contains five open reading frames (ORFs). The 5'-proximal ORF encodes a 27 kDa protein (p27) and terminates with an amber codon which may be read through into an in-frame p56 ORF to generate an 86 kDa protein (p86) containing the viral RNA-dependent RNA polymerase motifs. Two small ORFs, located in the central part of the viral genome, encode polypeptides of 7 (p7) and 12 kDa (p12), respectively, which are very likely involved in virus movement. Interestingly, p12 presents a leucine zipper motif that has not been previously reported in related proteins. The 3'-proximal ORF encodes a 37 kDa capsid protein (CP). The p12 ORF is in-frame with the p86 ORF and a double read-through protein of 99 kDa (p99) may be produced. Amino acid sequence comparisons revealed that the proteins encoded by ORFs 2, 3 and 4 are more similar to the corresponding gene products of Carnation mottle virus than to those of other carmoviruses, whereas the p27 and the CP show higher identity with the equivalent proteins of Saguaro cactus virus. Phylogenetic analysis conducted with the different viral products confirmed the assignment of PFBV to the genus Carmovirus. PMID:14991450

  13. Genome-wide association study reveals five nucleotide sequence variants for carcass traits in beef cattle.

    PubMed

    Kim, Y; Ryu, J; Woo, J; Kim, J B; Kim, C Y; Lee, C

    2011-08-01

    Genetic associations of nucleotide sequence variants with carcass traits in beef cattle were investigated using a genome-wide single nucleotide polymorphism (SNP) assay. Three hundred and thirteen Korean cattle were genotyped with the Illumina BovineSNP50 BeadChip, and 39,129 SNPs from 311 animals were analysed for each carcass phenotype after filtering by quality assurance. Five sequence markers were associated with one of the meat quantity or quality traits: rs109593638 on chromosome 3 with marbling score, rs109821175 on chromosome 11 and rs110862496 on chromosome 13 with backfat thickness (BFT), and rs110228023 on chromosome 6 and rs110201414 on chromosome 16 with eye muscle area (EMA) (P < 1.27 × 10^-6, Bonferroni P < 0.05). The ss96319521 SNP, located within a gene with functions of muscle development, dishevelled homolog 1 (DVL1), would be a desirable candidate marker. Individuals with genotype CC at this gene appeared to have both increased EMA and increased carcass weight. Fine-mapping would be required to refine each of the five association signals shown in the current study for future application in marker-assisted selection for genetic improvement of beef quality and quantity.
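
    The per-SNP significance threshold quoted above is consistent with a Bonferroni correction over the number of SNPs analysed. The sketch below assumes the threshold was obtained by dividing a family-wise level of 0.05 by 39,129 tests; the abstract does not spell out the calculation, so this is an inference, and the function name is illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise level at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


N_SNPS = 39_129  # SNPs analysed after quality filtering (from the abstract)
threshold = bonferroni_threshold(0.05, N_SNPS)
print(f"{threshold:.3e}")
```

    The result is about 1.278 × 10^-6, which agrees with the reported cut-off of 1.27 × 10^-6 if that value was truncated rather than rounded.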

  14. Essential nucleotide sequences and secondary structure elements of the hairpin ribozyme.

    PubMed Central

    Berzal-Herranz, A; Joseph, S; Chowrira, B M; Butcher, S E; Burke, J M

    1993-01-01

    In vitro selection experiments have been used to isolate active variants of the 50 nt hairpin catalytic RNA motif following randomization of individual ribozyme domains and intensive mutagenesis of the ribozyme-substrate complex. Active and inactive variants were characterized by sequencing, analysis of RNA cleavage activity in cis and in trans, and by substrate binding studies. Results precisely define base-pairing requirements for ribozyme helices 3 and 4, and identify eight essential nucleotides (G8, A9, A10, G21, A22, A23, A24 and C25) within the catalytic core of the ribozyme. Activity and substrate binding assays show that point mutations at these eight sites eliminate cleavage activity but do not significantly decrease substrate binding, demonstrating that these bases contribute to catalytic function. The mutation U39C has been isolated from different selection experiments as a second-site suppressor of the down mutants G21U and A43G. Assays of the U39C mutation in the wild-type ribozyme and in a variety of mutant backgrounds show that this variant is a general up mutation. Results from selection experiments involving populations totaling more than 10^10 variants are summarized, and consensus sequences including 16 essential nucleotides and a secondary structure model of four short helices, encompassing 18 bp for the ribozyme-substrate complex are derived. PMID:8508779

  15. Mapping DNA methylation by transverse current sequencing: Reduction of noise from neighboring nucleotides

    NASA Astrophysics Data System (ADS)

    Alvarez, Jose; Massey, Steven; Kalitsov, Alan; Velev, Julian

    Nanopore sequencing via transverse current has emerged as a competitive candidate for mapping DNA methylation without the need for bisulfite treatment, fluorescent tags, or PCR amplification. By eliminating the error-producing amplification step, long read lengths become feasible, which greatly simplifies the assembly process and reduces the time and cost inherent in current technologies. However, due to the large error rates of nanopore sequencing, single-base resolution has not been reached. A very important source of noise is the intrinsic structural noise in the electric signature of the nucleotide arising from the influence of neighboring nucleotides. In this work we perform calculations of the tunneling current through DNA molecules in nanopores using the non-equilibrium electron transport method within an effective multi-orbital tight-binding model derived from first-principles calculations. We develop a base-calling algorithm accounting for the correlations of the current through neighboring bases, which in principle can reduce the error rate below any desired precision. Using this method we show that we can clearly distinguish DNA methylation and other base modifications based on the reading of the tunneling current.

  16. Evidence for Balancing Selection from Nucleotide Sequence Analyses of Human G6PD

    PubMed Central

    Verrelli, Brian C.; McDonald, John H.; Argyropoulos, George; Destro-Bisol, Giovanni; Froment, Alain; Drousiotou, Anthi; Lefranc, Gerard; Helal, Ahmed N.; Loiselet, Jacques; Tishkoff, Sarah A.

    2002-01-01

    Glucose-6-phosphate dehydrogenase (G6PD) mutations that result in reduced enzyme activity have been implicated in malarial resistance and constitute one of the best examples of selection in the human genome. In the present study, we characterize the nucleotide diversity across a 5.2-kb region of G6PD in a sample of 160 Africans and 56 non-Africans, to determine how selection has shaped patterns of DNA variation at this gene. Our global sample of enzymatically normal B alleles and A, A−, and Med alleles with reduced enzyme activities reveals many previously uncharacterized silent-site polymorphisms. In comparison with the absence of amino acid divergence between human and chimpanzee G6PD sequences, we find that the number of G6PD amino acid polymorphisms in human populations is significantly high. Unlike many other G6PD-activity alleles with reduced activity, we find that the age of the A variant, which is common in Africa, may not be consistent with the recent emergence of severe malaria and therefore may have originally had a historically different adaptive function. Overall, our observations strongly support previous genotype-phenotype association studies that proposed that balancing selection maintains G6PD deficiencies within human populations. The present study demonstrates that nucleotide sequence analyses can reveal signatures of both historical and recent selection in the genome and may elucidate the impact that infectious disease has had during human evolution. PMID:12378426

  17. Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase.

    PubMed Central

    Clark, A G; Weiss, K M; Nickerson, D A; Taylor, S L; Buchanan, A; Stengård, J; Salomaa, V; Vartiainen, E; Perola, M; Boerwinkle, E; Sing, C F

    1998-01-01

    Allelic variation in 9.7 kb of genomic DNA sequence from the human lipoprotein lipase gene (LPL) was scored in 71 healthy individuals (142 chromosomes) from three populations: African Americans (24) from Jackson, MS; Finns (24) from North Karelia, Finland; and non-Hispanic Whites (23) from Rochester, MN. The sequences had a total of 88 variable sites, with a nucleotide diversity (site-specific heterozygosity) of 0.002 ± 0.001 across this 9.7-kb region. The frequency spectrum of nucleotide variation exhibited a slight excess of heterozygosity, but, in general, the data fit expectations of the infinite-sites model of mutation and genetic drift. Allele-specific PCR helped resolve linkage phases, and a total of 88 distinct haplotypes were identified. For 1,410 (64%) of the 2,211 site pairs, all four possible gametes were present in these haplotypes, reflecting a rich history of past recombination. Despite the strong evidence for recombination, extensive linkage disequilibrium was observed. The number of haplotypes generally is much greater than the number expected under the infinite-sites model, but there was sufficient multisite linkage disequilibrium to reveal two major clades, which appear to be very old. Variation in this region of LPL may depart from the variation expected under a simple, neutral model, owing to complex historical patterns of population founding, drift, selection, and recombination. These data suggest that the design and interpretation of disease-association studies may not be as straightforward as often is assumed. PMID:9683608
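
    The site-pair count in this abstract refers to what is commonly called the four-gamete test: for two biallelic sites, observing all four two-site combinations among the haplotypes suggests recombination (or recurrent mutation) between them. The sketch below is illustrative only; the function name and the toy haplotypes are assumptions, not data from the study.

```python
from itertools import combinations


def four_gamete_pairs(haplotypes):
    """Count site pairs at which all four two-site gametes are observed.

    haplotypes: list of equal-length strings over a two-letter alphabet,
    e.g. "0"/"1". Returns (pairs_with_four_gametes, total_pairs).
    """
    if not haplotypes:
        return 0, 0
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")
    hits = 0
    total = 0
    for i, j in combinations(range(n_sites), 2):
        total += 1
        # Distinct allele combinations seen at sites i and j.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            hits += 1
    return hits, total


# Toy data (hypothetical): four haplotypes scored at three biallelic sites.
toy = ["000", "011", "101", "111"]
hits, total = four_gamete_pairs(toy)
print(hits, total)
```

    As a consistency check on the reported figures, 2,211 is the number of distinct pairs among 67 sites (67 × 66 / 2), and 1,410 / 2,211 is about 0.638, in line with the quoted 64%. Which sites entered the pairwise analysis is not stated in the abstract.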

  18. Nucleotide sequence and newly formed phosphodiester bond of spontaneously ligated satellite tobacco ringspot virus RNA.

    PubMed Central

    Buzayan, J M; Hampel, A; Bruening, G

    1986-01-01

    The satellite RNA of tobacco ringspot virus (STobRV RNA) replicates and becomes encapsidated in association with tobacco ringspot virus. Previous results show that the infected tissue produces multimeric STobRV RNAs of both polarities. RNA that is complementary to encapsidated STobRV RNA, designated as having the (-) polarity, cleaves autolytically at a specific ApG bond. Purified autolysis products spontaneously join in a non-enzymic reaction. We report characteristics of this RNA ligation reaction: the terminal groups that react, the type of bond in the newly formed junction and the nucleotide sequence of the joined RNA. The nucleotide sequence of the ligated RNA shows that joining of the reacting RNAs restored an ApG bond. The junction ApG has a 3'-to-5' phosphodiester bond. Thus the net ligation reaction of STobRV (-)RNA is the precise reversal of autolysis. We discuss this new type of RNA ligation reaction and its implications for the formation of multimeric STobRV RNAs during replication. PMID:2433680

  19. Complete nucleotide sequence of the Nilaparvata lugens reovirus: a putative member of the genus Fijivirus.

    PubMed

    Nakashima, N; Koizumi, M; Watanabe, H; Noda, H

    1996-01-01

    The nucleotide sequences of all genome segments of the Nilaparvata lugens reovirus (NLRV), which is found in the brown planthopper Nilaparvata lugens, have been determined and some genes have been assigned to structural and functional proteins. The genome of NLRV consists of 28 699 nucleotides and contains at least 11 large open reading frames (ORFs). The genome of NLRV is the largest among viruses of the family Reoviridae reported to date. The deduced amino acid sequence of genome segment S1 contained the major motifs of RNA polymerase and that of S7 had the purine NTP-binding motif. Based on the molecular masses of the deduced proteins and the particle structure of NLRV, segments S1, S3 and S7 were assigned to the 160, 140 and 75 kDa proteins, respectively, that are located in the inner core. It was deduced that S2 codes for the 135 kDa protein (B spike), which is located on the surface of the inner core. Most reported ORFs of rice black streaked dwarf virus (RBSDV), which shares many properties with NLRV, had similarities with the corresponding ORFs of NLRV. An exception was S7 ORF2, which is found in RBSDV but not NLRV and may therefore be involved in multiplication of RBSDV in rice plants. These results and our previous observations indicate that NLRV should be classified in the genus Fijivirus.

  20. Mulan: Multiple-Sequence Local Alignment and Visualization for Studying Function and Evolution

    SciTech Connect

    Ovcharenko, I; Loots, G; Giardine, B; Hou, M; Ma, J; Hardison, R; Stubbs, L; Miller, W

    2004-07-14

    Multiple sequence alignment analysis is a powerful approach for understanding phylogenetic relationships, annotating genes and detecting functional regulatory elements. With a growing number of partly or fully sequenced vertebrate genomes, effective tools for performing multiple comparisons are required to accurately and efficiently assist biological discoveries. Here we introduce Mulan (http://mulan.dcode.org/), a novel method and a network server for comparing multiple draft and finished-quality sequences to identify functional elements conserved over evolutionary time. Mulan brings together several novel algorithms: the tba multi-aligner program for rapid identification of local sequence conservation and the multiTF program for detecting evolutionarily conserved transcription factor binding sites in multiple alignments. In addition, Mulan supports two-way communication with the GALA database; alignments of multiple species dynamically generated in GALA can be viewed in Mulan, and conserved transcription factor binding sites identified with Mulan/multiTF can be integrated and overlaid with extensive genome annotation data using GALA. Local multiple alignments computed by Mulan ensure reliable representation of short- and large-scale genomic rearrangements in distant organisms. Mulan allows for interactive modification of critical conservation parameters to differentially predict conserved regions in comparisons of both closely and distantly related species. We illustrate the uses and applications of the Mulan tool through multi-species comparisons of the GATA3 gene locus and the identification of elements that are conserved differently in avians than in other genomes, allowing speculation on the evolution of birds. Source code for the aligners and the aligner-evaluation software can be freely downloaded from http://bio.cse.psu.edu/.

  1. DINAMO: a coupled sequence alignment editor/molecular graphics tool for interactive homology modeling of proteins.

    PubMed

    Hansen, M; Bentz, J; Baucom, A; Gregoret, L

    1998-01-01

    Gaining functional information about a novel protein is a universal problem in biomedical research. With the explosive growth of the protein sequence and structural databases, it is becoming increasingly common for researchers to attempt to build a three-dimensional model of their protein of interest in order to gain information about its structure and interactions with other molecules. The two most reliable methods for predicting the structure of a protein are homology modeling, in which the novel sequence is modeled on the known three-dimensional structure of a related protein, and fold recognition (threading), where the sequence is scored against a library of fold models, and the highest scoring model is selected. The sequence alignment to a known structure can be ambiguous, and human intervention is often required to optimize the model. We describe an interactive model building and assessment tool in which a sequence alignment editor is dynamically coupled to a molecular graphics display. By means of a set of assessment tools, the user may optimize his or her alignment to satisfy the known heuristics of protein structure. Adjustments to the sequence alignment made by the user are reflected in the displayed model by color and other visual cues. For instance, residues are colored by hydrophobicity in both the three-dimensional model and in the sequence alignment. This aids the user in identifying undesirable buried polar residues. Several different evaluation metrics may be selected including residue conservation, residue properties, and visualization of predicted secondary structure. These characteristics may be mapped to the model both singly and in combination. DINAMO is a Java-based tool that may be run either over the web or installed locally. Its modular architecture also allows Java-literate users to add plug-ins of their own design.

  2. Characterization, nucleotide sequence, and conserved genomic locations of insertion sequence ISRm5 in Rhizobium meliloti.

    PubMed Central

    Laberge, S; Middleton, A T; Wheatcroft, R

    1995-01-01

    A target for ISRm3 transposition in Rhizobium meliloti IZ450 is another insertion sequence element, named ISRm5. ISRm5 is 1,340 bp in length and possesses terminal inverted repeats of unequal lengths (27 and 28 bp) that contain five mismatches. An open reading frame that spans 89% of the length of one DNA strand encodes a putative transposase with significant similarity to the putative transposases of 11 insertion sequence elements from diverse bacterial species, including ISRm3 from R. meliloti. Multiple copies and variants of ISRm5 occur in the R. meliloti genome, often in close association with ISRm3. Five ISRm5 copies in two strains were studied, and each was found to be located between 8-bp direct repeats. At two of these loci, which were shown to be highly conserved in R. meliloti, the copies of ISRm5 were found to be associated with pairs of short inverted repeats resembling transcription terminators. This structural arrangement not only may provide a conserved niche for ISRm5 but also may be a preferred target for transposition. PMID:7768811

  3. SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction.

    PubMed

    Hagopian, Raffi; Davidson, John R; Datta, Ruchira S; Samad, Bushra; Jarvis, Glen R; Sjölander, Kimmen

    2010-07-01

    We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment. Results on a benchmark dataset of 983 structurally aligned pairs from the PREFAB benchmark dataset show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, Multiple Alignment using Fast Fourier Transform (MAFFT), ClustalW and the original SATCHMO algorithm. The SATCHMO-JS webserver is available at http://phylogenomics.berkeley.edu/satchmo-js. The datasets used in these experiments are available for download at http://phylogenomics.berkeley.edu/satchmo-js/supplementary/.

  4. Remarkable similarity in genome nucleotide sequences between the Schwarz FF-8 and AIK-C measles virus vaccine strains and apparent nucleotide differences in the phosphoprotein gene.

    PubMed

    Ito, Chie; Ohgimoto, Shinji; Kato, Seiichi; Sharma, Luna Bhatta; Ayata, Minoru; Komase, Katsuhiro; Takeuchi, Kaoru; Ihara, Toshiaki; Ogura, Hisashi

    2011-07-01

    The Schwarz FF-8 (FF-8) and AIK-C measles virus vaccine strains are currently used for vaccination in Japan. Here, the complete genome nucleotide sequence of the FF-8 strain has been determined and its genome sequence found to be remarkably similar to that of the AIK-C strain. These two strains are differentiated only by two nucleotide differences in the phosphoprotein gene. Since the FF-8 strain does not possess the amino acid substitutions in the phospho- and fusion proteins which are responsible for the temperature-sensitivity and small syncytium formation phenotypes of the AIK-C strain, respectively, other unidentified common mechanisms likely attenuate both the FF-8 and AIK-C strains.

  5. The bioinformatics of nucleotide sequence coding for proteins requiring metal coenzymes and proteins embedded with metals

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Cheung, E.; Holden, T.; Sullivan, R.; Nguyen, A.; Lieberman, D.; Cheung, T.

    2015-09-01

    All metallo-proteins need post-translation metal incorporation. In fact, the isotope ratio of Fe, Cu, and Zn in physiology and oncology have emerged as an important tool. The nickel containing F430 is the prosthetic group of the enzyme methyl coenzyme M reductase which catalyzes the release of methane in the final step of methano-genesis, a prime energy metabolism candidate for life exploration space mission in the solar system. The 3.5 Gyr early life sulfite reductase as a life switch energy metabolism had Fe-Mo clusters. The nitrogenase for nitrogen fixation 3 billion years ago had Mo. The early life arsenite oxidase needed for anoxygenic photosynthesis energy metabolism 2.8 billion years ago had Mo and Fe. The selection pressure in metal incorporation inside a protein would be quantifiable in terms of the related nucleotide sequence complexity with fractal dimension and entropy values. Simulation model showed that the studied metal-required energy metabolism sequences had at least ten times more selection pressure relatively in comparison to the horizontal transferred sequences in Mealybug, guided by the outcome histogram of the correlation R-sq values. The metal energy metabolism sequence group was compared to the circadian clock KaiC sequence group using magnesium atomic level bond shifting mechanism in the protein, and the simulation model would suggest a much higher selection pressure for the energy life switch sequence group. The possibility of using Kepler 444 as an example of ancient life in Galaxy with the associated exoplanets has been proposed and is further discussed in this report. Examples of arsenic metal bonding shift probed by Synchrotron-based X-ray spectroscopy data and Zn controlled FOXP2 regulated pathways in human and chimp brain studied tissue samples are studied in relationship to the sequence bioinformatics. The analysis results suggest that relatively large metal bonding shift amount is associated with low probability correlation R

  6. ESPERR: learning strong and weak signals in genomic sequence alignments to identify functional elements.

    PubMed

    Taylor, James; Tyekucheva, Svitlana; King, David C; Hardison, Ross C; Miller, Webb; Chiaromonte, Francesca

    2006-12-01

    Genomic sequence signals, such as base composition, presence of particular motifs, or evolutionary constraint, have been used effectively to identify functional elements. However, approaches based only on specific signals known to correlate with function can be quite limiting. When training data are available, application of computational learning algorithms to multispecies alignments has the potential to capture broader and more informative sequence and evolutionary patterns that better characterize a class of elements. However, effective exploitation of patterns in multispecies alignments is impeded by the vast number of possible alignment columns and by a limited understanding of which particular strings of columns may characterize a given class. We have developed a computational method, called ESPERR (evolutionary and sequence pattern extraction through reduced representations), which uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR produces a greatly improved Regulatory Potential score, which can discriminate regulatory regions from neutral sites with excellent accuracy (approximately 94%). This score captures strong signals (GC content and conservation), as well as subtler signals (with small contributions from many different alignment patterns) that characterize the regulatory elements in our training set. ESPERR is also effective for predicting other classes of functional elements, as we show for DNaseI hypersensitive sites and highly conserved regions with developmental enhancer activity. Our software, training data, and genome-wide predictions are available from our Web site (http://www.bx.psu.edu/projects/esperr).

  7. Spial: analysis of subtype-specific features in multiple sequence alignments of proteins

    PubMed Central

    Wuster, Arthur; Venkatakrishnan, A. J.; Schertler, Gebhard F. X.; Babu, M. Madan

    2010-01-01

    Motivation: Spial (Specificity in alignments) is a tool for the comparative analysis of two alignments of evolutionarily related sequences that differ in their function, such as two receptor subtypes. It highlights functionally important residues that are either specific to one of the two alignments or conserved across both alignments. It permits visualization of this information in three complementary ways: by colour-coding alignment positions, by sequence logos and optionally by colour-coding the residues of a protein structure provided by the user. This can aid in the detection of residues that are involved in the subtype-specific interaction with a ligand, other proteins or nucleic acids. Spial may also be used to detect residues that may be post-translationally modified in one of the two sets of sequences. Availability: http://www.mrc-lmb.cam.ac.uk/genomes/spial/; supplementary information is available at http://www.mrc-lmb.cam.ac.uk/genomes/spial/help.html Contact: ajv@mrc-lmb.cam.ac.uk PMID:20880955

  8. Targeted capture enrichment and sequencing identifies extensive nucleotide variation in the turkey MHC-B.

    PubMed

    Reed, Kent M; Mendoza, Kristelle M; Settlage, Robert E

    2016-03-01

    Variation in the major histocompatibility complex (MHC) is increasingly associated with disease susceptibility and resistance in avian species of agricultural importance. This variation includes sequence polymorphisms but also structural differences (gene rearrangement) and copy number variation (CNV). The MHC has now been described for multiple galliform species including the best defined assemblies of the chicken (Gallus gallus) and domestic turkey (Meleagris gallopavo). Using this sequence resource, this study applied high-throughput sequencing to investigate MHC variation in turkeys of North America (NA turkeys). An MHC-specific SureSelect (Agilent) capture array was developed, and libraries were created for 14 turkeys representing domestic (commercial bred), heritage breed, and wild turkeys. In addition, a representative of the Ocellated turkey (M. ocellata) and chicken (G. gallus) was included to test cross-species applicability of the capture array allowing for identification of new species-specific polymorphisms. Libraries were hybridized to ∼12 K cRNA baits and the resulting pools were sequenced. On average, 98% of processed reads mapped to the turkey whole genome sequence and 53% to the MHC target. In addition to the MHC, capture hybridization recovered sequences corresponding to other MHC regions. Sequence alignment and de novo assembly indicated the presence of several additional BG genes in the turkey with evidence for CNV. Variant detection identified an average of 2245 polymorphisms per individual for the NA turkeys, 3012 for the Ocellated turkey, and 462 variants in the chicken (RJF-256). This study provides an extensive sequence resource for examining MHC variation and its relation to health of this agriculturally important group of birds.

  9. Multiple sequence alignment with arbitrary gap costs: computing an optimal solution using polyhedral combinatorics.

    PubMed

    Althaus, Ernst; Caprara, Alberto; Lenhof, Hans-Peter; Reinert, Knut

    2002-01-01

    Multiple sequence alignment is one of the dominant problems in computational molecular biology. Numerous scoring functions and methods have been proposed, most of which result in NP-hard problems. In this paper we propose for the first time a general formulation for multiple alignment with arbitrary gap costs based on an integer linear program (ILP). In addition, we describe a branch-and-cut algorithm to effectively solve the ILP to optimality. We evaluate the performance of our approach in terms of running time and quality of the alignments using the BAliBase database of reference alignments. The results show that our implementation ranks amongst the best programs developed so far.

  10. Multiple sequence alignment algorithm based on a dispersion graph and ant colony algorithm.

    PubMed

    Chen, Weiyang; Liao, Bo; Zhu, Wen; Xiang, Xuyu

    2009-10-01

    In this article, we describe a representation of the multiple sequence alignment (MSA) process and use it to solve the MSA problem. With this representation, we take every possible alignment into account by defining the representation of gap insertion, the value of the heuristic information on every optional path, and the scoring rule. On the basis of the proposed multidimensional graph, we use the ant colony algorithm to find a better path, which denotes a better alignment. We present instances of three-dimensional and four-dimensional graphs and put forward a special ichnographic representation for analyzing MSA. The software is still experimental, and we give an example of finding the best alignment using a three-dimensional graph and the ant colony algorithm. Experimental results show that our method can improve the solution quality on MSA benchmarks. PMID:19130503

  11. ALVIS: interactive non-aggregative visualization and explorative analysis of multiple sequence alignments.

    PubMed

    Schwarz, Roland F; Tamuri, Asif U; Kultys, Marek; King, James; Godwin, James; Florescu, Ana M; Schultz, Jörg; Goldman, Nick

    2016-05-01

    Sequence Logos and its variants are the most commonly used method for visualization of multiple sequence alignments (MSAs) and sequence motifs. They provide consensus-based summaries of the sequences in the alignment. Consequently, individual sequences cannot be identified in the visualization and covariant sites are not easily discernible. We recently proposed Sequence Bundles, a motif visualization technique that maintains a one-to-one relationship between sequences and their graphical representation and visualizes covariant sites. We here present Alvis, an open-source platform for the joint explorative analysis of MSAs and phylogenetic trees, employing Sequence Bundles as its main visualization method. Alvis combines the power of the visualization method with an interactive toolkit allowing detection of covariant sites, annotation of trees with synapomorphies and homoplasies, and motif detection. It also offers numerical analysis functionality, such as dimension reduction and classification. Alvis is user-friendly, highly customizable and can export results in publication-quality figures. It is available as a full-featured standalone version (http://www.bitbucket.org/rfs/alvis) and its Sequence Bundles visualization module is further available as a web application (http://science-practice.com/projects/sequence-bundles). PMID:26819408

  13. ALVIS: interactive non-aggregative visualization and explorative analysis of multiple sequence alignments

    PubMed Central

    Schwarz, Roland F.; Tamuri, Asif U.; Kultys, Marek; King, James; Godwin, James; Florescu, Ana M.; Schultz, Jörg; Goldman, Nick

    2016-01-01

    Sequence Logos and its variants are the most commonly used method for visualization of multiple sequence alignments (MSAs) and sequence motifs. They provide consensus-based summaries of the sequences in the alignment. Consequently, individual sequences cannot be identified in the visualization and covariant sites are not easily discernible. We recently proposed Sequence Bundles, a motif visualization technique that maintains a one-to-one relationship between sequences and their graphical representation and visualizes covariant sites. We here present Alvis, an open-source platform for the joint explorative analysis of MSAs and phylogenetic trees, employing Sequence Bundles as its main visualization method. Alvis combines the power of the visualization method with an interactive toolkit allowing detection of covariant sites, annotation of trees with synapomorphies and homoplasies, and motif detection. It also offers numerical analysis functionality, such as dimension reduction and classification. Alvis is user-friendly, highly customizable and can export results in publication-quality figures. It is available as a full-featured standalone version (http://www.bitbucket.org/rfs/alvis) and its Sequence Bundles visualization module is further available as a web application (http://science-practice.com/projects/sequence-bundles). PMID:26819408

  14. araB Gene and nucleotide sequence of the araC gene of Erwinia carotovora.

    PubMed Central

    Lei, S P; Lin, H C; Heffernan, L; Wilcox, G

    1985-01-01

    The araB and araC genes of Erwinia carotovora were expressed in Escherichia coli and Salmonella typhimurium. The araB and araC genes in E. coli, E. carotovora, and S. typhimurium were transcribed in divergent directions. In E. carotovora, the araB and araC genes were separated by 3.5 kilobase pairs, whereas in E. coli and S. typhimurium they were separated by 147 base pairs. The nucleotide sequence of the E. carotovora araC gene was determined. The predicted sequence of AraC protein of E. carotovora was 18 and 29 amino acids longer than that of AraC protein of E. coli and S. typhimurium, respectively. The DNA sequence of the araC gene of E. carotovora was 58% homologous to that of E. coli and 59% homologous to that of S. typhimurium, with respect to the common region they share. The predicted amino acid sequence of AraC protein was 57% homologous to that of E. coli and 58% homologous to that of S. typhimurium. The 5' noncoding regions of the araB and araC genes of E. carotovora had little homology to either of the other two species. PMID:3902795
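
    The "percent homologous" figures in this abstract read as percent identity over the region the sequences share. A minimal sketch of one common way to compute such a figure from a pairwise alignment is given below. The function name, the choice to skip gapped columns, and the toy sequences are all assumptions; the abstract does not describe how the authors computed their values.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of identical positions between two aligned sequences.

    Columns in which either sequence has a gap are ignored, so the value
    is taken over the region both sequences share.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip columns outside the common region
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy alignment (hypothetical): 5 ungapped columns, 4 of them identical.
identity = percent_identity("ATGC-A", "ATGGTA")
print(identity)
```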

  15. Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences.

    PubMed

    Irizarry, K; Kustanovich, V; Li, C; Brown, N; Nelson, S; Wong, W; Lee, C J

    2000-10-01

    Single-nucleotide polymorphisms (SNPs) have been explored as a high-resolution marker set for accelerating the mapping of disease genes. Here we report 48,196 candidate SNPs detected by statistical analysis of human expressed sequence tags (ESTs), associated primarily with coding regions of genes. We used Bayesian inference to weigh evidence for true polymorphism versus sequencing error, misalignment or ambiguity, misclustering or chimaeric EST sequences, assessing data such as raw chromatogram height, sharpness, overlap and spacing, sequencing error rates, context-sensitivity and cDNA library origin. Three separate validations (comparison with 54 genes screened for SNPs independently, verification of HLA-A polymorphisms, and restriction fragment length polymorphism (RFLP) testing) verified 70%, 89% and 71% of our predicted SNPs, respectively. Our method detects tenfold more true HLA-A SNPs than previous analyses of the EST data. We found SNPs in a large fraction of known disease genes, including some disease-causing mutations (for example, the HbS sickle-cell mutation). Our comprehensive analysis of human coding region polymorphism provides a public resource for mapping of disease genes (available at http://www.bioinformatics.ucla.edu/snp).

  16. Cloning and nucleotide sequence of the gene coding for citrate synthase from a thermotolerant Bacillus sp.

    PubMed Central

    Schendel, F J; August, P R; Anderson, C R; Hanson, R S; Flickinger, M C

    1992-01-01

    The structural gene coding for citrate synthase from the gram-positive soil isolate Bacillus sp. strain C4 (ATCC 55182) capable of secreting acetic acid at pH 5.0 to 7.0 in the presence of dolime has been cloned from a genomic library by complementation of an Escherichia coli auxotrophic mutant lacking citrate synthase. The nucleotide sequence of the entire 3.1-kb HindIII fragment has been determined, and one major open reading frame was found coding for citrate synthase (ctsA). Citrate synthase from Bacillus sp. strain C4 was found to be a dimer (Mr, 84,500) with a subunit with an Mr of 42,000. The N-terminal sequence was found to be identical with that predicted from the gene sequence. The kinetics were best fit to a bisubstrate enzyme with an ordered mechanism. Bacillus sp. strain C4 citrate synthase was not activated by potassium chloride and was not inhibited by NADH, ATP, ADP, or AMP at levels up to 1 mM. The predicted amino acid sequence was compared with that of the E. coli, Acinetobacter anitratum, Pseudomonas aeruginosa, Rickettsia prowazekii, porcine heart, and Saccharomyces cerevisiae cytoplasmic and mitochondrial enzymes. PMID:1311544

  17. Cloning and genomic nucleotide sequence of the matrix attachment region binding protein from the halotolerant alga Dunaliella salina.

    PubMed

    Wang, Peng-Ju; Wang, Tian-Yun; Wang, Ya-Feng; Yang, Rui; Li, Zhao-Xi

    2013-07-01

    In our previous study, the sequence of a matrix attachment region binding protein (MBP) cDNA was cloned from the unicellular green alga Dunaliella salina. However, the nucleotide sequence of this gene has not been reported so far. In this paper, the nucleotide sequence of MBP was cloned and characterized, and its gene copy number was determined. The MBP nucleotide sequence is 5641 bp long, and interrupted by 12 introns ranging from 132 to 562 bp. All the introns in the D. salina MBP gene have orthodox splice sites, exhibiting GT at the 5' end and AG at the 3' end. Southern blot analysis showed that MBP only has one copy in the D. salina genome. PMID:22961592
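
    The GT...AG rule for intron boundaries mentioned above is easy to check programmatically. The sketch below uses a hypothetical helper and made-up example sequences; it is not the authors' code and uses none of the actual D. salina MBP intron sequences.

```python
def has_canonical_splice_sites(intron):
    """Return True if an intron starts with GT (5' end) and ends with AG (3' end)."""
    seq = intron.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


if __name__ == "__main__":
    # Made-up intron sequences, for illustration only.
    examples = ["GTAAGTCTTTCCAG", "GCAAGTCTTTCCAG", "gtaagtctttccag"]
    for intron in examples:
        print(intron, has_canonical_splice_sites(intron))
```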

  19. Complete nucleotide sequences of two isolates of cherry green ring mottle virus from peach (Prunus persica) in China.

    PubMed

    Wang, Lihui; Jiang, Dongmei; Niu, Feiqing; Lu, Meiguang; Wang, Hongqing; Li, Shifang

    2013-03-01

    Two complete nucleotide sequences of cherry green ring mottle virus (CGRMV) isolated from peach in Hebei (Hs10) and Fujian (F9) Provinces, China, were determined. Five open reading frames (ORFs) were found in the genomes of both isolates. The F9 and Hs10 isolates shared 82.2 % and 83.4-94.4 % nucleotide sequence identity, respectively, with two CGRMV isolates from cherry. Analysis of the nucleotide and amino acid sequences from the five ORFs of both isolates showed that Hs10 shares the greatest sequence identity with P1A (GenBank AJ291761) from cherry. Phylogenetic analysis indicated that CGRMV isolates from peach and cherry are closely related to members of the genus Foveavirus.

  20. DNAAlignEditor: DNA alignment editor tool

    PubMed Central

    Sanchez-Villeda, Hector; Schroeder, Steven; Flint-Garcia, Sherry; Guill, Katherine E; Yamasaki, Masanori; McMullen, Michael D

    2008-01-01

    Background: With advances in DNA re-sequencing methods and Next-Generation parallel sequencing approaches, there has been a large increase in genomic efforts to define and analyze the sequence variability present among individuals within a species. For very polymorphic species such as maize, this has led to a need for intuitive, user-friendly software that aids the biologist, often with naïve programming capability, in tracking, editing, displaying, and exporting multiple individual sequence alignments. To fill this need we have developed a novel DNA alignment editor. Results: We have generated a nucleotide sequence alignment editor (DNAAlignEditor) that provides an intuitive, user-friendly interface for manual editing of multiple sequence alignments with functions for input, editing, and output of sequence alignments. The color-coding of nucleotide identity and the display of associated quality scores aid in the manual alignment editing process. DNAAlignEditor works as a client/server tool having two main components: a relational database that collects the processed alignments and a user interface connected to the database through universal data access connectivity drivers. DNAAlignEditor can be used either as a stand-alone application or as a network application with multiple users concurrently connected. Conclusion: We anticipate that this software will be of general interest to biologists and population geneticists in editing DNA sequence alignments and analyzing natural sequence variation regardless of species, and will be particularly useful for manual alignment editing of sequences in species with high levels of polymorphism. PMID:18366684

  1. Assessing Activity Pattern Similarity with Multidimensional Sequence Alignment based on a Multiobjective Optimization Evolutionary Algorithm

    PubMed Central

    Kwan, Mei-Po; Xiao, Ningchuan; Ding, Guoxiang

    2015-01-01

    Due to the complexity and multidimensional characteristics of human activities, assessing the similarity of human activity patterns and classifying individuals with similar patterns remains highly challenging. This paper presents a new and unique methodology for evaluating the similarity among individual activity patterns. It conceptualizes multidimensional sequence alignment (MDSA) as a multiobjective optimization problem, and solves this problem with an evolutionary algorithm. The study utilizes sequence alignment to code multiple facets of human activities into multidimensional sequences, and to treat similarity assessment as a multiobjective optimization problem that aims to minimize the alignment cost for all dimensions simultaneously. A multiobjective optimization evolutionary algorithm (MOEA) is used to generate a diverse set of optimal or near-optimal alignment solutions. Evolutionary operators are specifically designed for this problem, and a local search method also is incorporated to improve the search ability of the algorithm. We demonstrate the effectiveness of our method by comparing it with a popular existing method called ClustalG using a set of 50 sequences. The results indicate that our method outperforms the existing method for most of our selected cases. The multiobjective evolutionary algorithm presented in this paper provides an effective approach for assessing activity pattern similarity, and a foundation for identifying distinctive groups of individuals with similar activity patterns. PMID:26190858
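
    Treating the alignment costs of all dimensions as simultaneous objectives means that candidate alignments are compared by Pareto dominance rather than by a single score. The sketch below shows only that comparison step, on made-up cost vectors; the function names are hypothetical, and the evolutionary operators and local search of the MOEA described above are not reproduced.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b in every dimension and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(costs):
    """Keep the cost vectors that no other vector dominates (lower cost is better)."""
    return [c for c in costs if not any(dominates(other, c) for other in costs)]


if __name__ == "__main__":
    # Each tuple is (cost in dimension 1, cost in dimension 2) for one
    # candidate alignment; the values are invented for illustration.
    candidates = [(3, 5), (4, 4), (5, 3), (5, 5), (6, 6)]
    print(pareto_front(candidates))
```

    An evolutionary algorithm would repeatedly generate new candidates and retain a diverse non-dominated set of this kind.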

  2. Single nucleotide polymorphism analysis of Korean native chickens using next generation sequencing data.

    PubMed

    Seo, Dong-Won; Oh, Jae-Don; Jin, Shil; Song, Ki-Duk; Park, Hee-Bok; Heo, Kang-Nyeong; Shin, Younhee; Jung, Myunghee; Park, Junhyung; Jo, Cheorun; Lee, Hak-Kyo; Lee, Jun-Heon

    2015-02-01

    There are five native chicken lines in Korea, which are mainly classified by plumage colors (black, white, red, yellow, gray). These five lines are very important genetic resources in the Korean poultry industry. Based on a next generation sequencing technology, whole genome sequence and reference assemblies were performed using Gallus_gallus_4.0 (NCBI) with whole genome sequences from these lines to identify common and novel single nucleotide polymorphisms (SNPs). We obtained 36,660,731,136 ± 1,257,159,120 bp of raw sequence and average 26.6-fold of 25-29 billion reference assembly sequences representing 97.288 % coverage. Also, 4,006,068 ± 97,534 SNPs were observed from 29 autosomes and the Z chromosome and, of these, 752,309 SNPs are the common SNPs across lines. Among the identified SNPs, the number of novel- and known-location assigned SNPs was 1,047,951 ± 14,956 and 2,948,648 ± 81,414, respectively. The number of unassigned known SNPs was 1,181 ± 150 and unassigned novel SNPs was 8,238 ± 1,019. Synonymous SNPs, non-synonymous SNPs, and SNPs having character changes were 26,266 ± 1,456, 11,467 ± 604, 8,180 ± 458, respectively. Overall, 443,048 ± 26,389 SNPs in each bird were identified by comparing with dbSNP in NCBI. The presently obtained genome sequence and SNP information in Korean native chickens have wide applications for further genome studies such as genetic diversity studies to detect causative mutations for economic and disease related traits.

  3. Sequence comparison alignment-free approach based on suffix tree and L-words frequency.

    PubMed

    Soares, Inês; Goios, Ana; Amorim, António

    2012-01-01

    The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions on evolutionary history and is sometimes very difficult or impossible to perform due to the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts with the computation of a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words with a preset length L (L-words) in each sequence is rapidly calculated. Based on the L-words frequency profile of each sequence, a pairwise standard Euclidean distance is then computed, producing a symmetric genetic distance matrix, which can be used to generate a neighbor joining dendrogram or a multidimensional scaling graph. We present an improvement to word counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures with the word counting tasks. Our approach is, thus, a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python language and is freely available on the web.
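
    The L-word profile and distance matrix described above can be sketched in a few lines of Python. This is not the published implementation: it counts words with a plain sliding window instead of a generalized suffix tree, it assumes that "frequency" means relative frequency, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def lword_profile(seq, L):
    """Relative frequency of every word of length L occurring in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than L
        return {}
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(p, q):
    """Standard Euclidean distance between two L-word profiles."""
    words = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))


def distance_matrix(seqs, L):
    """Pairwise distance matrix over a list of sequences."""
    profiles = [lword_profile(s, L) for s in seqs]
    return [[euclidean_distance(a, b) for b in profiles] for a in profiles]


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    for row in distance_matrix(["ACGTACGT", "ACGTACGA", "TTTTTTTT"], L=2):
        print(["%.3f" % d for d in row])
```

    The resulting matrix is the kind of input that a neighbor joining or multidimensional scaling routine would take.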

  4. Manipulating multiple sequence alignments via MaM and WebMaM.

    PubMed

    Alkan, Can; Tüzün, Eray; Buard, Jerome; Lethiec, Franck; Eichler, Evan E; Bailey, Jeffrey A; Sahinalp, S Cenk

    2005-07-01

    MaM is a software tool that processes and manipulates multiple alignments of genomic sequence. MaM computes the exact location of common repeat elements, exons and unique regions within aligned genomics sequences using a variety of user identified programs, databases and/or tables. The program can extract subalignments, corresponding to these various regions of DNA to be analyzed independently or in conjunction with other elements of genomic DNA. Graphical displays further allow an assessment of sequence variation throughout these different regions of the aligned sequence, providing separate displays for their repeat, non-repeat and coding portions of genomic DNA. The program should facilitate the phylogenetic analysis and processing of different portions of genomic sequence as part of large-scale sequencing efforts. MaM source code is freely available for non-commercial use at http://compbio.cs.sfu.ca/MAM.htm; and the web interface WebMaM is hosted at http://atgc.lirmm.fr/mam.

  5. Analysis of the genome sequence of the pathogenic Muscovy duck parvovirus strain YY reveals a 14-nucleotide-pair deletion in the inverted terminal repeats.

    PubMed

    Wang, Jianye; Huang, Yu; Zhou, Mingxu; Zhu, Guoqiang

    2016-09-01

    Genomic information about Muscovy duck parvovirus is still limited. In this study, the genome of the pathogenic MDPV strain YY was sequenced. The full-length genome of YY is 5075 nucleotides (nt) long, 57 nt shorter than that of strain FM. Sequence alignment indicates that the 5' and 3' inverted terminal repeats (ITR) of strain YY contain a 14-nucleotide-pair deletion in the stem of the palindromic hairpin structure in comparison to strain FM and FZ91-30. The deleted region contains one "E-box" site and one repeated motif with the sequence "TTCCGGT" or "ACCGGAA". Phylogenetic trees constructed based on the protein-coding genes concordantly showed that YY, together with nine other MDPV isolates from various places, clustered in a separate branch, distinct from the branch formed by goose parvovirus (GPV) strains. These results demonstrate that, despite the distinctive deletion, the YY strain still belongs to the classical MDPV group. Moreover, the deletion of ITR may contribute to the genome evolution of MDPV under immunization pressure. PMID:27344160

  7. Plastid sequence evolution: a new pattern of nucleotide substitutions in the Cucurbitaceae.

    PubMed

    Decker-Walters, Deena S; Chung, Sang-Min; Staub, Jack E

    2004-05-01

    Nucleotide substitutions (i.e., point mutations) are the primary driving force in generating DNA variation upon which selection can act. Substitutions called transitions, which entail exchanges between purines (A = adenine, G = guanine) or pyrimidines (C = cytosine, T = thymine), typically outnumber transversions (e.g., exchanges between a purine and a pyrimidine) in a DNA strand. With an increasing number of plant studies revealing a transversion rather than transition bias, we chose to perform a detailed substitution analysis for the plant family Cucurbitaceae using data from several short plastid DNA sequences. We generated a phylogenetic tree for 19 taxa of the tribe Benincaseae and related genera and then scored conservative substitution changes (e.g., those not exhibiting homoplasy or reversals) from the unambiguous branches of the tree. Neither the transition nor (A+T)/(G+C) biases found in previous studies were supported by our overall data. More importantly, we found a novel and symmetrical substitution bias in which Gs had been preferentially replaced by A, As by C, Cs by T, and Ts by G, resulting in the G-->A-->C-->T-->G substitution series. Understanding this pattern will lead to new hypotheses concerning plastid evolution, which in turn will affect the choices of substitution models and other tree-building algorithms for phylogenetic analyses based on nucleotide data.
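
    The transition/transversion distinction defined above reduces to checking whether both bases fall in the same chemical class. The following sketch (hypothetical function name, DNA bases only) classifies each step of the reported substitution series.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base DNA substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # Same class (purine<->purine or pyrimidine<->pyrimidine) is a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    # The steps of the series reported in the abstract.
    for ref, alt in [("G", "A"), ("A", "C"), ("C", "T"), ("T", "G")]:
        print(ref, "->", alt, classify_substitution(ref, alt))
```

    By this classification the reported series alternates between transitions (G to A, C to T) and transversions (A to C, T to G).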

  8. Nucleotide Sequence Analyses and Predicted Coding of Bunyavirus Genome RNA Species

    PubMed Central

    Clerx-van Haaster, Corrie M.; Akashi, Hiroomi; Auperin, David D.; Bishop, David H. L.

    1982-01-01

    We performed 3′ RNA sequence analyses of [32P]pCp-end-labeled La Crosse (LAC) virus, alternate LAC virus isolate L74, and snowshoe hare bunyavirus large (L), medium (M), and small (S) negative-stranded viral RNA species to determine the coding capabilities of these species. These analyses were confirmed by dideoxy primer extension studies in which we used a synthetic oligodeoxynucleotide primer complementary to the conserved 3′-terminal decanucleotide of the three viral RNA species (Clerx-van Haaster and Bishop, Virology 105:564-574, 1980). The deduced sequences predicted translation of two S-RNA gene products that were read in overlapping reading frames. So far, only single contiguous open reading frames have been identified for the viral M- and L-RNA species. For the negative-stranded M-RNA species of all three viruses, the single reading frame developed from the first 3′-proximal UAC triplet. Likewise, for the L-RNA of the alternate LAC isolate, a single open reading frame developed from the first 3′-proximal UAC triplet. The corresponding L-RNA sequences of prototype LAC and snowshoe hare viruses initiated open reading frames; however, for both viral L-RNA species there was a preceding 3′-proximal UAC triplet in another reading frame that was followed shortly afterward by a termination codon. A comparison of the sequence data obtained for snowshoe hare virus, LAC virus, and the alternate LAC virus isolate showed that the identified nucleotide substitutions were sufficient to account for some of the fingerprint differences in the L-, M-, and S-RNA species of the three viruses. Unlike the distribution of the L- and M-RNA substitutions, significantly fewer nucleotide substitutions occurred after the initial UAC triplet of the S-RNA species than before this triplet, implying that the overlapping genes of the S RNA provided a constraint against evolution by point mutation. The comparative sequence analyses predicted amino acid differences among the

  9. Guanine nucleotide-binding proteins that enhance choleragen ADP-ribosyltransferase activity: nucleotide and deduced amino acid sequence of an ADP-ribosylation factor cDNA.

    PubMed Central

    Price, S R; Nightingale, M; Tsai, S C; Williamson, K C; Adamik, R; Chen, H C; Moss, J; Vaughan, M

    1988-01-01

    Three (two soluble and one membrane) guanine nucleotide-binding proteins (G proteins) that enhance ADP-ribosylation of the Gs alpha stimulatory subunit of the adenylyl cyclase (EC 4.6.1.1) complex by choleragen have recently been purified from bovine brain. To further define the structure and function of these ADP-ribosylation factors (ARFs), we isolated a cDNA clone (lambda ARF2B) from a bovine retinal library by screening with a mixed heptadecanucleotide probe whose sequence was based on the partial amino acid sequence of one of the soluble ARFs from bovine brain. Comparison of the deduced amino acid sequence of lambda ARF2B with sequences of peptides from the ARF protein (total of 60 amino acids) revealed only two differences. Whether these are cloning artifacts or reflect the existence of more than one ARF protein remains to be determined. Deduced amino acid sequences of ARF, Go alpha (the alpha subunit of a G protein that may be involved in regulation of ion fluxes), and c-Ha-ras gene product p21 show similarities in regions believed to be involved in guanine nucleotide binding and GTP hydrolysis. ARF apparently lacks a site analogous to that ADP-ribosylated by choleragen in G-protein alpha subunits. Although both the ARF proteins and the alpha subunits bind guanine nucleotides and serve as choleragen substrates, they must interact with the toxin A1 peptide in different ways. In addition to serving as an ADP-ribose acceptor, ARF interacts with the toxin in a manner that modifies its catalytic properties. PMID:3135549

  10. The complete nucleotide sequence of a new bipartite begomovirus from Brazil infecting Abutilon.

    PubMed

    Paprotka, T; Metzler, V; Jeske, H

    2010-05-01

    The complete nucleotide sequence of Abutilon mosaic Brazil virus (AbMBV), a new bipartite begomovirus from Bahia, Brazil, is described and analyzed phylogenetically. Its DNA A is most closely related to those of Sida-infecting begomoviruses from Brazil and forms a phylogenetic cluster with pepper- and Euphorbia-infecting begomoviruses from Central America. The DNA B component forms a cluster with different Sida- and okra-infecting begomoviruses from Brazil. Both components are distinct from those of the classical Abutilon mosaic virus originating from the West Indies. AbMBV is transmissible to Nicotiana benthamiana and Malva parviflora by biolistics of rolling-circle amplification products and induces characteristic mosaic and vein-clearing symptoms in M. parviflora.

  11. High-Throughput Sequencing Reveals Single Nucleotide Variants in Longer-Kernel Bread Wheat

    PubMed Central

    Chen, Feng; Zhu, Zibo; Zhou, Xiaobian; Yan, Yan; Dong, Zhongdong; Cui, Dangqun

    2016-01-01

    The transcriptomes of bread wheat Yunong 201 and its ethyl methanesulfonate derivative Yunong 3114 were obtained by next-generation sequencing technology. Single nucleotide variants (SNVs) in the wheat strains were explored and compared. A total of 5907 and 6287 non-synonymous SNVs were acquired for Yunong 201 and 3114, respectively. A total of 4021 genes with SNVs were obtained. The genes that underwent non-synonymous SNVs were significantly involved in ATP binding, protein phosphorylation, and cellular protein metabolic process. The heat map analysis also indicated that most of these mutant genes were significantly differentially expressed at different developmental stages. The SNVs in these genes possibly contribute to the longer kernel length of Yunong 3114. Our data provide useful information on wheat transcriptome for future studies on wheat functional genomics. This study could also help in illustrating the gene functions of the non-synonymous SNVs of Yunong 201 and 3114. PMID:27551288

  12. Developing single nucleotide polymorphism (SNP) markers from transcriptome sequences for identification of longan (Dimocarpus longan) germplasm

    PubMed Central

    Wang, Boyi; Tan, Hua-Wei; Fang, Wanping; Meinhardt, Lyndel W; Mischke, Sue; Matsumoto, Tracie; Zhang, Dapeng

    2015-01-01

    Longan (Dimocarpus longan Lour.) is an important tropical fruit tree crop. Accurate varietal identification is essential for germplasm management and breeding. Using longan transcriptome sequences from public databases, we developed single nucleotide polymorphism (SNP) markers; validated 60 SNPs in 50 longan germplasm accessions, including cultivated varieties and wild germplasm; and designated 25 SNP markers that unambiguously identified all tested longan varieties with high statistical rigor (P<0.0001). Multiple trees from the same clone were verified and off-type trees were identified. Diversity analysis revealed genetic relationships among analyzed accessions. Cultivated varieties differed significantly from wild populations (Fst=0.300; P<0.001), demonstrating untapped genetic diversity for germplasm conservation and utilization. Within cultivated varieties, apparent differences between varieties from China and those from Thailand and Hawaii indicated geographic patterns of genetic differentiation. These SNP markers provide a powerful tool to manage longan genetic resources and breeding, with accurate and efficient genotype identification. PMID:26504559

  15. Regulatory regions of two transport operons under nitrogen control: nucleotide sequences.

    PubMed Central

    Higgins, C F; Ames, G F

    1982-01-01

    We have determined the nucleotide sequences of the regulatory regions from two amino acid transport operons from Salmonella typhimurium: dhuA, which regulates the histidine transport operon, and argTr, which regulates argT, the gene encoding the lysine-arginine-ornithine-binding protein, LAO. The promoter for the histidine transport operon has been identified from the sequence change in the promoter-up mutation dhuA1. Neither regulatory region has any of the features typical of the regulatory regions of the amino acid biosynthetic operons, indicating that regulation of at least these transport genes does not involve a transcription attenuation mechanism. We have identified three interesting features, present in both of these sequences, which may be of importance in the regulation of these and other operons: a "stem-loop-foot" structure, a region of specific homology, and a mirror symmetry. The region of mirror symmetry may be a protein recognition site important in regulating expression of these and other operons in response to nitrogen availability. Mirror symmetry as a structure for DNA-protein interaction sites has not been proposed previously. PMID:7041112

  16. High-throughput nucleotide sequence analysis of diverse bacterial communities in leachates of decomposing pig carcasses

    PubMed Central

    Yang, Seung Hak; Lim, Joung Soo; Khan, Modabber Ahmed; Kim, Bong Soo; Choi, Dong Yoon; Lee, Eun Young; Ahn, Hee Kwon

    2015-01-01

    The leachate generated by the decomposition of animal carcasses has been implicated as an environmental contaminant surrounding the burial site. High-throughput nucleotide sequencing was conducted to investigate the bacterial communities in leachates from the decomposition of pig carcasses. We acquired 51,230 reads from six different samples (1, 2, 3, 4, 6 and 14 week-old carcasses) and found that sequences representing the phylum Firmicutes predominated. The diversity of bacterial 16S rRNA gene sequences in the leachate was the highest at 6 weeks, in contrast to those at 2 and 14 weeks. The relative abundance of Firmicutes was reduced, while the proportion of Bacteroidetes and Proteobacteria increased from 3 to 6 weeks. The representation of phyla was restored after 14 weeks. However, the community structures between the samples taken at 1–2 and 14 weeks differed at the bacterial classification level. The trend in pH was similar to the changes seen in bacterial communities, indicating that the pH of the leachate could be related to the shift in the microbial community. The results indicate that the composition of bacterial communities in leachates of decomposing pig carcasses shifted continuously during the study period and might be influenced by the burial site. PMID:26500442

  17. Complete nucleotide sequence of watermelon chlorotic stunt virus originating from Oman.

    PubMed

    Khan, Akhtar J; Akhtar, Sohail; Briddon, Rob W; Ammara, Um; Al-Matrooshi, Abdulrahman M; Mansoor, Shahid

    2012-07-01

    Watermelon chlorotic stunt virus (WmCSV) is a bipartite begomovirus (genus Begomovirus, family Geminiviridae) that causes economic losses to cucurbits, particularly watermelon, across the Middle East and North Africa. Recently squash (Cucurbita moschata) grown in an experimental field in Oman was found to display symptoms such as leaf curling, yellowing and stunting, typical of a begomovirus infection. Sequence analysis of the virus isolated from squash showed 97.6-99.9% nucleotide sequence identity to previously described WmCSV isolates for the DNA A component and 93-98% identity for the DNA B component. Agrobacterium-mediated inoculation to Nicotiana benthamiana resulted in the development of symptoms fifteen days post inoculation. This is the first bipartite begomovirus identified in Oman. Overall the Oman isolate showed the highest levels of sequence identity to a WmCSV isolate originating from Iran, which was confirmed by phylogenetic analysis. This suggests that WmCSV present in Oman has been introduced from Iran. The significance of this finding is discussed.

  18. Complete nucleotide sequence of the mitochondrial genome of a salamander, Mertensiella luschani.

    PubMed

    Zardoya, Rafael; Malaga-Trillo, Edward; Veith, Michael; Meyer, Axel

    2003-10-23

    The complete nucleotide sequence (16,650 bp) of the mitochondrial genome of the salamander Mertensiella luschani (Caudata, Amphibia) was determined. This molecule conforms to the consensus vertebrate mitochondrial gene order. However, it is characterized by a long non-coding intervening sequence with two 124-bp repeats between the tRNA(Thr) and tRNA(Pro) genes. The new sequence data were used to reconstruct a phylogeny of jawed vertebrates. Phylogenetic analyses of all mitochondrial protein-coding genes at the amino acid level recovered a robust vertebrate tree in which lungfishes are the closest living relatives of tetrapods, salamanders and frogs are grouped together to the exclusion of caecilians (the Batrachia hypothesis) in a monophyletic amphibian clade, turtles show diapsid affinities and are placed as sister group of crocodiles+birds, and the marsupials are grouped together with monotremes and basal to placental mammals. The deduced phylogeny was used to characterize the molecular evolution of vertebrate mitochondrial proteins. Amino acid frequencies were analyzed across the main lineages of jawed vertebrates, and leucine and cysteine were found to be the most and least abundant amino acids in mitochondrial proteins, respectively. Patterns of amino acid replacements were conserved among vertebrates. Overall, cartilaginous fishes showed the least variation in amino acid frequencies and replacements. Constancy of rates of evolution among the main lineages of jawed vertebrates was rejected. PMID:14604788

  19. Whole genome sequencing of a single Bos taurus animal for single nucleotide polymorphism discovery

    PubMed Central

    Eck, Sebastian H; Benet-Pagès, Anna; Flisikowski, Krzysztof; Meitinger, Thomas; Fries, Ruedi; Strom, Tim M

    2009-01-01

    Background The majority of the 2 million bovine single nucleotide polymorphisms (SNPs) currently available in dbSNP have been identified in a single breed, Hereford cattle, during the bovine genome project. In an attempt to evaluate the variance of a second breed, we have produced a whole genome sequence at low coverage of a single Fleckvieh bull. Results We generated 24 gigabases of sequence, mainly using 36-bp paired-end reads, resulting in an average 7.4-fold sequence depth. This coverage was sufficient to identify 2.44 million SNPs, 82% of which were previously unknown, and 115,000 small indels. A comparison with the genotypes of the same animal, generated on a 50 k oligonucleotide chip, revealed a detection rate of 74% and 30% for homozygous and heterozygous SNPs, respectively. The false positive rate, as determined by comparison with genotypes determined for 196 randomly selected SNPs, was approximately 1.1%. We further determined the allele frequencies of the 196 SNPs in 48 Fleckvieh and 48 Braunvieh bulls. 95% of the SNPs were polymorphic with an average minor allele frequency of 24.5% and with 83% of the SNPs having a minor allele frequency larger than 5%. Conclusions This work provides the first single cattle genome by next-generation sequencing. The chosen approach - low to medium coverage re-sequencing - added more than 2 million novel SNPs to the currently publicly available SNP resource, providing a valuable resource for the construction of high density oligonucleotide arrays in the context of genome-wide association studies. PMID:19660108

  20. Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages.

    PubMed

    Jayaswal, Vivek; Wong, Thomas K F; Robinson, John; Poladian, Leon; Jermiin, Lars S

    2014-09-01

    Molecular phylogenetic studies of homologous sequences of nucleotides often assume that the underlying evolutionary process was globally stationary, reversible, and homogeneous (SRH), and that a model of evolution with one or more site-specific and time-reversible rate matrices (e.g., the GTR rate matrix) is enough to accurately model the evolution of data over the whole tree. However, an increasing body of data suggests that evolution under these conditions is an exception, rather than the norm. To address this issue, several non-SRH models of molecular evolution have been proposed, but they either ignore heterogeneity in the substitution process across sites (HAS) or assume it can be modeled accurately using the Γ distribution. As an alternative to these models of evolution, we introduce a family of mixture models that approximate HAS without the assumption of an underlying predefined statistical distribution. This family of mixture models is combined with non-SRH models of evolution that account for heterogeneity in the substitution process across lineages (HAL). We also present two algorithms for searching model space and identifying an optimal model of evolution that is less likely to over- or underparameterize the data. The performance of the two new algorithms was evaluated using alignments of nucleotides with 10 000 sites simulated under complex non-SRH conditions on a 25-tipped tree. The algorithms were found to be very successful, identifying the correct HAL model with a 75% success rate (the average success rate for assigning rate matrices to the tree's 48 edges was 99.25%) and, for the correct HAL model, identifying the correct HAS model with a 98% success rate. Finally, parameter estimates obtained under the correct HAL-HAS model were found to be accurate and precise. The merits of our new algorithms were illustrated with an analysis of 42 337 second codon sites extracted from a concatenation of 106 alignments of orthologous genes encoded by the nuclear

  1. Skylign: a tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models

    PubMed Central

    2014-01-01

    Background Logos are commonly used in molecular biology to provide a compact graphical representation of the conservation pattern of a set of sequences. They render the information contained in sequence alignments or profile hidden Markov models by drawing a stack of letters for each position, where the height of the stack corresponds to the conservation at that position, and the height of each letter within a stack depends on the frequency of that letter at that position. Results We present a new tool and web server, called Skylign, which provides a unified framework for creating logos for both sequence alignments and profile hidden Markov models. In addition to static image files, Skylign creates a novel interactive logo plot for inclusion in web pages. These interactive logos enable scrolling, zooming, and inspection of underlying values. Skylign can avoid sampling bias in sequence alignments by down-weighting redundant sequences and by combining observed counts with informed priors. It also simplifies the representation of gap parameters, and can optionally scale letter heights based on alternate calculations of the conservation of a position. Conclusion Skylign is available as a website, a scriptable web service with a RESTful interface, and as a software package for download. Skylign’s interactive logos are easily incorporated into a web page with just a few lines of HTML markup. Skylign may be found at http://skylign.org. PMID:24410852
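
    The stack and letter heights described above can be sketched with the classic information-content calculation. This is a minimal illustration, not Skylign's own code: it assumes a DNA alphabet, ignores gaps, and omits the sequence weighting, informed priors, and alternate conservation measures the abstract mentions. The function name `logo_heights` and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def logo_heights(columns, alphabet="ACGT"):
    """Return, for each alignment column, a dict of letter -> height in bits.

    Stack height is the information content of the column
    (log2(alphabet size) minus Shannon entropy); each letter's height is
    its frequency times the stack height. Assumption: no small-sample
    correction and no priors, unlike a production logo tool.
    """
    max_bits = math.log2(len(alphabet))
    result = []
    for col in columns:
        counts = Counter(c for c in col if c in alphabet)  # gaps are ignored
        total = sum(counts.values())
        if total == 0:
            result.append({})
            continue
        freqs = {a: counts[a] / total for a in alphabet if counts[a]}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        info = max_bits - entropy
        result.append({a: p * info for a, p in freqs.items()})
    return result


# Hypothetical toy alignment: four sequences, four columns.
alignment = ["ACGT", "ACGA", "ACTA", "ACTC"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
heights = logo_heights(columns)
for position, stack in enumerate(heights, start=1):
    print(position, stack)
```

    A fully conserved column yields a single letter at the maximum height (2 bits for DNA), while a column split evenly between two letters yields a 1-bit stack shared equally by both.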

  2. Phylo-VISTA: An Interactive Visualization Tool for Multiple DNA Sequence Alignments

    SciTech Connect

    Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.; Brudno, Michael; Batzoglou, Serafim; Bethel, E. Wes; Rubin, Edward M.; Hamann, Bernd; Dubchak, Inna

    2004-04-01

    We have developed Phylo-VISTA (Shah et al., 2003), an interactive software tool for analyzing multiple alignments by visualizing a similarity measure for DNA sequences of multiple species. The complexity of visual presentation is effectively organized using a framework based upon inter-species phylogenetic relationships. The phylogenetic organization supports rapid, user-guided inter-species comparison. To aid in navigation through large sequence datasets, Phylo-VISTA provides a user with the ability to select and view data at varying resolutions. Multi-resolution data visualization and analysis, combined with the phylogenetic framework for inter-species comparison, produce a highly flexible and powerful tool for visual data analysis of multiple sequence alignments.

  3. A data parallel strategy for aligning multiple biological sequences on multi-core computers.

    PubMed

    Zhu, Xiangyuan; Li, Kenli; Salah, Ahmad

    2013-05-01

    In this paper, we address the large-scale biological sequence alignment problem, for which demand is increasing in computational biology. We employ the data parallelism paradigm, which is suitable for handling large-scale processing on multi-core computers, to achieve a high degree of parallelism. Using this paradigm, we propose a general strategy that can be used to speed up any multiple sequence alignment method. We applied five different clustering algorithms in our strategy and ran rigorous tests on an 8-core computer using four traditional benchmarks and artificially generated sequences. The results show that our multi-core-based implementations can achieve up to 151-fold improvements in execution time while losing 2.19% accuracy on average. The source code of the proposed strategy, together with the test sets used in our analysis, is available on request.

  4. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer

    PubMed Central

    Bernard, Guillaume; Chan, Cheong Xin; Ragan, Mark A.

    2016-01-01

    Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution. PMID:27363362

  5. CRASP: a program for analysis of coordinated substitutions in multiple alignments of protein sequences.

    PubMed

    Afonnikov, Dmitry A; Kolchanov, Nikolay A

    2004-07-01

    Recent results suggest that during evolution certain substitutions at protein sites may occur in a coordinated manner due to interactions between amino acid residues. Information on these coordinated substitutions may be useful for analysis of protein structure and function. CRASP is an Internet-available software tool for the detection and analysis of coordinated substitutions in multiple alignments of protein sequences. The approach is based on estimation of the correlation coefficient between the values of a physicochemical parameter at a pair of positions of sequence alignment. The program enables the user to detect and analyze pairwise relationships between amino acid substitutions at protein sequence positions, estimate the contribution of the coordinated substitutions to the evolutionary invariance or variability in integral protein physicochemical characteristics such as the net charge of protein residues and hydrophobic core volume. The CRASP program is available at http://wwwmgs.bionet.nsc.ru/mgs/programs/crasp/.
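
    The core idea, a correlation coefficient between the values of a physicochemical parameter at a pair of alignment positions, can be sketched as follows. This is only an illustration under stated assumptions, not the CRASP implementation: the parameter is a crude side-chain charge table (K, R = +1; D, E = -1; all others 0), the correlation is an unweighted Pearson coefficient, and the names `CHARGE`, `pearson` and `position_correlation` are hypothetical.

```python
import math

# Assumption: a simplified side-chain charge scale standing in for
# whatever physicochemical parameter the user selects.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def pearson(xs, ys):
    """Plain Pearson correlation; returns None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # an invariant position carries no covariation signal
    return sxy / math.sqrt(sxx * syy)


def position_correlation(alignment, i, j, scale=CHARGE):
    """Correlate the parameter values observed at columns i and j (0-based)."""
    xs = [scale.get(seq[i], 0.0) for seq in alignment]
    ys = [scale.get(seq[j], 0.0) for seq in alignment]
    return pearson(xs, ys)


# Hypothetical alignment in which columns 1 and 3 swap charges in concert.
alignment = ["AKGE", "AEGK", "ARGD", "ADGR"]
r = position_correlation(alignment, 1, 3)
print(r)
```

    In this toy alignment a positive residue at one position is always paired with a negative residue at the other, so the coefficient is strongly negative, which is the kind of compensatory pattern that would preserve net charge.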

  6. Detection and quantitation of single nucleotide polymorphisms, DNA sequence variations, DNA mutations, DNA damage and DNA mismatches

    DOEpatents

    McCutchen-Maloney, Sandra L.

    2002-01-01

    DNA mutation binding proteins alone and as chimeric proteins with nucleases are used with solid supports to detect DNA sequence variations, DNA mutations and single nucleotide polymorphisms. The solid supports may be flow cytometry beads, DNA chips, glass slides or DNA dipsticks. DNA molecules are coupled to solid supports to form DNA-support complexes. Labeled DNA is used with unlabeled DNA mutation binding proteins such as TthMutS to detect DNA sequence variations, DNA mutations and single nucleotide length polymorphisms by binding which gives an increase in signal. Unlabeled DNA is utilized with labeled chimeras to detect DNA sequence variations, DNA mutations and single nucleotide length polymorphisms by nuclease activity of the chimera which gives a decrease in signal.

  7. A parallel approach of COFFEE objective function to multiple sequence alignment

    NASA Astrophysics Data System (ADS)

    Zafalon, G. F. D.; Visotaky, J. M. V.; Amorim, A. R.; Valêncio, C. R.; Neves, L. A.; de Souza, R. C. G.; Machado, J. M.

    2015-09-01

    Computational tools that assist genomic analyses are increasingly necessary because the amount of available data is growing quickly. Given the high computational cost of deterministic algorithms for sequence alignment, many works concentrate on developing heuristic approaches to multiple sequence alignment. However, selecting an approach that offers solutions with good biological significance in a feasible execution time is a great challenge. This work therefore presents the parallelization of the processing steps of the MSA-GA tool, using the multithreaded paradigm in the execution of the COFFEE objective function. The standard objective function implemented in the tool is the Weighted Sum of Pairs (WSP), which produces some distortions in the final alignments when sequence sets with low similarity are aligned. In previous studies we therefore implemented the COFFEE objective function in the tool to smooth these distortions. Although the COFFEE objective function by its nature increases execution time, it contains steps that can be executed in parallel. With the improvements implemented in this work, the new approach runs 24% faster than the sequential approach with COFFEE. Moreover, the multithreaded COFFEE approach is more efficient than WSP because, besides being slightly faster, it produces better biological results.

  8. Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles.

    PubMed

    Gautheret, D; Lambert, A

    2001-11-01

    We present here a new approach to the problem of defining RNA signatures and finding their occurrences in sequence databases. The proposed method is based on "secondary structure profiles". An RNA sequence alignment with secondary structure information is used as an input. Two types of weight matrices/profiles are constructed from this alignment: single strands are represented by a classical lod-scores profile while helical regions are represented by an extended "helical profile" comprising 16 lod-scores per position, one for each of the 16 possible base-pairs. Database searches are then conducted using a simultaneous search for helical profiles and dynamic programming alignment of single strand profiles. The algorithm has been implemented into a new software, ERPIN, that performs both profile construction and database search. Applications are presented for several RNA motifs. The automated use of sequence information in both single-stranded and helical regions yields better sensitivity/specificity ratios than descriptor-based programs. Furthermore, since the translation of alignments into profiles is straightforward with ERPIN, iterative searches can easily be conducted to enrich collections of homologous RNAs.
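
    The two profile types described above can be illustrated with a single routine that turns alignment columns into log-odds (lod) scores. This is a hedged sketch, not ERPIN's code: the pseudocount, the uniform background frequencies, and the function name `lod_profile` are assumptions made for the example. A single-strand column is scored over the 4 bases, a helical column over the 16 possible base pairs.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGU"
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]  # 16 possible base pairs


def lod_profile(columns, symbols, background, pseudo=1.0):
    """Return one dict of symbol -> log2(observed / background) per column.

    Assumption: a flat pseudocount is added to every symbol so that
    unobserved symbols get a finite negative score.
    """
    profile = []
    for col in columns:
        counts = Counter(col)
        total = len(col) + pseudo * len(symbols)
        profile.append({
            s: math.log2(((counts[s] + pseudo) / total) / background[s])
            for s in symbols
        })
    return profile


# Single-strand column: one base per aligned sequence (hypothetical data).
strand_cols = [["A", "A", "G", "A"]]
strand = lod_profile(strand_cols, BASES, {b: 0.25 for b in BASES})

# Helical column: the two paired bases of each sequence, read as one symbol.
helix_cols = [["GC", "GC", "AU", "GC"]]
helix = lod_profile(helix_cols, PAIRS, {p: 1 / 16 for p in PAIRS})

print(len(helix[0]), round(helix[0]["GC"], 3), round(helix[0]["AA"], 3))
```

    Scoring a candidate helix then amounts to summing the pair scores looked up column by column, so covarying pairs such as GC and AU are rewarded jointly rather than base by base.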

  9. SVM-BALSA: Remote Homology Detection based on Bayesian Sequence Alignment

    SciTech Connect

    Webb-Robertson, Bobbie-Jo M.; Oehmen, Chris S.; Matzke, Melissa M.

    2005-11-10

    Using biopolymer sequence comparison methods to identify evolutionarily related proteins is one of the most common tasks in bioinformatics. Recently, support vector machines (SVMs) utilizing statistical learning theory have been employed in the problem of remote homology detection and shown to outperform iterative profile methods such as PSI-BLAST. In this study we demonstrate that using a Bayesian alignment score, which accounts for the uncertainty of all possible alignments, in the SVM construction improves sensitivity compared to the traditional dynamic programming implementation.

  10. Comparative Topological Analysis of Neuronal Arbors via Sequence Representation and Alignment

    NASA Astrophysics Data System (ADS)

    Gillette, Todd Aaron

    Neuronal morphology is a key mediator of neuronal function, defining the profile of connectivity and shaping signal integration and propagation. Reconstructing neurite processes is technically challenging and thus data has historically been relatively sparse. Data collection and curation along with more efficient and reliable data production methods provide opportunities for the application of informatics to find new relationships and more effectively explore the field. This dissertation presents a method for aiding the development of data production as well as a novel representation and set of analyses for extracting morphological patterns. The DIADEM Challenge was organized for the purposes of determining the state of the art in automated neuronal reconstruction and what existing challenges remained. As one of the co-organizers of the Challenge, I developed the DIADEM metric, a tool designed to measure the effectiveness of automated reconstruction algorithms by comparing resulting reconstructions to expert-produced gold standards and identifying errors of various types. It has been used in the DIADEM Challenge and in the testing of several algorithms since. Further, this dissertation describes a topological sequence representation of neuronal trees amenable to various forms of sequence analysis, notably motif analysis, global pairwise alignment, clustering, and multiple sequence alignment. Motif analysis of neuronal arbors shows a large difference in bifurcation type proportions between axons and dendrites, but that relatively simple growth mechanisms account for most higher order motifs. Pairwise global alignment of topological sequences, modified from traditional sequence alignment to preserve tree relationships, enabled cluster analysis which displayed strong correspondence with known cell classes by cell type, species, and brain region. Multiple alignment of sequences in selected clusters enabled the extraction of conserved features, revealing mouse

  11. Real-time single-molecule electronic DNA sequencing by synthesis using polymer-tagged nucleotides on a nanopore array.

    PubMed

    Fuller, Carl W; Kumar, Shiv; Porel, Mintu; Chien, Minchen; Bibillo, Arek; Stranges, P Benjamin; Dorwart, Michael; Tao, Chuanjuan; Li, Zengmin; Guo, Wenjing; Shi, Shundi; Korenblum, Daniel; Trans, Andrew; Aguirre, Anne; Liu, Edward; Harada, Eric T; Pollard, James; Bhat, Ashwini; Cech, Cynthia; Yang, Alexander; Arnold, Cleoma; Palla, Mirkó; Hovis, Jennifer; Chen, Roger; Morozova, Irina; Kalachikov, Sergey; Russo, James J; Kasianowicz, John J; Davis, Randy; Roever, Stefan; Church, George M; Ju, Jingyue

    2016-05-10

    DNA sequencing by synthesis (SBS) offers a robust platform to decipher nucleic acid sequences. Recently, we reported a single-molecule nanopore-based SBS strategy that accurately distinguishes four bases by electronically detecting and differentiating four different polymer tags attached to the 5'-phosphate of the nucleotides during their incorporation into a growing DNA strand catalyzed by DNA polymerase. Further developing this approach, we report here the use of nucleotides tagged at the terminal phosphate with oligonucleotide-based polymers to perform nanopore SBS on an α-hemolysin nanopore array platform. We designed and synthesized several polymer-tagged nucleotides using tags that produce different electrical current blockade levels and verified they are active substrates for DNA polymerase. A highly processive DNA polymerase was conjugated to the nanopore, and the conjugates were complexed with primer/template DNA and inserted into lipid bilayers over individually addressable electrodes of the nanopore chip. When an incoming complementary-tagged nucleotide forms a tight ternary complex with the primer/template and polymerase, the tag enters the pore, and the current blockade level is measured. The levels displayed by the four nucleotides tagged with four different polymers captured in the nanopore in such ternary complexes were clearly distinguishable and sequence-specific, enabling continuous sequence determination during the polymerase reaction. Thus, real-time single-molecule electronic DNA sequencing data with single-base resolution were obtained. The use of these polymer-tagged nucleotides, combined with polymerase tethering to nanopores and multiplexed nanopore sensors, should lead to new high-throughput sequencing methods. PMID:27091962

  12. Real-time single-molecule electronic DNA sequencing by synthesis using polymer-tagged nucleotides on a nanopore array

    PubMed Central

    Fuller, Carl W.; Kumar, Shiv; Porel, Mintu; Chien, Minchen; Bibillo, Arek; Stranges, P. Benjamin; Dorwart, Michael; Tao, Chuanjuan; Li, Zengmin; Guo, Wenjing; Shi, Shundi; Korenblum, Daniel; Trans, Andrew; Aguirre, Anne; Liu, Edward; Harada, Eric T.; Pollard, James; Bhat, Ashwini; Cech, Cynthia; Yang, Alexander; Arnold, Cleoma; Palla, Mirkó; Hovis, Jennifer; Chen, Roger; Morozova, Irina; Kalachikov, Sergey; Russo, James J.; Kasianowicz, John J.; Davis, Randy; Roever, Stefan; Church, George M.; Ju, Jingyue

    2016-01-01

    DNA sequencing by synthesis (SBS) offers a robust platform to decipher nucleic acid sequences. Recently, we reported a single-molecule nanopore-based SBS strategy that accurately distinguishes four bases by electronically detecting and differentiating four different polymer tags attached to the 5′-phosphate of the nucleotides during their incorporation into a growing DNA strand catalyzed by DNA polymerase. Further developing this approach, we report here the use of nucleotides tagged at the terminal phosphate with oligonucleotide-based polymers to perform nanopore SBS on an α-hemolysin nanopore array platform. We designed and synthesized several polymer-tagged nucleotides using tags that produce different electrical current blockade levels and verified they are active substrates for DNA polymerase. A highly processive DNA polymerase was conjugated to the nanopore, and the conjugates were complexed with primer/template DNA and inserted into lipid bilayers over individually addressable electrodes of the nanopore chip. When an incoming complementary-tagged nucleotide forms a tight ternary complex with the primer/template and polymerase, the tag enters the pore, and the current blockade level is measured. The levels displayed by the four nucleotides tagged with four different polymers captured in the nanopore in such ternary complexes were clearly distinguishable and sequence-specific, enabling continuous sequence determination during the polymerase reaction. Thus, real-time single-molecule electronic DNA sequencing data with single-base resolution were obtained. The use of these polymer-tagged nucleotides, combined with polymerase tethering to nanopores and multiplexed nanopore sensors, should lead to new high-throughput sequencing methods. PMID:27091962

  13. Assessment of the nucleotide sequence variability in the bovine T-cell receptor alpha delta joining gene region.

    PubMed

    Fries, R; Ewald, D; Thaller, G; Buitkamp, J

    2001-05-01

    The sequence of 2,193 nucleotides from the bovine T-cell receptor alpha/delta joining gene region (TCRADJ) was determined and compared with the corresponding human and murine sequences. The identity was 75.3% for the comparison of the Bos taurus vs. the Homo sapiens sequence and 63.8% for the Bos taurus vs. the Mus musculus sequence. This comparison permitted the identification of the putatively functional elements within the bovine sequence. Direct sequencing of 2,110 nucleotides in nine animals revealed 12 variable sites. Estimates, based on direct sequencing in three Holstein Friesian animals, for the two measures of sequence variability, nucleotide polymorphism (θ) and nucleotide diversity (π), were 0.00050 (±0.00036) and 0.00077 (±0.00056), respectively. The test statistic, Tajima's D, for the comparison of the two measures indicates that the difference between θ and π is close to significance (P < 0.05), suggesting the possibility of selective forces acting on the studied genomic region. Allelic variation at 5 of the 12 variable sites was analysed in 359 animals (48 Anatolian Black, 56 Braunvieh, 115 Fleckvieh, 47 Holstein Friesian, 50 Simmental and 43 Pinzgauer) using the oligonucleotide ligation assay (OLA) in combination with the enzyme-linked immunosorbent assay (ELISA). Nine unambiguous haplotypes could be derived based on animals with a maximum of one heterozygous site. Four to seven haplotypes were present in the different breeds. When taking into account the frequencies of the haplotypes in the different breeds, especially in Anatolian Black, an ancestral cattle population, we could establish the likely phylogenetic relationships of the haplotypes. Such haplotype trees are the basis for cladistic candidate gene analysis. Our study demonstrates that the systematic search of single nucleotide polymorphisms (SNPs) is useful for analysing all aspects of variability of a given genomic region.
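
    The two variability measures compared by Tajima's D, nucleotide polymorphism (Watterson's θ, from the number of segregating sites) and nucleotide diversity (π, from average pairwise differences), can be computed per site as sketched below. This is a generic textbook calculation on a hypothetical toy alignment of phased sequences, not the authors' procedure or data; the function names are assumptions, and the variance term needed for D itself is omitted.

```python
from itertools import combinations


def watterson_theta(seqs):
    """Per-site Watterson estimator: S / a_n / L, with a_n = sum_{k=1}^{n-1} 1/k."""
    n, length = len(seqs), len(seqs[0])
    segregating = sum(
        1 for i in range(length) if len({s[i] for s in seqs}) > 1
    )
    a_n = sum(1.0 / k for k in range(1, n))
    return segregating / a_n / length


def nucleotide_diversity(seqs):
    """Per-site pi: mean number of differences over all sequence pairs, / L."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / len(pairs) / length


# Hypothetical aligned sequences (equal length, no gaps).
seqs = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC", "ACGTACGTAC"]
theta = watterson_theta(seqs)
pi = nucleotide_diversity(seqs)
print(theta, pi)
```

    Under neutrality the two estimators are expected to agree, so the sign of π minus θ is what Tajima's D standardizes: an excess of intermediate-frequency variants pushes π above θ, and an excess of rare variants does the opposite.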

  14. Cloning and nucleotide sequence of the gene coding for citrate synthase from a thermotolerant Bacillus sp

    SciTech Connect

    Schendel, F.J.; August, P.R.; Anderson, C.R.; Flickinger, M.C.; Hanson, R.S.

    1992-01-01

    Acetate salts are emerging as potentially attractive bulk chemicals for a variety of environmental applications, for example, as catalysts to facilitate combustion of high-sulfur coal by electrical utilities and as the biodegradable noncorrosive highway deicing salt calcium magnesium acetate. The structural gene coding for citrate synthase from the gram-positive soil isolate Bacillus sp. strain C4 (ATCC 55182) capable of secreting acetic acid at pH 5.0 to 7.0 in the presence of dolime has been cloned from a genomic library by complementation of an Escherichia coli auxotrophic mutant lacking citrate synthase. The nucleotide sequence of the entire 3.1-kb HindIII fragment has been determined, and one major open reading frame was found coding for citrate synthase (ctsA). Citrate synthase from Bacillus sp. strain C4 was found to be a dimer (M_r, 84,500) with a subunit with an M_r of 42,000. The N-terminal sequence was found to be identical with that predicted from the gene sequence. The kinetics were best fit to a bisubstrate enzyme with an ordered mechanism. Bacillus sp. strain C4 citrate synthase was not activated by potassium chloride and was not inhibited by NADH, ATP, ADP, or AMP at levels up to 1 mM. The predicted amino acid sequence was compared with that of the E. coli, Acinetobacter anitratum, Pseudomonas aeruginosa, Rickettsia prowazekii, porcine heart, and Saccharomyces cerevisiae cytoplasmic and mitochondrial enzymes.

  15. The qa repressor gene of Neurospora crassa: wild-type and mutant nucleotide sequences.

    PubMed Central

    Huiet, L; Giles, N H

    1986-01-01

    The qa-1S gene, one of two regulatory genes in the qa gene cluster of Neurospora crassa, encodes the qa repressor. The qa-1S gene together with the qa-1F gene, which encodes the qa activator protein, control the expression of all seven qa genes, including those encoding the inducible enzymes responsible for the utilization of quinic acid as a carbon source. The nucleotide sequence of the qa-1S gene and its flanking regions has been determined. The deduced coding sequence for the qa-1S protein encodes 918 amino acids with a calculated molecular weight of 100,650 and is interrupted by a single 66-base-pair intervening sequence. Both constitutive and noninducible mutants occur in the qa-1S gene and two different mutations of each type have been cloned and sequenced. All four mutations occur within the predicted coding region of the qa-1S gene. This result strongly supports the hypothesis that the qa-1S gene encodes a repressor. All four mutations are located within codons for the last 300 amino acids of the qa-1S protein. The mutations in three of the mutants involve amino acid substitutions, while the fourth mutant, which has a constitutive phenotype, contains a frameshift mutation. The two constitutive mutations occur in the most distal region of the gene, possibly implicating the COOH-terminal region of the qa repressor in binding to its target. The two noninducible mutations occur in a region proximal to the constitutive mutations, possibly implicating this region of the qa repressor in binding the inducer. PMID:3010294

  16. Human secreted carbonic anhydrase: cDNA cloning, nucleotide sequence, and hybridization histochemistry

    SciTech Connect

    Aldred, P.; Fu, Ping; Barrett, G.; Penschow, J.D.; Wright, R.D.; Coghlan, J.P.; Fernley, R.T.

    1991-01-01

    Complementary DNA clones coding for the human secreted carbonic anhydrase isozyme (CA VI) have been isolated and their nucleotide sequences determined. These clones identify a 1.45-kb mRNA that is present in high levels in parotid and submandibular salivary glands but absent in other tissues such as the sublingual gland, kidney, liver, and prostate gland. Hybridization histochemistry of human salivary glands shows mRNA for CA VI located in the acinar cells of these glands. The cDNA clones encode a protein of 308 amino acids that includes a 17 amino acid leader sequence typical of secreted proteins. The mature protein has 291 amino acids compared to 259 or 260 for the cytoplasmic isozymes, with most of the extra amino acids present as a carboxyl terminal extension. In comparison, sheep CA VI has a 45 amino acid extension. Overall the human CA VI protein has a sequence identity of 35% with human CA II, while residues involved in the active site of the enzymes have been conserved. The human and sheep secreted carbonic anhydrases have a sequence identity of 72%. This includes the two cysteine residues that are known to be involved in an intramolecular disulfide bond in the sheep CA VI. The enzyme is known to be glycosylated and three potential N-glycosylation sites (Asn-X-Thr/Ser) have been identified. Two of these are known to be glycosylated in sheep CA VI. Southern analysis of human DNA indicates that there is only one gene coding for CA VI.

  17. Nucleotide sequence of an immediate-early frog virus 3 gene.

    PubMed

    Willis, D; Foglesong, D; Granoff, A

    1984-12-01

    We have used "gene walking" with synthetic oligonucleotides and M13 dideoxynucleotide sequencing techniques to obtain the complete coding and flanking sequences of the gene encoding a major immediate-early RNA (molecular weight, 169,000) of frog virus 3. R-loop mapping of the cloned XbaI K fragment of frog virus 3 DNA with immediate-early RNA from infected cells showed that an RNA of approximately 500 to 600 nucleotides (the right size to code for the immediate-early viral 18-kilodalton protein of unknown function) hybridized to a region within 100 base pairs of one end of the XbaI K fragment; no evidence for splicing was observed in the electron microscope or by single-strand nuclease analysis. Further restriction mapping narrowed the location of the gene to the XbaI end of a 2-kilobase-pair XbaI-BglII fragment, which was bidirectionally subcloned into the bacteriophage pair mp10 and mp11 for sequencing. Mung bean nuclease mapping was used to identify both the 5' and the 3' ends of the mRNA. The 5' end mapped within an AT-rich region 19 base pairs upstream from two in-phase AUG start codons that were immediately followed by an open reading frame of 157 amino acids. Another AT-rich sequence was found at -29 base pairs from the 5' end of the mRNA start site; this sequence may function as a TATA box. The 3' end of the message displayed considerable microheterogeneity, but clearly terminated within a third AT-rich region 50 to 60 base pairs from the translation stop codon. The eucaryotic polyadenylic acid addition signal (AATAAA) was not present, a finding to be expected since frog virus 3 mRNA is not polyadenylated. Both the single-stranded mp10 clone of the XbaI-BglII fragment and a 15-base oligonucleotide complementary to the region flanking the two AUG translation start codons inhibited translation of the immediate-early 18-kilodalton protein in vitro, confirming the identity of the sequenced gene. As the regulatory sequences of this gene did not resemble those of

  18. TCS: a web server for multiple sequence alignment evaluation and phylogenetic reconstruction.

    PubMed

    Chang, Jia-Ming; Di Tommaso, Paolo; Lefort, Vincent; Gascuel, Olivier; Notredame, Cedric

    2015-07-01

    This article introduces the Transitive Consistency Score (TCS) web server, a service that makes it possible to estimate the local reliability of protein multiple sequence alignments (MSAs) using the TCS index. The evaluation can be used to identify the aligned positions most likely to contain structurally analogous residues and also most likely to support an accurate phylogenetic reconstruction. The TCS scoring scheme has been shown to be an accurate predictor of structural alignment correctness among commonly used methods. It has also been shown to outperform common filtering schemes like Gblocks or trimAl when doing MSA post-processing prior to phylogenetic tree reconstruction. The web server is available from http://tcoffee.crg.cat/tcs.

  19. OrthoSelect: a web server for selecting orthologous gene alignments from EST sequences.

    PubMed

    Schreiber, Fabian; Wörheide, Gert; Morgenstern, Burkhard

    2009-07-01

    In the absence of whole genome sequences for many organisms, the use of expressed sequence tags (EST) offers an affordable approach for researchers conducting phylogenetic analyses to gain insight about the evolutionary history of organisms. Reliable alignments for phylogenomic analyses are based on orthologous gene sequences from different taxa. So far, researchers have not sufficiently tackled the problem of the completely automated construction of such datasets. Existing software tools are either semi-automated, covering only part of the necessary data processing, or implemented as a pipeline, requiring the installation and configuration of a cascade of external tools, which may be time-consuming and hard to manage. To simplify data set construction for phylogenomic studies, we set up a web server that uses our recently developed OrthoSelect approach. To the best of our knowledge, our web server is the first web-based EST analysis pipeline that allows the detection of orthologous gene sequences in EST libraries and outputs orthologous gene alignments. Additionally, OrthoSelect provides the user with an extensive results section that lists and visualizes all important results, such as annotations, data matrices for each gene/taxon and orthologous gene alignments. The web server is available at http://orthoselect.gobics.de.

  20. Complete nucleotide sequence of a plant tumor-inducing Ti plasmid.

    PubMed

    Suzuki, K; Hattori, Y; Uraji, M; Ohta, N; Iwata, K; Murata, K; Kato, A; Yoshida, K

    2000-01-25

    Crown gall tumor disease in dicot plants is caused by Agrobacterium tumefaciens harboring a giant tumor-inducing (Ti) plasmid. Here, for the first time among agrobacterial plasmids, the nucleotide sequence of a typical nopaline-type Ti plasmid (pTi-SAKURA) was determined completely. In total, 195 open reading frames (ORFs) were estimated in the 206479 bp long sequence. 20 genes for conjugation, three for replication, 22 for pathogenesis and 37 for genetic colonization of host plants were found within two-thirds of the plasmid. These genes formed seven functional gene clusters with narrow inter-cluster spaces. In the remaining one-third of the plasmid, novel genes including homologs of mutT, Rhizobium nodQ and Sphingomonas ligE genes were found, which are likely to be responsible for the broad host range. Restriction fragment length variation indicates extreme plasticity of the part required for conjugational gene transfer and the above-mentioned one-third of the plasmid, even among closely related Ti plasmids. PMID:10721727

  1. Complete Nucleotide Sequence Analysis of the Norovirus GII.4 Sydney Variant in South Korea

    PubMed Central

    Park, Ji-Sun; Lee, Sung-Geun; Cho, Han-Gil; Jheong, Weon-Hwa; Paik, Soon-Young

    2015-01-01

    Norovirus is the primary cause of acute gastroenteritis in individuals of all ages. In Australia, a new strain of norovirus (GII.4) was identified in March 2012, and this strain has spread rapidly around the world. In August 2012, this new GII.4 strain was identified in patients in South Korea. Therefore, to examine the characteristics of the epidemic norovirus GII.4 2012 variant in South Korea, we conducted KM272334 full-length genomic analysis. The genome of the gg-12-08-04 strain consisted of 7,558 bp and contained three open reading frame (ORF) composites throughout the whole genome: ORF1 (5,100 bp), ORF2 (1,623 bp), and ORF3 (807 bp). Phylogenetic analyses showed that gg-12-08-04 belonged to the GII.4 Sydney 2012 variant, sharing 98.92% nucleotide similarity with this variant strain. According to SimPlot analysis, the gg-12-08-04 strain was a recombinant strain with breakpoint at the ORF1/2 junction between Osaka 2007 and Apeldoorn 2008 strains. This study is the first report of the complete sequence of the GII.4 Sydney 2012 strain in South Korea. Therefore, this may represent the standard sequence of the norovirus GII.4 2012 variant in South Korea and could therefore be useful for the development of norovirus vaccines. PMID:25688356

  2. Nucleotide sequence and structural organization of the human vasopressin pituitary receptor (V3) gene.

    PubMed

    René, P; Lenne, F; Ventura, M A; Bertagna, X; de Keyzer, Y

    2000-01-01

    In the pituitary, vasopressin triggers ACTH release through a specific receptor subtype, termed V3 or V1b. We cloned the V3 cDNA and showed that its expression was almost exclusive to pituitary corticotrophs and some corticotroph tumors. To study the determinants of this tissue specificity, we have now cloned the gene for the human (h) V3 receptor and characterized its structure. It is composed of two exons, spanning 10kb, with the coding region interrupted between transmembrane domains 6 and 7. We established that the transcription initiation site is located 498 nucleotides upstream of the initiator codon and showed that two polyadenylation sites may be used, while the most frequent is the most downstream. Sequence analysis of the promoter region showed no TATA box but identified consensus binding motifs for Sp1, CREB, and half sites of the estrogen receptor binding site. However comparison with another corticotroph-specific gene, proopiomelanocortin, did not identify common regulatory elements in the two promoters except for a short GC-rich region. Unexpectedly, hV3 gene analysis revealed that a formerly cloned 'artifactual' hV3 cDNA indeed corresponded to a spliced antisense transcript, overlapping the 5' part of the coding sequence in exon 1 and the promoter region. This transcript, hV3rev, was detected in normal pituitary and in many corticotroph tumors expressing hV3 sense mRNA and may therefore play a role in hV3 gene expression.

  3. Predicting Mendelian Disease-Causing Non-Synonymous Single Nucleotide Variants in Exome Sequencing Studies

    PubMed Central

    Bao, Su-Ying; Yang, Wanling; Ho, Shu-Leong; Song, Yong-Qiang; Sham, Pak C.

    2013-01-01

    Exome sequencing is becoming a standard tool for mapping Mendelian disease-causing (or pathogenic) non-synonymous single nucleotide variants (nsSNVs). Minor allele frequency (MAF) filtering approach and functional prediction methods are commonly used to identify candidate pathogenic mutations in these studies. Combining multiple functional prediction methods may increase accuracy in prediction. Here, we propose to use a logit model to combine multiple prediction methods and compute an unbiased probability of a rare variant being pathogenic. Also, for the first time we assess the predictive power of seven prediction methods (including SIFT, PolyPhen2, CONDEL, and logit) in predicting pathogenic nsSNVs from other rare variants, which reflects the situation after MAF filtering is done in exome-sequencing studies. We found that a logit model combining all or some original prediction methods outperforms other methods examined, but is unable to discriminate between autosomal dominant and autosomal recessive disease mutations. Finally, based on the predictions of the logit model, we estimate that an individual has around 5% of rare nsSNVs that are pathogenic and carries ∼22 pathogenic derived alleles at least, which if made homozygous by consanguineous marriages may lead to recessive diseases. PMID:23341771
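
    As a rough illustration of the logit combination described in the abstract above, the sketch below passes a weighted sum of per-method scores through the logistic function. The function name, weights, intercept, and score values are all made-up placeholders; the abstract does not give the fitted coefficients, so this should not be read as the authors' model.

```python
import math


def combined_probability(scores, weights, intercept):
    """Combine several prediction scores with a logistic (logit) model.

    Every number passed in is hypothetical: the coefficients fitted in
    the study are not stated in the abstract.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    # Linear predictor: intercept plus the weighted sum of the scores.
    linear = intercept + sum(w * s for w, s in zip(weights, scores))
    # The logistic function maps the linear predictor into (0, 1).
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical scores from three predictors, each scaled to [0, 1].
example_scores = [0.9, 0.8, 0.7]
example_weights = [2.0, 1.5, 1.0]  # placeholder weights
example_intercept = -2.5           # placeholder intercept
p = combined_probability(example_scores, example_weights, example_intercept)
```

    With these placeholder values the linear predictor is positive, so the combined probability lies above 0.5; real coefficients would have to be estimated from labelled pathogenic and benign variants.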

  4. A single-nucleotide substitution mutator phenotype revealed by exome sequencing of human colon adenomas.

    PubMed

    Nikolaev, Sergey I; Sotiriou, Sotirios K; Pateras, Ioannis S; Santoni, Federico; Sougioultzis, Stavros; Edgren, Henrik; Almusa, Henrikki; Robyr, Daniel; Guipponi, Michel; Saarela, Janna; Gorgoulis, Vassilis G; Antonarakis, Stylianos E; Halazonetis, Thanos D

    2012-12-01

    Oncogene-induced DNA replication stress is thought to drive genomic instability in cancer. In particular, replication stress can explain the high prevalence of focal genomic deletions mapping within very large genes in human tumors. However, the origin of single-nucleotide substitutions (SNS) in nonfamilial cancers is strongly debated. Some argue that cancers have a mutator phenotype, whereas others argue that the normal DNA replication error rates are sufficient to explain the number of observed SNSs. Here, we sequenced the exomes of 24, mostly precancerous, colon polyps. Analysis of the sequences revealed mutations in the APC, CTNNB1, and BRAF genes as the presumptive cancer-initiating events and many passenger SNSs. We used the number of SNSs in the various lesions to calculate mutation rates for normal colon and adenomas and found that colon adenomas exhibit a mutator phenotype. Interestingly, the SNSs in the adenomas mapped more often than expected within very large genes, where focal deletions in response to DNA replication stress also map. We propose that single-stranded DNA generated in response to oncogene-induced replication stress compromises the repair of deaminated cytosines and other damaged bases, leading to the observed SNS mutator phenotype.

  5. Nucleotide sequences and mutational analysis of the structural genes for nitrogenase 2 of Azotobacter vinelandii.

    PubMed Central

    Joerger, R D; Loveless, T M; Pau, R N; Mitchenall, L A; Simon, B H; Bishop, P E

    1990-01-01

    The nucleotide sequence (6,559 base pairs) of the genomic region containing the structural genes for nitrogenase 2 (V nitrogenase) from Azotobacter vinelandii was determined. The open reading frames present in this region are organized into two transcriptional units. One contains vnfH (encoding dinitrogenase reductase 2) and a ferredoxinlike open reading frame (Fd). The second one includes vnfD (encoding the alpha subunit of dinitrogenase 2), vnfG (encoding a product similar to the delta subunit of dinitrogenase 2 from A. chroococcum), and vnfK (encoding the beta subunit of dinitrogenase 2). The 5'-flanking regions of vnfH and vnfD contain sequences similar to ntrA-dependent promoters. This gene arrangement allows independent expression of vnfH-Fd and vnfDGK. Mutant strains (CA80 and CA11.80) carrying an insertion in vnfH are still able to synthesize the alpha and beta subunits of dinitrogenase 2 when grown in N-free, Mo-deficient, V-containing medium. A strain (RP1.11) carrying a deletion-plus-insertion mutation in the vnfDGK region produced only dinitrogenase reductase 2. PMID:2345152

  6. Nucleotide sequences and mutational analysis of the structural genes for nitrogenase 2 of Azotobacter vinelandii.

    PubMed

    Joerger, R D; Loveless, T M; Pau, R N; Mitchenall, L A; Simon, B H; Bishop, P E

    1990-06-01

    The nucleotide sequence (6,559 base pairs) of the genomic region containing the structural genes for nitrogenase 2 (V nitrogenase) from Azotobacter vinelandii was determined. The open reading frames present in this region are organized into two transcriptional units. One contains vnfH (encoding dinitrogenase reductase 2) and a ferredoxinlike open reading frame (Fd). The second one includes vnfD (encoding the alpha subunit of dinitrogenase 2), vnfG (encoding a product similar to the delta subunit of dinitrogenase 2 from A. chroococcum), and vnfK (encoding the beta subunit of dinitrogenase 2). The 5'-flanking regions of vnfH and vnfD contain sequences similar to ntrA-dependent promoters. This gene arrangement allows independent expression of vnfH-Fd and vnfDGK. Mutant strains (CA80 and CA11.80) carrying an insertion in vnfH are still able to synthesize the alpha and beta subunits of dinitrogenase 2 when grown in N-free, Mo-deficient, V-containing medium. A strain (RP1.11) carrying a deletion-plus-insertion mutation in the vnfDGK region produced only dinitrogenase reductase 2.

  7. Nucleotide sequence and phylogenetic analysis of a new potexvirus: Malva mosaic virus.

    PubMed

    Côté, Fabien; Paré, Christine; Majeau, Nathalie; Bolduc, Marilène; Leblanc, Eric; Bergeron, Michel G; Bernardy, Michael G; Leclerc, Denis

    2008-01-01

    A filamentous virus isolated from Malva neglecta Wallr. (common mallow) and propagated in Chenopodium quinoa was grown and cloned, and its complete nucleotide sequence was determined (GenBank accession # DQ660333). The genomic RNA is 6858 nt in length and contains five major open reading frames (ORFs). The genomic organization is similar to that of members of the Potexvirus genus in the Flexiviridae family, and the virus-encoded proteins share homology with those of this group. Phylogenetic analysis revealed a close relationship with narcissus mosaic virus (NMV), scallion virus X (ScaVX) and, to a lesser extent, with Alstroemeria virus X (AlsVX) and pepino mosaic virus (PepMV). A novel putative pseudoknot structure is predicted in the 3'-UTR of a subgroup of potexviruses, including this newly described virus. The consensus GAAAA sequence is detected at the 5'-end of the genomic RNA and experimental data strongly suggest that this motif could be a distinctive hallmark of this genus. The name Malva mosaic virus is proposed. PMID:18054524

  8. Complete nucleotide sequence of rose yellow leaf virus, a new member of the family Tombusviridae.

    PubMed

    Mollov, Dimitre; Lockhart, Ben; Zlesak, David C

    2014-10-01

    The genome of the rose yellow leaf virus (RYLV) has been determined to be 3918 nucleotides long and to contain seven open reading frames (ORFs). ORF1 encodes a 27-kDa peptide (p27). ORF2 shares a common start codon with ORF1 and continues through the amber stop codon of p27 to encode an 87-kDa (p87) protein that has amino acid similarity to the RNA-dependent RNA polymerase (RdRp) of members of the family Tombusviridae. ORFs 3 and 4 have no significant amino acid similarity to known functional viral ORFs. ORF5 encodes a 6-kDa (p6) protein that has similarity to movement proteins of members of the Tombusviridae. ORF5A has no conventional start codon and overlaps with p6. A putative +1 frameshift mechanism allows p6 translation to continue through the stop codon and results in a 12-kDa protein that has high homology to the carmovirus p13 movement protein. The 37-kDa protein encoded by ORF6 has amino acid sequence similarity to coat proteins (CP) of members of the Tombusviridae. ORF7 has no significant amino acid similarity to known viral ORFs. Phylogenetic analysis of the RdRp amino acid sequences grouped RYLV together with the unclassified Rosa rugosa leaf distortion virus (RrLDV), pelargonium line pattern virus (PLPV), and pelargonium chlorotic ring pattern virus (PCRPV) in a distinct subgroup of the family Tombusviridae. PMID:24838852

  9. Predicting mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies.

    PubMed

    Li, Miao-Xin; Kwan, Johnny S H; Bao, Su-Ying; Yang, Wanling; Ho, Shu-Leong; Song, Yong-Qiang; Sham, Pak C

    2013-01-01

    Exome sequencing is becoming a standard tool for mapping Mendelian disease-causing (or pathogenic) non-synonymous single nucleotide variants (nsSNVs). Minor allele frequency (MAF) filtering approach and functional prediction methods are commonly used to identify candidate pathogenic mutations in these studies. Combining multiple functional prediction methods may increase accuracy in prediction. Here, we propose to use a logit model to combine multiple prediction methods and compute an unbiased probability of a rare variant being pathogenic. Also, for the first time we assess the predictive power of seven prediction methods (including SIFT, PolyPhen2, CONDEL, and logit) in predicting pathogenic nsSNVs from other rare variants, which reflects the situation after MAF filtering is done in exome-sequencing studies. We found that a logit model combining all or some original prediction methods outperforms other methods examined, but is unable to discriminate between autosomal dominant and autosomal recessive disease mutations. Finally, based on the predictions of the logit model, we estimate that an individual has around 5% of rare nsSNVs that are pathogenic and carries ~22 pathogenic derived alleles at least, which if made homozygous by consanguineous marriages may lead to recessive diseases. PMID:23341771

  10. Mutations in core nucleotide sequence of hepatitis B virus correlate with fulminant and severe hepatitis.

    PubMed Central

    Ehata, T; Omata, M; Chuang, W L; Yokosuka, O; Ito, Y; Hosoda, K; Ohto, M

    1993-01-01

    Infection with hepatitis B virus leads to a wide spectrum of liver injury, including self-limited acute hepatitis, fulminant hepatitis, and chronic hepatitis with progression to cirrhosis or acute exacerbation to liver failure, as well as an asymptomatic chronic carrier state. Several studies have suggested that the hepatitis B core antigen could be an immunological target of cytotoxic T lymphocytes. To investigate the reason why the extreme immunological attack occurred in fulminant hepatitis and severe exacerbation patients, the entire precore and core region of hepatitis B virus DNA was sequenced in 24 subjects (5 fulminant, 10 severe fatal exacerbation, and 9 self-limited acute hepatitis patients). No significant change in the nucleotide sequence and deduced amino acid residue was noted in the nine self-limited acute hepatitis patients. In contrast, clustering changes in a small segment of 16 amino acids (codon 84-99 from the start of the core gene) in all seven adr subtype infected fulminant and severe exacerbation patients was found. A different segment with clustering substitutions (codon 48-60) was also found in seven of eight adw subtype infected fulminant and severe exacerbation patients. Of the 15 patients, 2 lacked precore stop mutation which was previously reported to be associated with fulminant hepatitis. These data suggest that these core regions with mutations may play an important role in the pathogenesis of hepatitis B viral disease, and such mutations are related to severe liver damage. PMID:8450049

  11. Complete nucleotide sequence analysis of the norovirus GII.4 Sydney variant in South Korea.

    PubMed

    Park, Ji-Sun; Lee, Sung-Geun; Jin, Ji-Young; Cho, Han-Gil; Jheong, Weon-Hwa; Paik, Soon-Young

    2015-01-01

    Norovirus is the primary cause of acute gastroenteritis in individuals of all ages. In Australia, a new strain of norovirus (GII.4) was identified in March 2012, and this strain has spread rapidly around the world. In August 2012, this new GII.4 strain was identified in patients in South Korea. Therefore, to examine the characteristics of the epidemic norovirus GII.4 2012 variant in South Korea, we conducted KM272334 full-length genomic analysis. The genome of the gg-12-08-04 strain consisted of 7,558 bp and contained three open reading frame (ORF) composites throughout the whole genome: ORF1 (5,100 bp), ORF2 (1,623 bp), and ORF3 (807 bp). Phylogenetic analyses showed that gg-12-08-04 belonged to the GII.4 Sydney 2012 variant, sharing 98.92% nucleotide similarity with this variant strain. According to SimPlot analysis, the gg-12-08-04 strain was a recombinant strain with breakpoint at the ORF1/2 junction between Osaka 2007 and Apeldoorn 2008 strains. This study is the first report of the complete sequence of the GII.4 Sydney 2012 strain in South Korea. Therefore, this may represent the standard sequence of the norovirus GII.4 2012 variant in South Korea and could therefore be useful for the development of norovirus vaccines.

  12. Cloning and nucleotide sequence of the hemA gene of Agrobacterium radiobacter.

    PubMed

    Drolet, M; Sasarman, A

    1991-04-01

    The hemA gene of Agrobacterium radiobacter ATCC4718 was identified by hybridization with a hemA probe from Rhizobium meliloti and cloned by complementation of a hemA mutant of Escherichia coli K12. E. coli hemA transformants carrying the hemA gene of Agrobacterium showed delta-aminolevulinic acid synthetase (delta-ALAS) activity in vitro. The hemA gene was carried on a 4.4 kb EcoRI fragment which could be reduced to a 2.6 kb EcoRI-SstI fragment without affecting its complementing or delta-ALAS activity. The sequence of the hemA gene showed an open reading frame of 1215 nucleotides, which could code for a protein of 44,361 Da. This is very close to the molecular weight of the HemA protein obtained using an in vitro coupled transcription-translation system (45,000 Da). Comparison of amino acid sequences of the delta-ALAS of A. radiobacter and Bradyrhizobium japonicum showed strong homology between the two enzymes; less, but still significant, homology was observed when A. radiobacter and human delta-ALAS were compared. Primer extension experiments enabled us to identify two promoters for the hemA gene of A. radiobacter. One of these promoters shows some similarity to the first promoter of the hemA gene of R. meliloti.
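
    The relation between the reported open reading frame length (1215 nucleotides) and the predicted protein mass (44,361 Da) can be checked with a back-of-the-envelope calculation. The sketch below assumes that the ORF length includes the stop codon and uses the common rule-of-thumb average of about 110 Da per amino acid residue; both are assumptions made here for illustration, not statements from the abstract.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # rule-of-thumb average mass, an assumption


def estimate_protein_mass(orf_length_nt, includes_stop_codon=True):
    """Return (residue count, approximate mass in Da) for an ORF length."""
    if orf_length_nt <= 0 or orf_length_nt % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    codons = orf_length_nt // 3
    # The stop codon, if counted in the ORF length, encodes no residue.
    residues = codons - 1 if includes_stop_codon else codons
    return residues, residues * AVERAGE_RESIDUE_MASS_DA


residues, mass_da = estimate_protein_mass(1215)
print(residues, mass_da)  # 404 residues, about 44,440 Da
```

    The estimate lands within about 1% of the 44,361 Da calculated from the actual sequence, which is as close as an average-residue approximation can be expected to get.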

  13. Characterization of Sri Lanka rabies virus isolates using nucleotide sequence analysis of nucleoprotein gene.

    PubMed

    Arai, Y T; Takahashi, H; Kameoka, Y; Shiino, T; Wimalaratne, O; Lodmell, D L

    2001-01-01

    Thirty-four suspected rabid brain samples from 2 humans, 24 dogs, 4 cats, 2 mongooses, 1 jackal and 1 water buffalo were collected in 1995-1996 in Sri Lanka. Total RNA was extracted directly from brain suspensions and examined using a one-step reverse transcription-polymerase chain reaction (RT-PCR) for the rabies virus nucleoprotein (N) gene. Twenty-eight samples were found positive for the virus N gene by RT-PCR and also for the virus antigens by fluorescent antibody (FA) test. Rabies virus isolates obtained from different animal species in different regions of Sri Lanka were genetically homogeneous. The sequences of the 203-nucleotide (nt) RT-PCR products obtained from 16 of 27 samples were identical. Sequences of 1350 nt of N genes of 14 RT-PCR products were determined. The Sri Lanka isolates under study formed a specific cluster that also included an earlier isolate from India but did not include the known isolates from China, Thailand, Malaysia, Israel, Iran, Oman, Saudi Arabia, Russia, Nepal, Philippines, Japan and from several other countries. These results suggest that one type of rabies virus is circulating among human, dog, cat, mongoose, jackal and water buffalo living near Colombo City and in five other remote regions in Sri Lanka.

  14. Escherichia coli gene purR encoding a repressor protein for purine nucleotide synthesis. Cloning, nucleotide sequence, and interaction with the purF operator.

    PubMed

    Rolfes, R J; Zalkin, H

    1988-12-25

    The Escherichia coli gene purR, encoding a repressor protein, was cloned by complementation of a purR mutation. Gene purR on a multicopy plasmid repressed expression of purF and purF-lacZ and reduced the growth rate of host cells by limiting the rate of de novo purine nucleotide synthesis. The level of a 1.3-kilobase purR mRNA was higher in cells grown with excess adenine, suggesting that synthesis of the repressor may be regulated. The chromosomal locus of purR was mapped to coordinate 1755-kb on the E. coli restriction map (Kohara, Y., Akiyama, K., and Isono, K. (1987) Cell 50, 495-508). Pur repressor bound specifically to purF operator DNA as determined by gel retardation and DNase I footprinting assays. The amino acid sequence of Pur repressor was derived from the nucleotide sequence. Pur repressor subunit contains 341 amino acids and has a calculated Mr of 38,179. Pur repressor is 31-35% identical with the galR and cytR repressors and 26% identical with the lacI repressor. These four repressors are likely homologous. Amino acid sequence similarity is greatest in an amino-terminal region presumed to contain a DNA-binding domain. A similarity is also noted in the operator sites for these repressors.

  15. Nucleotide sequences of genome segments S6, S7 and S10 of Dendrolimus punctatus cypovirus 1.

    PubMed

    Hong, J J; Duan, J L; Zhao, S L; Xu, H G; Peng, H Y

    2004-01-01

    The nucleotide sequences of genome segments S6, S7 and S10 of Dendrolimus punctatus cypovirus 1 Hunan I (DpCPV-HN(I)) and DpCPV-HN(I)-Se(3) (DpCPV-HN(I) passed three times in Spodoptera exigua) were determined. Segment S10 was 944 nucleotides in length and encoded a polyhedrin of 248 amino acids (28,439 Da). Only two nucleotide mutations were found between DpCPV-HN(I) S10 and DpCPV-HN(I)-Se3 S10, and the deduced amino acid sequences of the polyhedrin proteins were identical. Segment S7, 1,501 nucleotides, encoded a protein of 448 amino acids (approximately 50 kDa; p50). Thirty-one nucleotide mutations were found between DpCPV-HN(I) S7 and DpCPV-HN(I)-Se3 S7, but these resulted in only four amino acid changes. DpCPV-HN(I) S6 encoded a protein of 561 amino acids (63,688 Da; p64). The amino acid sequence of p64 had a high leucine content (10%), and contained a leucine zipper motif and one ATP/GTP-binding site motif.

  16. Aligning biological sequences on distributed bus networks: a divisible load scheduling approach.

    PubMed

    Min, Wong Han; Veeravalli, Bharadwaj

    2005-12-01

    In this paper, we design a multiprocessor strategy that exploits the computational characteristics of the algorithms used for biological sequence comparison proposed in the literature. We employ divisible load theory (DLT) that is suitable for handling large scale processing on network based systems. For the first time in the domain of DLT, the problem of aligning biological sequences is attempted. The objective is to minimize the total processing time of the alignment process. In designing our strategy, DLT facilitates a clever partitioning of the entire computation process involved in such a way that the overall time consumed for aligning the sequences is a minimum. The partitioning takes into account the computation speeds of the nodes and the underlying communication network. Since this is a real-life application, the post-processing phase becomes important, and hence we consider propagating the results back in order to generate an exact alignment. We consider several cases in our analysis such as deriving closed-form solutions for the processing time for heterogeneous, homogeneous, and networks with slow links. Further, we attempt to employ a multiinstallment strategy to distribute the tasks such that a higher degree of parallelism can be achieved. For slow networks, our strategy recommends near-optimal solutions. We derive an important condition to identify such cases and propose two heuristic strategies. Also, our strategy can be extended for multisequence alignment by utilizing a clustering strategy such as the Berger-Munson algorithm proposed in the literature. Finally, we use real-life DNA samples of house mouse mitochondrion (Mus Musculus Mitochondrion, NC_001569) consisting of 16,295 residues and the DNA of human mitochondrion (Homo Sapiens Mitochondrion, NC_001807) consisting of 16,571 residues, obtainable from the GenBank, in our rigorous simulation experiments to illustrate all the theoretical findings.
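
    The core idea of a divisible-load partition, that each node receives a share of the work sized so that all nodes finish together, can be sketched in a few lines. The version below is a deliberate simplification: it assumes zero communication cost and takes node speeds as plain work units per second, whereas the paper's strategy also accounts for the communication network. The function name and the example numbers are hypothetical.

```python
def split_load(total_units, speeds):
    """Split a divisible workload in proportion to node speeds.

    Simplifying assumption: communication delays are ignored, so the
    optimal split makes every node finish at the same time.
    """
    if not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds must be a non-empty list of positive numbers")
    total_speed = sum(speeds)
    # Each node's share is proportional to its speed.
    shares = [total_units * s / total_speed for s in speeds]
    # Time each node needs for its share; equal across nodes by construction.
    finish_times = [share / s for share, s in zip(shares, speeds)]
    return shares, finish_times


# Hypothetical example: 700 work units over nodes of relative speed 1, 2, 4.
shares, finish_times = split_load(700.0, [1.0, 2.0, 4.0])
print(shares)        # [100.0, 200.0, 400.0]
print(finish_times)  # [100.0, 100.0, 100.0]
```

    Once link delays are included, the shares are no longer simply proportional to speed, which is where the closed-form solutions and heuristics mentioned in the abstract come in.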

  17. Enzyme sequence similarity improves the reaction alignment method for cross-species pathway comparison

    SciTech Connect

    Ovacik, Meric A.; Androulakis, Ioannis P.

    2013-09-15

    Pathway-based information has become an important source of information for both establishing evolutionary relationships and understanding the mode of action of a chemical or pharmaceutical among species. Cross-species comparison of pathways can address two broad questions: comparison in order to inform evolutionary relationships and to extrapolate species differences used in a number of different applications including drug and toxicity testing. Cross-species comparison of metabolic pathways is complex as there are multiple features of a pathway that can be modeled and compared. Among the various methods that have been proposed, reaction alignment has emerged as the most successful at predicting phylogenetic relationships based on NCBI taxonomy. We propose an improvement of the reaction alignment method by accounting for sequence similarity in addition to reaction alignment method. Using nine species, including human and some model organisms and test species, we evaluate the standard and improved comparison methods by analyzing glycolysis and citrate cycle pathways conservation. In addition, we demonstrate how organism comparison can be conducted by accounting for the cumulative information retrieved from nine pathways in central metabolism as well as a more complete study involving 36 pathways common in all nine species. Our results indicate that reaction alignment with enzyme sequence similarity results in a more accurate representation of pathway specific cross-species similarities and differences based on NCBI taxonomy.

  18. FAMSA: Fast and accurate multiple sequence alignment of huge protein families

    PubMed Central

    Deorowicz, Sebastian; Debudaj-Grabysz, Agnieszka; Gudyś, Adam

    2016-01-01

    Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa. PMID:27670777
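
    The longest common subsequence (LCS) measure mentioned in the abstract can be illustrated with the textbook dynamic-programming recurrence. This is only a minimal sketch: FAMSA's own implementation is described as highly optimized and parallelized, and the normalization used in `lcs_similarity` below (dividing by the shorter sequence length) is an assumption chosen for illustration, not the formula used by the tool.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise keep the better of skipping one character.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Hypothetical normalization: LCS length over the shorter length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4 (the subsequence "GTAB")
print(lcs_similarity("ACGT", "AGT"))    # 1.0
```

    The two-row formulation keeps memory linear in the length of the second sequence, although the running time is still quadratic per pair.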

  19. DendroBLAST: approximate phylogenetic trees in the absence of multiple sequence alignments.

    PubMed

    Kelly, Steven; Maini, Philip K

    2013-01-01

    The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method differs from other methods by incorporating a simple model of sequence evolution to test the effect of introducing sequence changes on the reliability of the bipartitions in the inferred tree. Using realistic simulated sequence data we demonstrate that this method produces phylogenetic trees that are more accurate than other commonly-used distance based methods though not as accurate as maximum likelihood methods from good quality multiple sequence alignments. In addition to tests on simulated data, we use DendroBLAST to generate input trees for a supertree reconstruction of the phylogeny of the Archaea. This independent analysis produces an approximate phylogeny of the Archaea that has both high precision and recall when compared to previously published analysis of the same dataset using conventional methods. Taken together these results demonstrate that approximate phylogenetic trees can be produced in the absence of multiple sequence alignments, and we propose that these trees will provide a platform for improving and informing downstream bioinformatic analysis. A web implementation of the DendroBLAST method is freely available for use at http://www.dendroblast.com/.

  20. ClustalXeed: a GUI-based grid computation version for high performance and terabyte size multiple sequence alignment

    PubMed Central

    2010-01-01

    Background: There is an increasing demand to assemble and align large-scale biological sequence data sets. The commonly used multiple sequence alignment programs are still limited in their ability to handle very large amounts of sequences because the system lacks a scalable high-performance computing (HPC) environment with a greatly extended data storage capacity. Results: We designed ClustalXeed, a software system for multiple sequence alignment with incremental improvements over previous versions of the ClustalX and ClustalW-MPI software. The primary advantage of ClustalXeed over other multiple sequence alignment software is its ability to align a large family of protein or nucleic acid sequences. To solve the conventional memory-dependency problem, ClustalXeed uses both physical random access memory (RAM) and a distributed file-allocation system for distance matrix construction and pair-align computation. The computation efficiency of the disk-storage system was markedly improved by implementing an efficient load-balancing algorithm, called "idle node-seeking task algorithm" (INSTA). The new editing option and the graphical user interface (GUI) provide ready access to a parallel-computing environment for users who seek fast and easy alignment of large DNA and protein sequence sets. Conclusions: ClustalXeed can now compute a large volume of biological sequence data sets, which were not tractable in any other parallel or single MSA program. The main developments include: (1) the ability to tackle larger sequence alignment problems than possible with previous systems through markedly improved storage-handling capabilities; (2) an efficient task load-balancing algorithm, INSTA, which improves overall processing times for multiple sequence alignment with input sequences of non-uniform length; and (3) support for both single PC and distributed cluster systems. PMID:20849574

  1. Cloning, mutagenesis, and nucleotide sequence of a siderophore biosynthetic gene (amoA) from Aeromonas hydrophila.

    PubMed Central

    Barghouthi, S; Payne, S M; Arceneaux, J E; Byers, B R

    1991-01-01

    Many isolates of the Aeromonas species produce amonabactin, a phenolate siderophore containing 2,3-dihydroxybenzoic acid (2,3-DHB). An amonabactin biosynthetic gene (amoA) was identified (in a Sau3A1 gene library of Aeromonas hydrophila 495A2 chromosomal DNA) by its complementation of the requirement of Escherichia coli SAB11 for exogenous 2,3-DHB to support siderophore (enterobactin) synthesis. The gene amoA was subcloned as a SalI-HindIII 3.4-kb DNA fragment into pSUP202, and the complete nucleotide sequence of amoA was determined. A putative iron-regulatory sequence resembling the Fur repressor protein-binding site overlapped a possible promoter region. A translational reading frame, beginning with valine and encoding 396 amino acids, was open for 1,188 bp. The C-terminal portion of the deduced amino acid sequence showed 58% identity and 79% similarity with the E. coli EntC protein (isochorismate synthetase), the first enzyme in the E. coli 2,3-DHB biosynthetic pathway, suggesting that amoA probably encodes a step in 2,3-DHB biosynthesis and is the A. hydrophila equivalent of the E. coli entC gene. An isogenic amonabactin-negative mutant, A. hydrophila SB22, was isolated after marker exchange mutagenesis with Tn5-inactivated amoA (amoA::Tn5). The mutant excreted neither 2,3-DHB nor amonabactin, was more sensitive than the wild-type to growth inhibition by iron restriction, and used amonabactin to overcome iron starvation. PMID:1830579

  2. A Convex Atomic-Norm Approach to Multiple Sequence Alignment and Motif Discovery

    PubMed Central

    Yen, Ian E. H.; Lin, Xin; Zhang, Jiong; Ravikumar, Pradeep; Dhillon, Inderjit S.

    2016-01-01

    Multiple Sequence Alignment and Motif Discovery, both known to be NP-hard problems, are two fundamental tasks in Bioinformatics. Existing approaches to these two problems are based on either local search methods, such as Expectation Maximization (EM) and Gibbs Sampling, or greedy heuristic methods. In this work, we develop a convex relaxation approach to both problems based on the recent concept of atomic norm and develop a new algorithm, termed Greedy Direction Method of Multiplier, for solving the convex relaxation with two convex atomic constraints. Experiments show that our convex relaxation approach produces solutions of higher quality than the standard tools widely used in the Bioinformatics community on the Multiple Sequence Alignment and Motif Discovery problems. PMID:27559428

  3. KMAD: knowledge-based multiple sequence alignment for intrinsically disordered proteins

    PubMed Central

    Lange, Joanna; Wyrwicz, Lucjan S.; Vriend, Gert

    2016-01-01

    Summary: Intrinsically disordered proteins (IDPs) lack tertiary structure and thus differ from globular proteins in terms of their sequence–structure–function relations. IDPs have lower sequence conservation, different types of active sites and a different distribution of functionally important regions, which altogether make their multiple sequence alignment (MSA) difficult. The KMAD MSA software has been written specifically for the alignment and annotation of IDPs. It augments the substitution matrix with knowledge about post-translational modifications, functional domains and short linear motifs. Results: MSAs produced with KMAD describe well-conserved features among IDPs, tend to agree well with biological intuition, and are a good basis for designing new experiments to shed light on this large, understudied class of proteins. Availability and implementation: KMAD web server is accessible at http://www.cmbi.ru.nl/kmad/. A standalone version is freely available. Contact: vriend@cmbi.ru.nl PMID:26568635

  4. seqphase: a web tool for interconverting phase input/output files and fasta sequence alignments.

    PubMed

    Flot, J-F

    2010-01-01

    The program phase is widely used for Bayesian inference of haplotypes from diploid genotypes; however, manually creating phase input files from sequence alignments is an error-prone and time-consuming process, especially when dealing with numerous variable sites and/or individuals. Here, a web tool called seqphase is presented that generates phase input files from fasta sequence alignments and converts phase output files back into fasta. During the production of the phase input file, several consistency checks are performed on the dataset and suitable command line options to be used for the actual phase data analysis are suggested. seqphase was written in perl and is freely accessible over the Internet at the address http://www.mnhn.fr/jfflot/seqphase.
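
    The first step of such a conversion, locating the variable sites in a FASTA alignment, might look like the sketch below. It is not seqphase's Perl code and it does not write a phase input file; the function names and the equal-length consistency check are assumptions chosen for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def variable_sites(records):
    """Return 0-based alignment columns in which more than one symbol occurs."""
    seqs = list(records.values())
    # Simple consistency check: aligned sequences must share one length.
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences in an alignment must have equal length")
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]
```

    For example, an alignment of ACGT, ACTT and AGGT has variable sites at columns 1 and 2; only those columns would need to be carried into a haplotype-inference input.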

  5. rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison

    PubMed Central

    Hahn, Lars; Leimeister, Chris-André; Morgenstern, Burkhard

    2016-01-01

    Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don’t-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/ PMID:27760124
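
    The idea of a binary pattern with match and don't-care positions can be made concrete with a small sketch. It only extracts and counts spaced words for one fixed pattern; it does not implement rasbhari's hill-climbing optimisation, and the pattern encoding ("1" for match, "0" for don't-care) and function names are assumptions.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Return the spaced words of seq under a binary pattern.

    pattern is a string such as "1101": "1" marks a match position,
    "0" a don't-care position that is skipped.
    """
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return [
        "".join(seq[start + i] for i in match_positions)
        for start in range(len(seq) - span + 1)
    ]


def count_shared_spaced_words(seq_a, seq_b, pattern):
    """Count pairs of windows (one per sequence) that agree at all match positions."""
    counts_a = Counter(spaced_words(seq_a, pattern))
    counts_b = Counter(spaced_words(seq_b, pattern))
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

    With the pattern "101", the windows ACG and AGG both yield the spaced word AG, so a mismatch at the don't-care position does not prevent a hit. How often such hits occur, and how much that number varies, depends on the pattern set, which is the quantity the abstract says rasbhari optimises.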

  6. Complete Nucleotide Sequences and Genome Organization of Two Pepper Mild Mottle Virus Isolates from Capsicum annuum in South Korea.

    PubMed

    Choi, Seung-Kook; Choi, Gug-Seoun; Kwon, Sun-Jung; Yoon, Ju-Yeon

    2016-05-19

    The complete genome sequences of pepper mild mottle virus (PMMoV)-P2 and -P3 were determined by the Sanger sequencing method. Although PMMoV-P2 and PMMoV-P3 have different pathogenicity in some pepper cultivars, the complete genome sequences of PMMoV-P2 and -P3 are composed of 6,356 nucleotides (nt). In this study, we report the complete genome sequences and genome organization of PMMoV-P2 and -P3 isolates from pepper species in South Korea.

  7. Complete Nucleotide Sequences and Genome Organization of Two Pepper Mild Mottle Virus Isolates from Capsicum annuum in South Korea

    PubMed Central

    Choi, Seung-Kook; Choi, Gug-Seoun; Kwon, Sun-Jung

    2016-01-01

    The complete genome sequences of pepper mild mottle virus (PMMoV)-P2 and -P3 were determined by the Sanger sequencing method. Although PMMoV-P2 and PMMoV-P3 have different pathogenicity in some pepper cultivars, the complete genome sequences of PMMoV-P2 and -P3 are composed of 6,356 nucleotides (nt). In this study, we report the complete genome sequences and genome organization of PMMoV-P2 and -P3 isolates from pepper species in South Korea. PMID:27198033

  8. [Polymorphism of DNA nucleotide sequence as a source of enhancement of the discrimination potential of the STR-markers].

    PubMed

    Zemskova, E Yu; Timoshenko, T V; Leonov, S N; Ivanov, P L

    2016-01-01

    The objective of the present pilot investigation was to reveal and to study polymorphism of nucleotide sequence in the alleles of STR loci of human autosomal DNA with special reference to the role of this phenomenon as a source of the differences between homonymous allelic variants. The secondary objective was to evaluate the possibility of using the data thus obtained for the enhancement of the informative value of the forensic medical genotyping of STR loci by means of identification of single nucleotide polymorphisms (SNP) for the purpose of extending their allelic spectrum. The methodological basis of the study was constituted by the comprehensive amplified fragment length polymorphism (AFLP) analysis and amplified fragment sequence polymorphisms (AFSP) analysis of DNA with the use of the PLEX-ID™ analytical mass-spectrometry platform (Abbott Molecular, USA). The study has demonstrated that polymorphism of DNA nucleotide sequence can be regarded as the possible source of enhancement of the discriminating potential of STR markers. It means that the analysis of polymorphism of DNA nucleotide sequence for genotyping AFLP-type markers of chromosomal DNA can considerably increase the effectiveness of their application as individualizing markers for the purpose of molecular genetic examinations.

  9. A high-density simple sequence repeat and single nucleotide polymorphism genetic map of the tetraploid cotton genome

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Cotton genome complexity was investigated with a saturated molecular genetic map that combined several sets of microsatellites or simple sequence repeats (SSR) and the first major public set of single nucleotide polymorphism (SNP) markers in cotton genomes (Gossypium spp.), and that was constructed ...

  10. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  11. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  12. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  13. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  14. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  15. [Polymorphism of DNA nucleotide sequence as a source of enhancement of the discrimination potential of the STR-markers].

    PubMed

    Zemskova, E Yu; Timoshenko, T V; Leonov, S N; Ivanov, P L

    2016-01-01

    The objective of the present pilot investigation was to reveal and to study polymorphism of nucleotide sequence in the alleles of STR loci of human autosomal DNA with special reference to the role of this phenomenon as a source of the differences between homonymous allelic variants. The secondary objective was to evaluate the possibility of using the data thus obtained for the enhancement of the informative value of the forensic medical genotyping of STR loci by means of identification of single nucleotide polymorphisms (SNP) for the purpose of extending their allelic spectrum. The methodological basis of the study was constituted by the comprehensive amplified fragment length polymorphism (AFLP) analysis and amplified fragment sequence polymorphisms (AFSP) analysis of DNA with the use of the PLEX-ID™ analytical mass-spectrometry platform (Abbott Molecular, USA). The study has demonstrated that polymorphism of DNA nucleotide sequence can be regarded as the possible source of enhancement of the discriminating potential of STR markers. It means that the analysis of polymorphism of DNA nucleotide sequence for genotyping AFLP-type markers of chromosomal DNA can considerably increase the effectiveness of their application as individualizing markers for the purpose of molecular genetic examinations. PMID:27500481

  16. Molecular cloning and nucleotide sequence of a transforming gene detected by transfection of chicken B-cell lymphoma DNA

    NASA Astrophysics Data System (ADS)

    Goubin, Gerard; Goldman, Debra S.; Luce, Judith; Neiman, Paul E.; Cooper, Geoffrey M.

    1983-03-01

    A transforming gene detected by transfection of chicken B-cell lymphoma DNA has been isolated by molecular cloning. It is homologous to a conserved family of sequences present in normal chicken and human DNAs but is not related to transforming genes of acutely transforming retroviruses. The nucleotide sequence of the cloned transforming gene suggests that it encodes a protein that is partially homologous to the amino terminus of transferrin and related proteins although only about one tenth the size of transferrin.

  17. Single nucleotide polymorphisms in the IS900 sequence of Mycobacterium avium subsp. paratuberculosis are strain type specific.

    PubMed

    Castellanos, Elena; Aranaz, Alicia; de Juan, Lucia; Alvarez, Julio; Rodríguez, Sabrina; Romero, Beatriz; Bezos, Javier; Stevenson, Karen; Mateos, Ana; Domínguez, Lucas

    2009-07-01

    Insertion sequence IS900 is used as a target for the identification of Mycobacterium avium subsp. paratuberculosis. Previous reports have revealed single nucleotide polymorphisms within IS900. This study, which analyzed the IS900 sequences of a panel of isolates representing M. avium subsp. paratuberculosis strain types I, II, and III, revealed conserved type-specific polymorphisms that could be utilized as a tool for diagnostic and epidemiological purposes.

  18. JAR3D Webserver: Scoring and aligning RNA loop sequences to known 3D motifs

    PubMed Central

    Roll, James; Zirbel, Craig L.; Sweeney, Blake; Petrov, Anton I.; Leontis, Neocles

    2016-01-01

    Many non-coding RNAs have been identified and may function by forming 2D and 3D structures. RNA hairpin and internal loops are often represented as unstructured on secondary structure diagrams, but RNA 3D structures show that most such loops are structured by non-Watson–Crick basepairs and base stacking. Moreover, different RNA sequences can form the same RNA 3D motif. JAR3D finds possible 3D geometries for hairpin and internal loops by matching loop sequences to motif groups from the RNA 3D Motif Atlas, by exact sequence match when possible, and by probabilistic scoring and edit distance for novel sequences. The scoring gauges the ability of the sequences to form the same pattern of interactions observed in 3D structures of the motif. The JAR3D webserver at http://rna.bgsu.edu/jar3d/ takes one or many sequences of a single loop as input, or else one or many sequences of longer RNAs with multiple loops. Each sequence is scored against all current motif groups. The output shows the ten best-matching motif groups. Users can align input sequences to each of the motif groups found by JAR3D. JAR3D will be updated with every release of the RNA 3D Motif Atlas, and so its performance is expected to improve over time. PMID:27235417
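
    The edit-distance component mentioned for novel sequences can be illustrated with a standard Levenshtein sketch. The abstract does not give JAR3D's scoring details, so the unit costs, the function names and the nearest-sequence helper below are assumptions and not the webserver's actual method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for substitution, insertion and deletion."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def closest_known_sequence(query, known_sequences):
    """Hypothetical helper: the known loop sequence nearest to the query."""
    return min(known_sequences, key=lambda s: edit_distance(query, s))
```

    A loop sequence that has never been observed in a 3D structure could be compared in this way with the sequences already assigned to a motif group, with a smaller distance suggesting a more plausible match.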

  19. Nucleotide sequence of a crustacean 18S ribosomal RNA gene and secondary structure of eukaryotic small subunit ribosomal RNAs.

    PubMed

    Nelles, L; Fang, B L; Volckaert, G; Vandenberghe, A; De Wachter, R

    1984-12-11

    The primary structure of the gene for 18 S rRNA of the crustacean Artemia salina was determined. The sequence has been aligned with 13 other small ribosomal subunit RNA sequences of eukaryotic, archaebacterial, eubacterial, chloroplastic and plant mitochondrial origin. Secondary structure models for these RNAs were derived on the basis of previously proposed models and additional comparative evidence found in the alignment. Although there is a general similarity in the secondary structure models for eukaryotes and prokaryotes, the evidence seems to indicate a different topology in a central area of the structures.

  20. Genomic signal processing methods for computation of alignment-free distances from DNA sequences.

    PubMed

    Borrayo, Ernesto; Mendizabal-Ruiz, E Gerardo; Vélez-Pérez, Hugo; Romo-Vázquez, Rebeca; Mendizabal, Adriana P; Morales, J Alejandro

    2014-01-01

    Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments.
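
    A doublet-based sequence-to-signal mapping can be sketched as follows. The abstract does not state the amplitude assigned to each doublet or whether doublets overlap, so the values 0 to 15 in lexicographic order, the overlapping windows and the function name are all hypothetical choices and not GAFD's actual mapping.

```python
from itertools import product

# Hypothetical amplitude table: the 16 doublets AA, AC, ..., TT mapped to 0..15.
DOUBLET_VALUE = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=2))}


def doublet_signal(seq):
    """Map a DNA sequence to a numeric signal, one sample per overlapping doublet.

    Raises KeyError for symbols outside A, C, G, T (for example N).
    """
    seq = seq.upper()
    return [DOUBLET_VALUE[seq[i:i + 2]] for i in range(len(seq) - 1)]
```

    Compared with a single-nucleotide mapping, which allows only four amplitude values, a doublet mapping allows sixteen, which is the increase in amplitude resolution the abstract refers to.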

  1. Genomic Signal Processing Methods for Computation of Alignment-Free Distances from DNA Sequences

    PubMed Central

    Borrayo, Ernesto; Mendizabal-Ruiz, E. Gerardo; Vélez-Pérez, Hugo; Romo-Vázquez, Rebeca; Mendizabal, Adriana P.; Morales, J. Alejandro

    2014-01-01

    Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments. PMID:25393409

  2. A single nucleotide polymorphism and sequence analysis of CSN1S1 gene promoter region in Chinese Bos grunniens (yak).

    PubMed

    Bai, W L; Yin, R H; Dou, Q L; Yang, J C; Zhao, S J; Ma, Z J; Yin, R L; Luo, G B; Zhao, Z H

    2010-01-01

    The aim of this study was to investigate the polymorphism of the CSN1S1 gene promoter region in 4 Chinese yak breeds, and compare the yak CSN1S1 gene promoter region sequences with other ruminants. A Polymerase Chain Reaction-Single Strand Conformation Polymorphism protocol was developed for rapid genotyping of the yak CSN1S1 gene. One hundred fifty-eight animals from 4 Chinese yak breeds were genotyped at the CSN1S1 locus using the protocol developed. A single nucleotide polymorphism of the CSN1S1 gene promoter region has been identified in all yak breeds investigated. The polymorphism consists of a single nucleotide substitution G→A at position 386 of the CSN1S1 gene promoter region, resulting in two alleles named, respectively, G(386) and A(386), based on the nucleotide at position 386. The allele G(386) was found to be more common in the animals investigated. The corresponding nucleotide sequences in GenBank of yak (having the same nucleotides as allele G(386) in this study), bovine, water buffalo, sheep, and goat had similarity of 99.68%, 99.35%, 97.42%, 95.14%, and 94.19%, respectively, with the yak allele A(386).
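
    Similarity percentages of this kind are typically derived from pairwise alignments. The abstract does not say how its values were computed, so the sketch below shows only one common convention, percent identity over columns where neither sequence has a gap; the function name and the gap handling are assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two aligned sequences, ignoring columns with a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(aligned_a, aligned_b)
                if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for x, y in compared if x == y)
    return 100.0 * matches / len(compared)
```

    Under this convention a single substitution, such as the difference between two alleles at one position, lowers the identity by 100 divided by the number of compared columns.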

  3. The Coding of Biological Information: From Nucleotide Sequence to Protein Recognition

    NASA Astrophysics Data System (ADS)

    Štambuk, Nikola

    The paper reviews the classic results of Swanson, Dayhoff, Grantham, Blalock and Root-Bernstein, which link genetic code nucleotide patterns to the protein structure, evolution and molecular recognition. Symbolic representation of the binary addresses defining particular nucleotide and amino acid properties is discussed, with consideration of: structure and metric of the code, direct correspondence between amino acid and nucleotide information, and molecular recognition of the interacting protein motifs coded by the complementary DNA and RNA strands.

  4. Isolation of a family of resistance gene analogue sequences of the nucleotide binding site (NBS) type from Lens species.

    PubMed

    Yaish, M W F; Sáenz de Miera, L E; Pérez de la Vega, M

    2004-08-01

    Most known plant disease-resistance genes (R genes) include in their encoded products domains such as a nucleotide-binding site (NBS) or leucine-rich repeats (LRRs). Sequences with unknown function, but encoding these conserved domains, have been defined as resistance gene analogues (RGAs). The conserved motifs within plant NBS domains make it possible to use degenerate primers and PCR to isolate RGAs. We used degenerate primers deduced from conserved motifs in the NBS domain of NBS-LRR resistance proteins to amplify genomic sequences from Lens species. Fragments from approximately 500-850 bp were obtained. The nucleotide sequence analysis of these fragments revealed 32 different RGA sequences in Lens species with a high similarity (up to 91%) to RGAs from other plants. The predicted amino acid sequences showed that lentil sequences contain all the conserved motifs (P-loop, kinase-2, kinase-3a, GLPL, and MHD) present in the majority of other known plant NBS-LRR resistance genes. Phylogenetic analyses grouped the Lens NBS sequences with the Toll and interleukin-1 receptor (TIR) subclass of NBS-LRR genes, as well as with RGA sequences isolated from other legume species. Using inverse PCR on one putative RGA of lentil, we were able to amplify the flanking regions of this sequence, which contained features found in R proteins.

  5. Structural organization, nucleotide sequence, and regulation of the Haemophilus influenzae rec-1+ gene.

    PubMed Central

    Zulty, J J; Barcak, G J

    1993-01-01

    The Haemophilus influenzae rec-1+ protein plays a central role in DNA metabolism, participating in general homologous recombination, recombinational (postreplication) DNA repair, and prophage induction. Although many H. influenzae rec-1 mutants have been phenotypically characterized, little is known about the rec-1+ gene at the molecular level. In this study, we present the genetic organization of the rec-1+ locus, the DNA sequence of rec-1+, and studies of the transcriptional regulation of rec-1+ during cellular assault by DNA-damaging agents and during the induction of competence for genetic transformation. Although little is known about promoter structure in H. influenzae, we identified a potential rec-1+ promoter that is identical in 11 of 12 positions to the bacterial sigma 70-dependent promoter consensus sequence. Results from a primer extension analysis revealed that the start site of rec-1+ transcription is centered 6 nucleotides downstream of this promoter. We identified potential DNA binding sites in the rec-1+ gene for LexA, integration host factor, and cyclic AMP receptor protein. We obtained evidence that at least one of the proposed cyclic AMP receptor protein binding sites is active in modulating rec-1+ transcription. This finding makes rec-1+ control circuitry novel among recA+ homologs. Two H. influenzae DNA uptake sequences that may function as a transcription termination signal were identified in inverted orientations at the end of the rec-1+ coding sequence. In addition, we report the first use of the Escherichia coli lacZ operon fusion technique in H. influenzae to study the transcriptional control of rec-1+. Our results indicate that rec-1+ is transcriptionally induced about threefold during DNA-damaging events. Furthermore, we show that rec-1+ can substitute for recA+ in E. coli to modulate SOS induction of dinB1 expression. Surprisingly, although 5% of the H. influenzae genome is in the form of single-stranded DNA during competence for

  6. Multiple Amino Acid Sequence Alignment Nitrogenase Component 1: Insights into Phylogenetics and Structure-Function Relationships

    PubMed Central

    Howard, James B.; Kechris, Katerina J.; Rees, Douglas C.; Glazer, Alexander N.

    2013-01-01

    Amino acid residues critical for a protein's structure-function are retained by natural selection and these residues are identified by the level of variance in co-aligned homologous protein sequences. The relevant residues in the nitrogen fixation Component 1 α- and β-subunits were identified by the alignment of 95 protein sequences. Proteins were included from species encompassing multiple microbial phyla and diverse ecological niches as well as the nitrogen fixation genotypes, anf, nif, and vnf, which encode proteins associated with cofactors differing at one metal site. After adjusting for differences in sequence length, insertions, and deletions, the remaining >85% of the sequence co-aligned the subunits from the three genotypes. Six Groups, designated Anf, Vnf, and Nif I-IV, were assigned based upon genetic origin, sequence adjustments, and conserved residues. Both subunits subdivided into the same groups. Invariant and single variant residues were identified and were defined as “core” for nitrogenase function. Three species in Group Nif-III, Candidatus Desulforudis audaxviator, Desulfotomaculum kuznetsovii, and Thermodesulfatator indicus, were found to have a seleno-cysteine that replaces one cysteinyl ligand of the 8Fe:7S, P-cluster. Subsets of invariant residues, limited to individual groups, were identified; these unique residues help identify the gene of origin (anf, nif, or vnf) yet should not be considered diagnostic of the metal content of associated cofactors. Fourteen of the 19 residues that compose the cofactor pocket are invariant or single variant; the other five residues are highly variable but do not correlate with the putative metal content of the cofactor. The variable residues are clustered on one side of the cofactor, away from other functional centers in the three dimensional structure. Many of the invariant and single variant residues were not previously recognized as potentially critical and their identification provides the bases

  7. Proteus mirabilis MR/P fimbrial operon: genetic organization, nucleotide sequence, and conditions for expression.

    PubMed Central

    Bahrani, F K; Mobley, H L

    1994-01-01

    Proteus mirabilis, an agent of urinary tract infection, expresses at least four fimbrial types. Among these are the MR/P (mannose-resistant/Proteus-like) fimbriae. MrpA, the structural subunit, is optimally expressed at 37 °C in Luria broth cultured statically for 48 h by each of seven strains examined. Genes encoding this fimbria were isolated, and the complete nucleotide sequence was determined. The mrp gene cluster encoded by 7,293 bp predicts eight polypeptides: MrpI (22,133 Da), MrpA (17,909 Da), MrpB (19,632 Da), MrpC (96,823 Da), MrpD (27,886 Da), MrpE (19,470 Da), MrpF (17,363 Da), and MrpG (13,169 Da). mrpI is upstream of the gene encoding the major structural subunit gene mrpA and is transcribed in the direction opposite to that of the rest of the operon. All predicted polypeptides share ≥25% amino acid identity with at least one other enteric fimbrial gene product encoded by the pap, fim, smf, fan, or mrk gene clusters. PMID:7910820

  8. Nucleotide sequence and mutational analysis of the vnfENX region of Azotobacter vinelandii.

    PubMed

    Wolfinger, E D; Bishop, P E

    1991-12-01

    The nucleotide sequence (3,600 bp) of a second copy of nifENX-like genes in Azotobacter vinelandii has been determined. These genes are located immediately downstream from vnfA and have been designated vnfENX. The vnfENX genes appear to be organized as a single transcriptional unit that is preceded by a potential RpoN-dependent promoter. While the nifEN genes are thought to be evolutionarily related to nifDK, the vnfEN genes appear to be more closely related to nifEN than to either nifDK, vnfDK, or anfDK. Mutant strains (CA47 and CA48) carrying insertions in vnfE and vnfN, respectively, are able to grow diazotrophically in molybdenum (Mo)-deficient medium containing vanadium (V) (Vnf+) and in medium lacking both Mo and V (Anf+). However, a double mutant (strain DJ42.48) which contains a nifEN deletion and an insertion in vnfE is unable to grow diazotrophically in Mo-sufficient medium or in Mo-deficient medium with or without V. This suggests that NifE and NifN substitute for VnfE and VnfN when the vnfEN genes are mutationally inactivated. AnfA is not required for the expression of a vnfN-lacZ transcriptional fusion, even though this fusion is expressed under Mo- and V-deficient diazotrophic growth conditions.

  9. Whole-genome sequencing identifies genomic heterogeneity at a nucleotide and chromosomal level in bladder cancer

    PubMed Central

    Morrison, Carl D.; Liu, Pengyuan; Woloszynska-Read, Anna; Zhang, Jianmin; Luo, Wei; Qin, Maochun; Bshara, Wiam; Conroy, Jeffrey M.; Sabatini, Linda; Vedell, Peter; Xiong, Donghai; Liu, Song; Wang, Jianmin; Shen, He; Li, Yinwei; Omilian, Angela R.; Hill, Annette; Head, Karen; Guru, Khurshid; Kunnev, Dimiter; Leach, Robert; Eng, Kevin H.; Darlak, Christopher; Hoeflich, Christopher; Veeranki, Srividya; Glenn, Sean; You, Ming; Pruitt, Steven C.; Johnson, Candace S.; Trump, Donald L.

    2014-01-01

    Using complete genome analysis, we sequenced five bladder tumors accrued from patients with muscle-invasive transitional cell carcinoma of the urinary bladder (TCC-UB) and identified a spectrum of genomic aberrations. In three tumors, complex genotype changes were noted. All three had tumor protein p53 mutations and a relatively large number of single-nucleotide variants (SNVs; average of 11.2 per megabase), structural variants (SVs; average of 46), or both. This group was best characterized by chromothripsis and the presence of subclonal populations of neoplastic cells or intratumoral mutational heterogeneity. Here, we provide evidence that the process of chromothripsis in TCC-UB is mediated by nonhomologous end-joining using kilobase, rather than megabase, fragments of DNA, which we refer to as “stitchers,” to repair this process. We postulate that a potential unifying theme among tumors with the more complex genotype group is a defective replication–licensing complex. A second group (two bladder tumors) had no chromothripsis, a simpler genotype, WT tumor protein p53, relatively few SNVs (average of 5.9 per megabase), and only a single SV. There was no evidence of a subclonal population of neoplastic cells. In this group, we used a preclinical model of bladder carcinoma cell lines to study a unique SV (translocation and amplification) of the gene glutamate receptor ionotropic N-methyl D-aspartate as a potential new therapeutic target in bladder cancer. PMID:24469795

  10. Isolation and nucleotide sequencing of lactose carrier mutants that transport maltose.

    PubMed Central

    Brooker, R J; Wilson, T H

    1985-01-01

    The wild-type lactose carrier of Escherichia coli has a poor ability to transport the disaccharide maltose. However, it is possible to select lactose carrier mutants that have an enhanced ability to transport maltose by growing E. coli cells on maltose minimal plates in the presence of isopropyl thiogalactoside (an inducer of the lac operon). We have utilized this approach to isolate 18 independent lactose permease mutants that transport maltose. The relevant DNA sequences have been determined, and all of the mutations were found to be single base pair changes either at triplet 177 or at triplet 236. The nucleotide changes replace alanine-177 with valine or threonine, or tyrosine-236 with phenylalanine, asparagine, serine, or histidine. Transport experiments indicate that all of the mutants have faster maltose transport compared with the wild-type strain. Position 177 mutants retain the ability to transport galactosides, such as lactose and melibiose, at rates similar to the rate of the wild-type strain. In contrast, the position 236 mutants are markedly defective in the ability to transport galactosides. With regard to secondary structure, alanine-177 and tyrosine-236 are located on adjacent hydrophobic segments of the lactose carrier that are predicted to span the membrane. Thus, the results of this study indicate that the substrate recognition site of the lactose carrier is located within the plane of the lipid bilayer. In addition, a tertiary structure model is proposed that suggests how certain transmembrane segments might be localized relative to one another. PMID:3889919

  11. Whole-genome sequencing identifies genomic heterogeneity at a nucleotide and chromosomal level in bladder cancer.

    PubMed

    Morrison, Carl D; Liu, Pengyuan; Woloszynska-Read, Anna; Zhang, Jianmin; Luo, Wei; Qin, Maochun; Bshara, Wiam; Conroy, Jeffrey M; Sabatini, Linda; Vedell, Peter; Xiong, Donghai; Liu, Song; Wang, Jianmin; Shen, He; Li, Yinwei; Omilian, Angela R; Hill, Annette; Head, Karen; Guru, Khurshid; Kunnev, Dimiter; Leach, Robert; Eng, Kevin H; Darlak, Christopher; Hoeflich, Christopher; Veeranki, Srividya; Glenn, Sean; You, Ming; Pruitt, Steven C; Johnson, Candace S; Trump, Donald L

    2014-02-11

    Using complete genome analysis, we sequenced five bladder tumors accrued from patients with muscle-invasive transitional cell carcinoma of the urinary bladder (TCC-UB) and identified a spectrum of genomic aberrations. In three tumors, complex genotype changes were noted. All three had tumor protein p53 mutations and a relatively large number of single-nucleotide variants (SNVs; average of 11.2 per megabase), structural variants (SVs; average of 46), or both. This group was best characterized by chromothripsis and the presence of subclonal populations of neoplastic cells or intratumoral mutational heterogeneity. Here, we provide evidence that the process of chromothripsis in TCC-UB is mediated by nonhomologous end-joining using kilobase, rather than megabase, fragments of DNA, which we refer to as "stitchers," to repair this process. We postulate that a potential unifying theme among tumors with the more complex genotype group is a defective replication-licensing complex. A second group (two bladder tumors) had no chromothripsis, a simpler genotype, WT tumor protein p53, relatively few SNVs (average of 5.9 per megabase), and only a single SV. There was no evidence of a subclonal population of neoplastic cells. In this group, we used a preclinical model of bladder carcinoma cell lines to study a unique SV (translocation and amplification) of the gene glutamate receptor ionotropic N-methyl D-aspartate as a potential new therapeutic target in bladder cancer.

  12. Associations of single nucleotide polymorphisms in the Pygo2 coding sequence with idiopathic oligospermia and azoospermia.

    PubMed

    Ge, S-Q; Grifin, J; Liu, L-H; Aston, K I; Simon, L; Jenkins, T G; Emery, B R; Carrell, D T

    2015-08-07

    Male infertility is often associated with a decreased sperm count. The Pygo2 gene is expressed in the elongating spermatid during chromatin remodeling; thus impairment in PYGO2 function might lead to spermatogenic arrest, sperm count reduction, and subsequent infertility. The aim of this study was to identify mutations in Pygo2 that might lead to idiopathic oligospermia and azoospermia. DNA was isolated from venous blood from 77 men with normal fertility and 195 men with idiopathic oligospermia or azoospermia. Polymerase chain reaction-sequencing analysis was performed for the three Pygo2 coding regions. Non-synonymous single nucleotide polymorphisms (SNPs) were detected and analyzed using the SIFT, Polyphen-2, and Mutation Taster software to identify possible changes in protein structure that could affect phenotype. Pygo2 sequencing was successful for 178 patients (30 with mild or moderate oligospermia, 57 with severe oligospermia, and 91 with azoospermia). Three previously reported non-synonymous SNPs were identified in patients with azoospermia or severe oligospermia but not in those with mild or moderate oligozoospermia or normozoospermia. SNPs rs61758740 (M141I) and rs141722381 (N240I) cause the replacement of one hydrophobic or hydrophilic amino acid, respectively, with another, and SNP rs61758741 (K261E) causes the replacement of a basic amino acid with an acidic one. The software predictions demonstrated that SNP rs141722381 would likely result in disrupted tertiary protein structure and thus could be involved in disease pathogenesis. Overall, this study demonstrated that SNPs in the coding region of Pygo2 might be one of the causative factors in idiopathic oligospermia and azoospermia, resulting in male infertility.

  13. Postzygotic single-nucleotide mosaicisms in whole-genome sequences of clinically unremarkable individuals

    PubMed Central

    Huang, August Y; Xu, Xiaojing; Ye, Adam Y; Wu, Qixi; Yan, Linlin; Zhao, Boxun; Yang, Xiaoxu; He, Yao; Wang, Sheng; Zhang, Zheng; Gu, Bowen; Zhao, Han-Qing; Wang, Meng; Gao, Hua; Gao, Ge; Zhang, Zhichao; Yang, Xiaoling; Wu, Xiru; Zhang, Yuehua; Wei, Liping

    2014-01-01

    Postzygotic single-nucleotide mutations (pSNMs) have been studied in cancer and a few other overgrowth human disorders at whole-genome scale and found to play critical roles. However, in clinically unremarkable individuals, pSNMs have never been identified at whole-genome scale largely due to technical difficulties and lack of matched control tissue samples, and thus the genome-wide characteristics of pSNMs remain unknown. We developed a new Bayesian-based mosaic genotyper and a series of effective error filters, using which we were able to identify 17 SNM sites from ∼80× whole-genome sequencing of peripheral blood DNAs from three clinically unremarkable adults. The pSNMs were thoroughly validated using pyrosequencing, Sanger sequencing of individual cloned fragments, and multiplex ligation-dependent probe amplification. The mutant allele fraction ranged from 5%-31%. We found that C→T and C→A were the predominant types of postzygotic mutations, similar to the somatic mutation profile in tumor tissues. Simulation data showed that the overall mutation rate was an order of magnitude lower than that in cancer. We detected varied allele fractions of the pSNMs among multiple samples obtained from the same individuals, including blood, saliva, hair follicle, buccal mucosa, urine, and semen samples, indicating that pSNMs could affect multiple sources of somatic cells as well as germ cells. Two of the adults have children who were diagnosed with Dravet syndrome. We identified two non-synonymous pSNMs in SCN1A, a causal gene for Dravet syndrome, from these two unrelated adults and found that the mutant alleles were transmitted to their children, highlighting the clinical importance of detecting pSNMs in genetic counseling. PMID:25312340

  14. Quantitative theory of entropic forces acting on constrained nucleotide sequences applied to viruses.

    PubMed

    Greenbaum, Benjamin D; Cocco, Simona; Levine, Arnold J; Monasson, Rémi

    2014-04-01

    We outline a theory to quantify the interplay of entropic and selective forces on nucleotide organization and apply it to the genomes of single-stranded RNA viruses. We quantify these forces as intensive variables that can easily be compared between sequences, outline a computationally efficient transfer-matrix method for their calculation, and apply this method to influenza and HIV viruses. We find viruses altering their dinucleotide motif use under selective forces, with these forces on CpG dinucleotides growing stronger in influenza the longer it replicates in humans. For a subset of genes in the human genome, many involved in antiviral innate immunity, the forces acting on CpG dinucleotides are even greater than the forces observed in viruses, suggesting that both effects are in response to similar selective forces involving the innate immune system. We further find that the dynamics of entropic forces balancing selective forces can be used to predict how long it will take a virus to adapt to a new host, and that it would take H1N1 several centuries to adapt to humans from birds, typically contributing many of its synonymous substitutions to the forcible removal of CpG dinucleotides. By examining the probability landscape of dinucleotide motifs, we predict where motifs are likely to appear using only a single-force parameter and uncover the localization of UpU motifs in HIV. Essentially, we extend the natural language and concepts of statistical physics, such as entropy and conjugated forces, to understanding viral sequences and, more generally, constrained genome evolution.

  15. Nucleotide sequences and genetic analysis of hydrogen oxidation (hox) genes in Azotobacter vinelandii.

    PubMed Central

    Menon, A L; Mortenson, L E; Robson, R L

    1992-01-01

    Azotobacter vinelandii contains a heterodimeric, membrane-bound [NiFe]hydrogenase capable of catalyzing the reversible oxidation of H2. The beta and alpha subunits of the enzyme are encoded by the structural genes hoxK and hoxG, respectively, which appear to form part of an operon that contains at least one further potential gene (open reading frame 3 [ORF3]). In this study, determination of the nucleotide sequence of a region of 2,344 bp downstream of ORF3 revealed four additional closely spaced or overlapping ORFs. These ORFs, ORF4 through ORF7, potentially encode polypeptides with predicted masses of 22.8, 11.4, 16.3, and 31 kDa, respectively. Mutagenesis of the chromosome of A. vinelandii in the area sequenced was carried out by introduction of antibiotic resistance gene cassettes. Disruption of hoxK and hoxG by a kanamycin resistance gene abolished whole-cell hydrogenase activity coupled to O2 and led to loss of the hydrogenase alpha subunit. Insertional mutagenesis of ORF3 through ORF7 with a promoterless lacZ-Kmr cassette established that the region is transcriptionally active and involved in H2 oxidation. We propose to call ORF3 through ORF7 hoxZ, hoxM, hoxL, hoxO, and hoxQ, respectively. The predicted hox gene products resemble those encoded by genes from hydrogenase-related operons in other bacteria, including Escherichia coli and Alcaligenes eutrophus. PMID:1624446

  16. Alignment-free analysis of barcode sequences by means of compression-based methods

    PubMed Central

    2013-01-01

    Background: The key idea of the DNA barcode initiative is to identify, for each group of species belonging to different kingdoms of life, a short DNA sequence that can act as a true taxon barcode. DNA barcodes represent a valuable type of information that can be integrated with ecological, genetic, and morphological data in order to obtain a more consistent taxonomy. Recent studies have shown that, for the animal kingdom, the mitochondrial gene cytochrome c oxidase I (COI), about 650 bp long, can be used as a barcode sequence for identification and taxonomic purposes. In the present work we aim to introduce the use of an alignment-free approach for the taxonomic analysis of barcode sequences. Our approach is based on the use of two compression-based versions of the non-computable Universal Similarity Metric (USM) class of distances. Our purpose is to justify the use of USM also for the analysis of short DNA barcode sequences, showing how USM is able to correctly extract taxonomic information from those kinds of sequences. Results: We downloaded 30 datasets of barcode sequences belonging to different animal species from the Barcode of Life Data System (BOLD) database. We built phylogenetic trees of every dataset, according to compression-based and classic evolutionary methods, and compared them in terms of topology preservation. In the experimental tests, we obtained scores with a percentage of similarity between evolutionary and compression-based trees between 80% and 100% for most of the datasets (94%). Moreover, we carried out experimental tests using simulated barcode datasets composed of 100, 150, 200 and 500 sequences, each simulation replicated 25-fold. In this case, mean similarity scores between evolutionary and compression-based trees span between 83% and 99% for all simulated datasets. Conclusions: In the present work we aim to introduce the use of an alignment-free approach for the taxonomic analysis of barcode sequences. Our
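
The abstract above does not say which two compression-based versions of USM were used. As a minimal sketch of the general idea, the following computes the normalized compression distance (NCD), one widely used compressor-based approximation of this class of distances. The choice of zlib as the compressor, the function names, and the toy sequences are all assumptions made for illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Smaller values mean more shared content.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Toy data only: two unrelated pseudo-random "barcodes" of about 650 bp.
    rng = random.Random(0)
    seq_a = "".join(rng.choice("ACGT") for _ in range(650))
    seq_b = "".join(rng.choice("ACGT") for _ in range(650))
    print("NCD(a, a) =", round(ncd(seq_a, seq_a), 3))
    print("NCD(a, b) =", round(ncd(seq_a, seq_b), 3))
```

A matrix of such pairwise distances could then be passed to a standard distance-based tree-building method, which is roughly how compression-based trees are compared with evolutionary ones.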

  17. Graph-based modeling of tandem repeats improves global multiple sequence alignment.

    PubMed

    Szalkowski, Adam M; Anisimova, Maria

    2013-09-01

    Tandem repeats (TRs) are often present in proteins with crucial functions, responsible for resistance, pathogenicity and associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, requiring accurate multiple sequence alignment. TRs may be lost or inserted at any position of a TR region by replication slippage or recombination, but current methods assume fixed unit boundaries, and yet are of high complexity. We present a new global graph-based alignment method that does not restrict TR unit indels by unit boundaries. TR indels are modeled separately and penalized using the phylogeny-aware alignment algorithm. This ensures enhanced accuracy of reconstructed alignments, disentangling TRs and measuring indel events and rates in a biologically meaningful way. Our method detects not only duplication events but also all changes in TR regions owing to recombination, strand slippage and other events inserting or deleting TR units. We evaluate our method by simulation incorporating TR evolution, by either sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. The new method is illustrated on a family of type III effectors, a pathogenicity determinant in agriculturally important bacteria Ralstonia solanacearum. We show that TR indel rate variation contributes to the diversification of this protein family.

  18. T box transcription antitermination riboswitch: Influence of nucleotide sequence and orientation on tRNA binding by the antiterminator element

    PubMed Central

    Fauzi, Hamid; Agyeman, Akwasi; Hines, Jennifer V.

    2008-01-01

    Many bacteria utilize riboswitch transcription regulation to monitor and appropriately respond to cellular levels of important metabolites or effector molecules. The T box transcription antitermination riboswitch responds to cognate uncharged tRNA by specifically stabilizing an antiterminator element in the 5′-untranslated mRNA leader region and precluding formation of a thermodynamically more stable terminator element. Stabilization occurs when the tRNA acceptor end base pairs with the first four nucleotides in the seven nucleotide bulge of the highly conserved antiterminator element. The significance of the conservation of the antiterminator bulge nucleotides that do not base pair with the tRNA is unknown, but they are required for optimal function. In vitro selection was used to determine if the isolated antiterminator bulge context alone dictates the mode in which the tRNA acceptor end binds the bulge nucleotides. No sequence conservation beyond complementarity was observed and the location was not constrained to the first four bases of the bulge. The results indicate that formation of a structure that recognizes the tRNA acceptor end in isolation is not the determinant driving force for the high phylogenetic sequence conservation observed within the antiterminator bulge. Additional factors or T box leader features more likely influenced the phylogenetic sequence conservation. PMID:19152843

  19. Structural dynamics of cereal mitochondrial genomes as revealed by complete nucleotide sequencing of the wheat mitochondrial genome.

    PubMed

    Ogihara, Yasunari; Yamazaki, Yukiko; Murai, Koji; Kanno, Akira; Terachi, Toru; Shiina, Takashi; Miyashita, Naohiko; Nasuda, Shuhei; Nakamura, Chiharu; Mori, Naoki; Takumi, Shigeo; Murata, Minoru; Futo, Satoshi; Tsunewaki, Koichiro

    2005-01-01

    The application of a new gene-based strategy for sequencing the wheat mitochondrial genome shows its structure to be a 452 528 bp circular molecule, and provides nucleotide-level evidence of intra-molecular recombination. Single, reciprocal and double recombinant products, and the nucleotide sequences of the repeats that mediate their formation have been identified. The genome has 55 genes with exons, including 35 protein-coding, 3 rRNA and 17 tRNA genes. Nucleotide sequences of seven wheat genes have been determined here for the first time. Nine genes have an exon-intron structure. Gene amplification responsible for the production of multicopy mitochondrial genes, in general, is species-specific, suggesting the recent origin of these genes. About 16, 17, 15, 3.0 and 0.2% of wheat mitochondrial DNA (mtDNA) may be of genic (including introns), open reading frame, repetitive sequence, chloroplast and retro-element origin, respectively. The gene order of the wheat mitochondrial gene map shows little synteny to the rice and maize maps, indicative that thorough gene shuffling occurred during speciation. Almost all unique mtDNA sequences of wheat, as compared with rice and maize mtDNAs, are redundant DNA. Features of the gene-based strategy are discussed, and a mechanistic model of mitochondrial gene amplification is proposed. PMID:16260473

  20. Structural dynamics of cereal mitochondrial genomes as revealed by complete nucleotide sequencing of the wheat mitochondrial genome

    PubMed Central

    Ogihara, Yasunari; Yamazaki, Yukiko; Murai, Koji; Kanno, Akira; Terachi, Toru; Shiina, Takashi; Miyashita, Naohiko; Nasuda, Shuhei; Nakamura, Chiharu; Mori, Naoki; Takumi, Shigeo; Murata, Minoru; Futo, Satoshi; Tsunewaki, Koichiro

    2005-01-01

    The application of a new gene-based strategy for sequencing the wheat mitochondrial genome shows its structure to be a 452 528 bp circular molecule, and provides nucleotide-level evidence of intra-molecular recombination. Single, reciprocal and double recombinant products, and the nucleotide sequences of the repeats that mediate their formation have been identified. The genome has 55 genes with exons, including 35 protein-coding, 3 rRNA and 17 tRNA genes. Nucleotide sequences of seven wheat genes have been determined here for the first time. Nine genes have an exon–intron structure. Gene amplification responsible for the production of multicopy mitochondrial genes, in general, is species-specific, suggesting the recent origin of these genes. About 16, 17, 15, 3.0 and 0.2% of wheat mitochondrial DNA (mtDNA) may be of genic (including introns), open reading frame, repetitive sequence, chloroplast and retro-element origin, respectively. The gene order of the wheat mitochondrial gene map shows little synteny to the rice and maize maps, indicative that thorough gene shuffling occurred during speciation. Almost all unique mtDNA sequences of wheat, as compared with rice and maize mtDNAs, are redundant DNA. Features of the gene-based strategy are discussed, and a mechanistic model of mitochondrial gene amplification is proposed. PMID:16260473

  1. Multiple Sequence Alignment with Hidden Markov Models Learned by Random Drift Particle Swarm Optimization.

    PubMed

    Sun, Jun; Palade, Vasile; Wu, Xiaojun; Fang, Wei

    2014-01-01

    Hidden Markov Models (HMMs) are powerful tools for multiple sequence alignment (MSA), which is known to be an NP-complete and important problem in bioinformatics. Learning HMMs is a difficult task, and many meta-heuristic methods, including particle swarm optimization (PSO), have been used for that. In this paper, a new variant of PSO, called the random drift particle swarm optimization (RDPSO) algorithm, is proposed to be used for HMM learning tasks in MSA problems. The proposed RDPSO algorithm, inspired by the free electron model in metal conductors in an external electric field, employs a novel set of evolution equations that can enhance the global search ability of the algorithm. Moreover, in order to further enhance the algorithmic performance of the RDPSO, we incorporate a diversity control method into the algorithm and, thus, propose an RDPSO with diversity-guided search (RDPSO-DGS). The performances of the RDPSO, RDPSO-DGS and other algorithms are tested and compared by learning HMMs for MSA on two well-known benchmark data sets. The experimental results show that the HMMs learned by the RDPSO and RDPSO-DGS are able to generate better alignments for the benchmark data sets than other most commonly used HMM learning methods, such as the Baum-Welch and other PSO algorithms. The performance comparison with well-known MSA programs, such as ClustalW and MAFFT, also shows that the proposed methods have advantages in multiple sequence alignment.

  2. The Bryopsis hypnoides Plastid Genome: Multimeric Forms and Complete Nucleotide Sequence

    PubMed Central

    Tian, Chao; Wang, Guangce; Niu, Jiangfeng; Pan, Guanghua; Hu, Songnian

    2011-01-01

    Background: Bryopsis hypnoides Lamouroux is a siphonous green alga, and its extruded protoplasm can aggregate spontaneously in seawater and develop into mature individuals. The chloroplast of B. hypnoides is the biggest organelle in the cell and shows strong autonomy. To better understand this organelle, we sequenced and analyzed the chloroplast genome of this green alga. Principal Findings: A total of 111 functional genes, including 69 potential protein-coding genes, 5 ribosomal RNA genes, and 37 tRNA genes were identified. The genome size (153,429 bp), arrangement, and inverted-repeat (IR)-lacking structure of the B. hypnoides chloroplast DNA (cpDNA) closely resemble those of Chlorella vulgaris. Furthermore, our cytogenomic investigations using pulsed-field gel electrophoresis (PFGE) and Southern blotting methods showed that the B. hypnoides cpDNA had multimeric forms, including monomer, dimer, trimer, tetramer, and even higher multimers, which is similar to the higher order organization observed previously for higher plant cpDNA. The relative amounts of the four multimeric cpDNA forms were estimated to be about 1, 1/2, 1/4, and 1/8 based on molecular hybridization analysis. Phylogenetic analyses based on a concatenated alignment of chloroplast protein sequences suggested that B. hypnoides is sister to all Chlorophyceae and this placement received moderate support. Conclusion: All of the results suggest that the autonomy of the chloroplasts of B. hypnoides has little to do with the size and gene content of the cpDNA, and the IR-lacking structure of the chloroplasts indirectly demonstrated that the multimeric molecules might result from the random cleavage and fusion of replication intermediates instead of recombinational events. PMID:21339817

  3. Studies on structure-based sequence alignment and phylogenies of beta-lactamases.

    PubMed

    Salahuddin, Parveen; Khan, Asad U

    2014-01-01

    The β-lactamase enzymes cleave the amide bond in the β-lactam ring, rendering β-lactam antibiotics harmless to bacteria. In this communication we have studied the structure-function relationship and phylogenies of class A, B and D beta-lactamases using structure-based sequence alignment and PHYLIP programs, respectively. The data of structure-based sequence alignment suggest that in different isolates of TEM-1, mutations did not occur at or near sequence motifs. Since deletions are reported to be lethal to the structure and function of the enzyme, in these variants the antibiotic hydrolysis profile and specificity will be affected. The alignment data for the class A enzymes SHV-1 and CTX-M-15, the class D enzyme OXA-10, and the class B enzymes VIM-2 and SIM-1 show that sequence motifs, along with other parts of the polypeptide, are essentially conserved. These results imply that the conformations of these beta-lactamases are close to the native state and that the enzymes possess normal hydrolytic activities towards beta-lactam antibiotics. However, class B enzymes such as IMP-1 and NDM-1 are less conserved than the class A and D enzymes studied here because mutations and deletions occurred at critically important regions such as the active site. Therefore, the structure of these beta-lactamases will be altered and the antibiotic hydrolysis profile will be affected. Phylogenetic studies suggest that class A and D beta-lactamases including TOHO-1 and OXA-10, respectively, evolved by horizontal gene transfer (HGT), whereas other members of class A such as TEM-1 evolved by a gene duplication mechanism. Taken together, these studies justify the structure-function relationship of beta-lactamases, and the phylogenetic studies suggest that these enzymes evolved by different mechanisms. PMID:24966539

  4. Sequence-Specific Incorporation of Enzyme-Nucleotide Chimera by DNA Polymerases.

    PubMed

    Welter, Moritz; Verga, Daniela; Marx, Andreas

    2016-08-16

    DNA polymerases select the right nucleotide for the growing polynucleotide chain based on the shape and geometry of the nascent nucleotide pairs and thereby ensure high DNA replication selectivity. High-fidelity DNA polymerases are believed to possess tight active sites that allow little deviation from the canonical structures. However, DNA polymerases are known to use nucleotides with small modifications as substrates, which is key for numerous core biotechnology applications. We show that even high-fidelity DNA polymerases are capable of efficiently using nucleotide chimera modified with a large protein like horseradish peroxidase as substrates for template-dependent DNA synthesis, despite this "cargo" being more than 100-fold larger than the natural substrates. We exploited this capability for the development of systems that enable naked-eye detection of DNA and RNA at single nucleotide resolution. PMID:27392211

  5. Diagnostic assay for Helicobacter hepaticus based on nucleotide sequence of its 16S rRNA gene.

    PubMed Central

    Battles, J K; Williamson, J C; Pike, K M; Gorelick, P L; Ward, J M; Gonda, M A

    1995-01-01

    Conserved primers were used to PCR amplify 95% of the Helicobacter hepaticus 16S rRNA gene. Its sequence was determined and aligned to those of related bacteria, enabling the selection of primers to highly diverged regions of the 16S rRNA gene and an oligonucleotide probe for the development of a PCR-liquid hybridization assay. This assay was shown to be both sensitive and specific for H. hepaticus 16S rRNA gene sequences. PMID:7542270

  6. Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine.

    PubMed

    Ye, Hao; Meehan, Joe; Tong, Weida; Hong, Huixiao

    2015-01-01

    Precision medicine or personalized medicine has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to make a decision on which alignment algorithm to use in their studies. Selection of the right alignment algorithm determines not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review current available algorithms and their major strategies such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long reads alignment and alignment facilitated with known genetic variants.

  7. Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine

    PubMed Central

    Ye, Hao; Meehan, Joe; Tong, Weida; Hong, Huixiao

    2015-01-01

    Precision medicine or personalized medicine has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to make a decision on which alignment algorithm to use in their studies. Selection of the right alignment algorithm determines not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review current available algorithms and their major strategies such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long reads alignment and alignment facilitated with known genetic variants. PMID:26610555
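
The seed-and-extend strategy named in the abstract above can be illustrated with a deliberately simplified sketch: exact k-mer seeds are looked up in a hash index of the reference, and each candidate placement is then verified by an ungapped comparison. The function names, the value of k, the mismatch threshold, and the toy reference are assumptions chosen for illustration. Production aligners use compressed indexes and gapped extension, which this sketch does not attempt.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read: str, reference: str, index: dict, k: int,
                    max_mismatches: int = 2) -> list:
    """Return (reference_start, mismatches) placements, best first.

    Seed: every k-mer of the read is looked up exactly in the index.
    Extend: each implied placement is checked over the full read length
    without gaps (a simplification), counting mismatches.
    """
    seen = set()
    hits = []
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start in seen or start < 0 or start + len(read) > len(reference):
                continue
            seen.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTACGGATCCATGCA"  # toy reference, not real data
    index = build_kmer_index(reference, k=4)
    print(seed_and_extend("AGGCTTACGGAT", reference, index, k=4))
```

Shorter seeds tolerate more errors in the read but produce more candidate placements to verify, which is one of the parameter trade-offs a user has to weigh when choosing and configuring an aligner.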

  8. Nucleotide sequence of a complementary DNA encoding pea cytosolic copper/zinc superoxide dismutase. [Pisum sativum L.]

    SciTech Connect

    White, D.A.; Zilinskas, B.A.

    1991-08-01

    The authors now report the nucleotide sequence of the cytosolic Cu/Zn SOD cloned from a λgt11 cDNA library constructed from mRNA extracted from leaves of 7- to 10-d pea seedlings (Pisum sativum L.). The clone was isolated using a 22-base synthetic oligonucleotide complementary to the amino acid sequence CGIIGLQG. This sequence, found at the protein's carboxy terminus, is highly conserved among plant cytosolic Cu/Zn SODs but not chloroplastic Cu/Zn SODs. The 738-base pair sequence contains an open reading frame specifying 152 codons and a predicted M_r of 18,024 D. The deduced amino acid sequence is highly homologous (79-82% identity) with the sequences of other known plant cytosolic Cu/Zn SODs but less highly conserved (63-65%) when compared with several chloroplastic Cu/Zn SODs including pea (10).

  9. CUDA ClustalW: An efficient parallel algorithm for progressive multiple sequence alignment on Multi-GPUs.

    PubMed

    Hung, Che-Lun; Lin, Yu-Shiang; Lin, Chun-Yuan; Chung, Yeh-Ching; Chung, Yi-Fang

    2015-10-01

    For biological applications, sequence alignment is an important strategy to analyze DNA and protein sequences. Multiple sequence alignment is an essential methodology in studies of biological data, such as homology modeling and phylogenetic reconstruction. However, multiple sequence alignment is an NP-hard problem. In the past decades, the progressive approach has been proposed to successfully align multiple sequences by adopting iterative pairwise alignments. Due to the rapid growth of next-generation sequencing technologies, a large number of sequences can be produced in a short period of time. When the problem instance is large, progressive alignment is time consuming. Parallel computing is a suitable solution for such applications, and the GPU is one of the important architectures for contemporary parallel computing research. Therefore, we propose a GPU version of ClustalW v2.0.11, called CUDA ClustalW v1.0, in this work. The experimental results show that CUDA ClustalW v1.0 can achieve more than 33× speedup in overall execution time compared with ClustalW v2.0.11.
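
Progressive alignment starts from pairwise comparisons of all input sequences, and each pair can be scored independently of the others, which is why this stage lends itself to parallel hardware. The sketch below is a plain CPU illustration of that all-pairs stage using a Needleman-Wunsch global alignment score. It is not the CUDA ClustalW implementation, and the scoring values (match, mismatch, linear gap penalty) are assumptions chosen for illustration rather than ClustalW's parameters.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Only two rows of the dynamic programming table are kept in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    """Score every unordered pair of sequences; the pairs are independent tasks."""
    return {
        (i, j): nw_score(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


# Hypothetical toy input.
seqs = ["GATTACA", "GATTTACA", "GACTACA"]
scores = pairwise_scores(seqs)
print(scores)
```

With n sequences there are n(n-1)/2 such pairs, so the cost of this stage grows quadratically with the number of sequences.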

  10. Nucleotide Sequence and Genetic Structure of a Novel Carbaryl Hydrolase Gene (cehA) from Rhizobium sp. Strain AC100

    PubMed Central

    Hashimoto, Masayuki; Fukui, Mitsuru; Hayano, Kouichi; Hayatsu, Masahito

    2002-01-01

    Rhizobium sp. strain AC100, which is capable of degrading carbaryl (1-naphthyl-N-methylcarbamate), was isolated from soil treated with carbaryl. This bacterium hydrolyzed carbaryl to 1-naphthol and methylamine. Carbaryl hydrolase from the strain was purified to homogeneity, and its N-terminal sequence, molecular mass (82 kDa), and enzymatic properties were determined. The purified enzyme hydrolyzed 1-naphthyl acetate and 4-nitrophenyl acetate, indicating that the enzyme is an esterase. We then cloned the carbaryl hydrolase gene (cehA) from the plasmid DNA of the strain and determined the nucleotide sequence of the 10-kb region containing cehA. No homologous sequences were found by a database homology search using the nucleotide and deduced amino acid sequences of the cehA gene. Six open reading frames including the cehA gene were found in the 10-kb region, and sequencing analysis shows that the cehA gene is flanked by two copies of an insertion sequence-like sequence, suggesting that it forms part of a composite transposon. PMID:11872471

  11. Nucleotide sequence and mutational analysis of the vnfENX region of Azotobacter vinelandii.

    PubMed Central

    Wolfinger, E D; Bishop, P E

    1991-01-01

    The nucleotide sequence (3,600 bp) of a second copy of nifENX-like genes in Azotobacter vinelandii has been determined. These genes are located immediately downstream from vnfA and have been designated vnfENX. The vnfENX genes appear to be organized as a single transcriptional unit that is preceded by a potential RpoN-dependent promoter. While the nifEN genes are thought to be evolutionarily related to nifDK, the vnfEN genes appear to be more closely related to nifEN than to either nifDK, vnfDK, or anfDK. Mutant strains (CA47 and CA48) carrying insertions in vnfE and vnfN, respectively, are able to grow diazotrophically in molybdenum (Mo)-deficient medium containing vanadium (V) (Vnf+) and in medium lacking both Mo and V (Anf+). However, a double mutant (strain DJ42.48) which contains a nifEN deletion and an insertion in vnfE is unable to grow diazotrophically in Mo-sufficient medium or in Mo-deficient medium with or without V. This suggests that NifE and NifN substitute for VnfE and VnfN when the vnfEN genes are mutationally inactivated. AnfA is not required for the expression of a vnfN-lacZ transcriptional fusion, even though this fusion is expressed under Mo- and V-deficient diazotrophic growth conditions. PMID:1938952

  12. Nucleotide sequence and structural determinants of specific binding of coat protein or coat protein peptides to the 3' untranslated region of alfalfa mosaic virus RNA 4.

    PubMed Central

    Houser-Scott, F; Baer, M L; Liem, K F; Cai, J M; Gehrke, L

    1994-01-01

    The specific binding of alfalfa mosaic virus coat protein to viral RNA requires determinants in the 3' untranslated region (UTR). Coat protein and peptide binding sites in the 3' UTR of alfalfa mosaic virus RNA 4 have been analyzed by hydroxyl radical footprinting, deletion mapping, and site-directed mutagenesis experiments. The 3' UTR has several stable hairpins that are flanked by single-stranded (A/U)UGC sequences. Hydroxyl radical footprinting data show that five sites in the 3' UTR of alfalfa mosaic virus RNA 4 are protected by coat protein, and four of the five protected regions contain AUGC or UUGC. Electrophoretic mobility band shift results suggest four coat protein binding sites in the 3' UTR. A 3'-terminal 39-nucleotide RNA fragment containing four AUGC repeats bound coat protein and coat protein peptides with high affinity; however, coat protein bound poorly to antisense 3' UTR transcripts and poly(AUGC)10. Site-directed mutagenesis of AUGC865-868 resulted in a loss of coat protein binding and peptide binding by the RNA fragment. Alignment of alfalfa mosaic RNA sequences with those from several closely related ilarviruses demonstrates that AUGC865-868 is perfectly conserved; moreover, the RNAs are predicted to form similar 3'-terminal secondary structures. The data strongly suggest that alfalfa mosaic virus coat protein and ilarvirus coat proteins recognize invariant AUGC sequences in the context of conserved structural elements. PMID:8139004

  13. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples.

  14. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. PMID:25625550
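
The abstract does not spell out how HBlast's "virtual partitioning" works, so the following is only one possible reading, offered as a sketch. It describes partitions of both the query set and the database as index ranges, and pairs them into independent work units without physically splitting either input. All function names and partition counts are hypothetical, and nothing here calls BLAST or Hadoop.

```python
from itertools import product


def chunk_ranges(n_items, n_chunks):
    """Split range(n_items) into up to n_chunks near-equal half-open (start, end) ranges."""
    base, extra = divmod(n_items, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        if end > start:  # drop empty ranges when there are more chunks than items
            ranges.append((start, end))
        start = end
    return ranges


def work_units(n_queries, n_db_seqs, query_parts, db_parts):
    """Pair every query range with every database range.

    Each (query_range, db_range) tuple is one independent search task that a
    mapper could run; only index ranges are passed around, not copies of data.
    """
    return list(product(chunk_ranges(n_queries, query_parts),
                        chunk_ranges(n_db_seqs, db_parts)))


# Hypothetical sizes: 10 queries in 3 parts, 1000 database sequences in 4 parts.
units = work_units(n_queries=10, n_db_seqs=1000, query_parts=3, db_parts=4)
print(len(units), units[0])
```

Changing the two partition counts changes the number and size of tasks without rebuilding the database, which is the kind of flexibility the abstract attributes to the method.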

  15. Identification and Evaluation of Single-Nucleotide Polymorphisms in Allotetraploid Peanut (Arachis hypogaea L.) Based on Amplicon Sequencing Combined with High Resolution Melting (HRM) Analysis.

    PubMed

    Hong, Yanbin; Pandey, Manish K; Liu, Ying; Chen, Xiaoping; Liu, Hong; Varshney, Rajeev K; Liang, Xuanqiang; Huang, Shangzhi

    2015-01-01

    The cultivated peanut (Arachis hypogaea L.) is an allotetraploid (AABB) species derived from the A-genome (Arachis duranensis) and B-genome (Arachis ipaensis) progenitors. Presence of two versions of a DNA sequence based on the two progenitor genomes poses a serious technical and analytical problem during single nucleotide polymorphism (SNP) marker identification and analysis. In this context, we have analyzed 200 amplicons derived from expressed sequence tags (ESTs) and genome survey sequences (GSS) to identify SNPs in a panel of genotypes consisting of 12 cultivated peanut varieties and two diploid progenitors representing the ancestral genomes. A total of 18 EST-SNPs and 44 genomic-SNPs were identified in 12 peanut varieties by aligning the sequence of A. hypogaea with diploid progenitors. The average frequency of sequence polymorphism was higher for genomic-SNPs than the EST-SNPs with one genomic-SNP every 1011 bp as compared to one EST-SNP every 2557 bp. In order to estimate the potential and further applicability of these identified SNPs, 96 peanut varieties were genotyped using the high resolution melting (HRM) method. Polymorphism information content (PIC) values for EST-SNPs ranged between 0.021 and 0.413 with a mean of 0.172 in the set of peanut varieties, while genomic-SNPs ranged between 0.080 and 0.478 with a mean of 0.249. A total of 33 SNPs were used for polymorphism detection among the parents and 10 selected lines from mapping population Y13Zh (Zhenzhuhei × Yueyou13). Of the total 33 SNPs, nine SNPs showed polymorphism in the mapping population Y13Zh, and seven SNPs were successfully mapped into five linkage groups. Our results showed that SNPs can be identified in allotetraploid peanut with high accuracy through amplicon sequencing and HRM assay. The identified SNPs were very informative and can be used for different genetic and breeding applications in peanut.

  16. Identification and Evaluation of Single-Nucleotide Polymorphisms in Allotetraploid Peanut (Arachis hypogaea L.) Based on Amplicon Sequencing Combined with High Resolution Melting (HRM) Analysis

    PubMed Central

    Hong, Yanbin; Pandey, Manish K.; Liu, Ying; Chen, Xiaoping; Liu, Hong; Varshney, Rajeev K.; Liang, Xuanqiang; Huang, Shangzhi

    2015-01-01

    The cultivated peanut (Arachis hypogaea L.) is an allotetraploid (AABB) species derived from the A-genome (Arachis duranensis) and B-genome (Arachis ipaensis) progenitors. Presence of two versions of a DNA sequence based on the two progenitor genomes poses a serious technical and analytical problem during single nucleotide polymorphism (SNP) marker identification and analysis. In this context, we have analyzed 200 amplicons derived from expressed sequence tags (ESTs) and genome survey sequences (GSS) to identify SNPs in a panel of genotypes consisting of 12 cultivated peanut varieties and two diploid progenitors representing the ancestral genomes. A total of 18 EST-SNPs and 44 genomic-SNPs were identified in 12 peanut varieties by aligning the sequence of A. hypogaea with diploid progenitors. The average frequency of sequence polymorphism was higher for genomic-SNPs than the EST-SNPs with one genomic-SNP every 1011 bp as compared to one EST-SNP every 2557 bp. In order to estimate the potential and further applicability of these identified SNPs, 96 peanut varieties were genotyped using the high resolution melting (HRM) method. Polymorphism information content (PIC) values for EST-SNPs ranged between 0.021 and 0.413 with a mean of 0.172 in the set of peanut varieties, while genomic-SNPs ranged between 0.080 and 0.478 with a mean of 0.249. A total of 33 SNPs were used for polymorphism detection among the parents and 10 selected lines from mapping population Y13Zh (Zhenzhuhei × Yueyou13). Of the total 33 SNPs, nine SNPs showed polymorphism in the mapping population Y13Zh, and seven SNPs were successfully mapped into five linkage groups. Our results showed that SNPs can be identified in allotetraploid peanut with high accuracy through amplicon sequencing and HRM assay. The identified SNPs were very informative and can be used for different genetic and breeding applications in peanut. PMID:26697032

  17. Open-Phylo: a customizable crowd-computing platform for multiple sequence alignment.

    PubMed

    Kwak, Daniel; Kam, Alfred; Becerra, David; Zhou, Qikuan; Hops, Adam; Zarour, Eleyine; Kam, Arthur; Sarmenta, Luis; Blanchette, Mathieu; Waldispühl, Jérôme

    2013-01-01

    Citizen science games such as Galaxy Zoo, Foldit, and Phylo aim to harness the intelligence and processing power generated by crowds of online gamers to solve scientific problems. However, the selection of the data to be analyzed through these games is under the exclusive control of the game designers, and so are the results produced by gamers. Here, we introduce Open-Phylo, a freely accessible crowd-computing platform that enables any scientist to enter our system and use crowds of gamers to assist computer programs in solving one of the most fundamental problems in genomics: the multiple sequence alignment problem. PMID:24148814

  18. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... Director of the Federal Register in accordance with 5 U.S.C. 552(a) and 1 CFR part 51. Copies of WIPO... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid...

  19. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... Director of the Federal Register in accordance with 5 U.S.C. 552(a) and 1 CFR part 51. Copies of WIPO... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid...

  20. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... Director of the Federal Register in accordance with 5 U.S.C. 552(a) and 1 CFR part 51. Copies of WIPO... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid...

  1. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... Director of the Federal Register in accordance with 5 U.S.C. 552(a) and 1 CFR part 51. Copies of WIPO... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid...

  2. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... Director of the Federal Register in accordance with 5 U.S.C. 552(a) and 1 CFR part 51. Copies of WIPO... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid...

  3. Two Simple and Efficient Algorithms to Compute the SP-Score Objective Function of a Multiple Sequence Alignment

    PubMed Central

    Ranwez, Vincent

    2016-01-01

    Background: Multiple sequence alignment (MSA) is a crucial step in many molecular analyses and many MSA tools have been developed. Most of them use a greedy approach to construct a first alignment that is then refined by optimizing the sum-of-pairs score (SP-score). The SP-score estimation is thus a bottleneck for most MSA tools since it is repeatedly required and is time consuming. Results: Given an alignment of n sequences and L sites, I introduce here optimized solutions reaching O(nL) time complexity for affine gap cost, instead of O(n²L), which are easy to implement. PMID:27505054
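
The gap between quadratic and linear cost in the number of sequences can be seen in a simplified setting. The sketch below is not the paper's algorithm: it handles only a per-column (linear) gap cost, not the affine gap cost the abstract addresses, and the score values are assumptions chosen for illustration. It compares a direct sum over all sequence pairs, which takes O(n^2 L) time, with an equivalent computation from per-column symbol counts, which takes O(nL) time for a fixed alphabet.

```python
from collections import Counter
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative scores, not taken from the paper


def pair_score(x, y):
    """Score one aligned pair of symbols; a gap facing a gap costs nothing."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def sp_score_naive(alignment):
    """Sum pair_score over all sequence pairs and all columns: O(n^2 L)."""
    return sum(
        pair_score(x, y)
        for a, b in combinations(alignment, 2)
        for x, y in zip(a, b)
    )


def sp_score_by_counts(alignment):
    """Same score computed from per-column symbol counts: O(nL)."""
    total = 0
    for column in zip(*alignment):
        counts = Counter(column)
        gaps = counts.pop("-", 0)
        residues = len(column) - gaps
        # Pairs of identical residues, then the remaining residue pairs.
        same = sum(c * (c - 1) // 2 for c in counts.values())
        different = residues * (residues - 1) // 2 - same
        total += MATCH * same + MISMATCH * different + GAP * gaps * residues
    return total


# Hypothetical toy alignment of four sequences and five columns.
alignment = ["ACG-T", "AC-TT", "A-GTT", "ACGTA"]
print(sp_score_naive(alignment), sp_score_by_counts(alignment))
```

Counting works here because, within a column, a pair's score depends only on which two symbols meet. An affine gap cost depends on neighbouring columns as well, which is what makes the case treated in the paper harder.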

  4. A comparison of nucleotide sequences of measles virus L genes derived from wild-type viruses and SSPE brain tissues.

    PubMed

    Komase, K; Rima, B K; Pardowitz, I; Kunz, C; Billeter, M A; ter Meulen, V; Baczko, K

    1995-04-20

    The nucleotide sequences of the large protein (L) gene derived from two wild-type measles viruses (MV) and two SSPE brain-derived viruses have been determined. All sequences have single large open reading frames encoding 2183 amino acid residues. The deduced L proteins are well conserved and the proposed functional domains which have been identified for rhabdo- and paramyxoviruses are completely conserved in all strains. The degree of variability of L proteins is the lowest of all structural proteins of MV, reflecting its role in virus reproduction and persistence. Biased hypermutation was not observed in the L genes derived from SSPE brain tissue. None of the nucleotide changes can be associated with the attenuated phenotype of the Edmonston vaccine viruses. PMID:7747453

  5. Genetic divergence between subpopulations of the eastern Pacific goose barnacle Pollicipes elegans: mitochondrial cytochrome c subunit 1 nucleotide sequences.

    PubMed

    Van Syoc, R J

    1994-12-01

    Nucleotide sequence data derived from polymerase chain reaction products from the cytochrome oxidase subunit 1 gene of mitochondrial DNA provide evidence for interrupted gene flow and subsequent genetic divergence between geographically separate subpopulations of the edible goose barnacle, Pollicipes elegans, with a 4400-km latitudinal distribution in the eastern Pacific Ocean. The amphitropical subpopulations of Pollicipes elegans have a net nucleotide sequence divergence of about 1.2%. A range of mutation rates are applied to calculate estimates for the timing of this divergence. The earliest estimated time of divergence agrees with a Pliocene time of general warming in the eastern Pacific. The latest estimated times coincide with the Pleistocene epoch and periods of cooling and warming that could have allowed for a series of expansions and contractions of P. elegans populations in the eastern tropical Pacific. These expansions and contractions may, therefore, represent alternating periods of genetic exchange and isolation of the two populations.

  6. Molecular Identification of Necrophagous Muscidae and Sarcophagidae Fly Species Collected in Korea by Mitochondrial Cytochrome c Oxidase Subunit I Nucleotide Sequences

    PubMed Central

    Ham, Chan Seon; Kim, Seong Yoon; Ko, Kwang Soo; Jo, Tae-Ho; Son, Gi Hoon

    2014-01-01

    Identification of insect species is an important task in forensic entomology. For more convenient species identification, the nucleotide sequences of cytochrome c oxidase subunit I (COI) gene have been widely utilized. We analyzed full-length COI nucleotide sequences of 10 Muscidae and 6 Sarcophagidae fly species collected in Korea. After DNA extraction from collected flies, PCR amplification and automatic sequencing of the whole COI sequence were performed. Obtained sequences were analyzed for a phylogenetic tree and a distance matrix. Our data showed very low intraspecific sequence distances and species-level monophylies. However, sequence comparison with previously reported sequences revealed a few inconsistencies or paraphylies requiring further investigation. To the best of our knowledge, this study is the first report of COI nucleotide sequences from Hydrotaea occulta, Muscina angustifrons, Muscina pascuorum, Ophyra leucostoma, Sarcophaga haemorrhoidalis, Sarcophaga harpax, and Phaonia aureola. PMID:24982938

  7. 3-d structure-based amino acid sequence alignment of esterases, lipases and related proteins

    SciTech Connect

    Gentry, M.K.; Doctor, B.P.; Cygler, M.; Schrag, J.D.; Sussman, J.L.

    1993-05-13

    Acetylcholinesterase and butyrylcholinesterase, enzymes with potential as pretreatment drugs for organophosphate toxicity, are members of a larger family of homologous proteins that includes carboxylesterases, cholesterol esterases, lipases, and several nonhydrolytic proteins. A computer-generated alignment of 18 of the proteins, the acetylcholinesterases, butyrylcholinesterases, carboxylesterases, some esterases, and the nonenzymatic proteins has been previously presented. More recently, the three-dimensional structures of two enzymes in this group, acetylcholinesterase from Torpedo californica and lipase from Geotrichum candidum, have been determined. Based on the x-ray structures and the superposition of these two enzymes, it was possible to obtain an improved amino acid sequence alignment of 32 members of this family of proteins. Examination of this alignment reveals that 24 amino acids are invariant in all of the hydrolytic proteins, and an additional 49 are well conserved. Conserved amino acids include those of the active site, the disulfide bridges, the salt bridges, in the core of the proteins, and at the edges of secondary structural elements. Comparison of the three-dimensional structures makes it possible to find a well-defined structural basis for the conservation of many of these amino acids.

  8. Characterization of Newcastle disease virus isolates by reverse transcription PCR coupled to direct nucleotide sequencing and development of sequence database for pathotype prediction and molecular epidemiological analysis.

    PubMed Central

    Seal, B S; King, D J; Bennett, J D

    1995-01-01

    Degenerate oligonucleotide primers were synthesized to amplify nucleotide sequences from portions of the fusion protein and matrix protein genes of Newcastle disease virus (NDV) genomic RNA that could be used diagnostically. These primers were used in a single-tube reverse transcription PCR of NDV genomic RNA coupled to direct nucleotide sequencing of the amplified product to characterize more than 30 NDV isolates. In agreement with previous reports, differences in the fusion protein cleavage sequence that correlated genotypically with virulence among various NDV pathotypes were detected. By using sequences generated from the matrix protein gene coding for the nuclear localization signal, lentogenic viruses were again grouped phylogenetically separate from other pathotypes. These techniques were applied to compare neurotropic velogenic viruses isolated from an outbreak of Newcastle disease in cormorants and turkeys. Cormorant NDV isolates and an NDV isolate from an infected turkey flock in North Dakota had the fusion protein cleavage sequence 109SRGRRQKRFVG119. The R-for-G substitution at position 110 may be unique for the cormorant-type isolates. Although the amino acid sequences from the fusion protein cleavage site were identical, nucleotide sequence data correlate the outbreak in turkeys to a cormorant virus isolate from Minnesota and not to a cormorant virus isolate from Michigan. On the basis of sequence information, the cormorant isolates are virulent viruses related to isolates of psittacine origin, possibly genotypically distinct from other velogenic NDV isolates. These techniques can be used reliably for Newcastle disease epidemiology and for prediction of pathotypes of NDV isolates without traditional live-bird inoculations. PMID:8567895

  9. Nucleotide sequencing and serological evidence that the recently recognized deer tick virus is a genotype of Powassan virus.

    PubMed

    Beasley, D W; Suderman, M T; Holbrook, M R; Barrett, A D

    2001-11-01

    Deer tick virus (DTV) is a recently recognized North American virus isolated from Ixodes dammini ticks. Nucleotide sequencing of fragments of structural and non-structural protein genes suggested that this virus was most closely related to the tick-borne flavivirus Powassan (POW), which causes potentially fatal encephalitis in humans. To determine whether DTV represents a new and distinct member of the Flavivirus genus of the family Flaviviridae, we sequenced the structural protein genes and 5' and 3' non-coding regions of this virus. In addition, we compared the reactivity of DTV and POW in hemagglutination inhibition tests with a panel of polyclonal and monoclonal antisera, and performed cross-neutralization experiments using anti-DTV antisera. Nucleotide sequencing revealed a high degree of homology between DTV and POW at both nucleotide (>80% homology) and amino acid (>90% homology) levels, and the two viruses were indistinguishable in serological assays and mouse neuroinvasiveness. On the basis of these results, we suggest that DTV should be classified as a genotype of POW virus. PMID:11551648

  10. Complete nucleotide sequence and gene rearrangement of the mitochondrial genome of the bell-ring frog, Buergeria buergeri (family Rhacophoridae).

    PubMed

    Sano, Naomi; Kurabayashi, Atsushi; Fujii, Tamotsu; Yonekawa, Hiromichi; Sumida, Masayuki

    2004-06-01

    In this study we determined the complete nucleotide sequence (19,959 bp) of the mitochondrial DNA of the rhacophorid frog Buergeria buergeri. The gene content, nucleotide composition, and codon usage of B. buergeri conformed to those of typical vertebrate patterns. However, due to an accumulation of lengthy repetitive sequences in the D-loop region, this species possesses the largest mitochondrial genome among all the vertebrates examined so far. Comparison of the gene organizations among amphibian species (Rana, Xenopus, salamanders and caecilians) revealed that the positioning of four tRNA genes and the ND5 gene in the mtDNA of B. buergeri diverged from the common vertebrate gene arrangement shared by Xenopus, salamanders and caecilians. The unique positions of the tRNA genes in B. buergeri are shared by ranid frogs, indicating that the rearrangements of the tRNA genes occurred in a common ancestral lineage of ranids and rhacophorids. On the other hand, the novel position of the ND5 gene seems to have arisen in a lineage leading to rhacophorids (and other closely related taxa) after ranid divergence. Phylogenetic analysis based on nucleotide sequence data of all mitochondrial genes also supported the gene rearrangement pathway.

  11. Complete nucleotide sequence and gene rearrangement of the mitochondrial genome of the bell-ring frog, Buergeria buergeri (family Rhacophoridae).

    PubMed

    Sano, Naomi; Kurabayashi, Atsushi; Fujii, Tamotsu; Yonekawa, Hiromichi; Sumida, Masayuki

    2004-06-01

    In this study we determined the complete nucleotide sequence (19,959 bp) of the mitochondrial DNA of the rhacophorid frog Buergeria buergeri. The gene content, nucleotide composition, and codon usage of B. buergeri conformed to those of typical vertebrate patterns. However, due to an accumulation of lengthy repetitive sequences in the D-loop region, this species possesses the largest mitochondrial genome among all the vertebrates examined so far. Comparison of the gene organizations among amphibian species (Rana, Xenopus, salamanders and caecilians) revealed that the positioning of four tRNA genes and the ND5 gene in the mtDNA of B. buergeri diverged from the common vertebrate gene arrangement shared by Xenopus, salamanders and caecilians. The unique positions of the tRNA genes in B. buergeri are shared by ranid frogs, indicating that the rearrangements of the tRNA genes occurred in a common ancestral lineage of ranids and rhacophorids. On the other hand, the novel position of the ND5 gene seems to have arisen in a lineage leading to rhacophorids (and other closely related taxa) after ranid divergence. Phylogenetic analysis based on nucleotide sequence data of all mitochondrial genes also supported the gene rearrangement pathway. PMID:15329496

  12. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment

    PubMed Central

    Newberg, Lee A.

    2008-01-01

    Motivation: A backtrace through a dynamic programming algorithm's intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward–backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis. Results: Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10 000. Availability: Sample C++-code for optimal backtrace is available in the Supplementary Materials. Contact: leen@cs.rpi.edu Supplementary information: Supplementary data is available at Bioinformatics online. PMID:18558620
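
For context, the sketch below shows the baseline that checkpointing strategies aim to improve on: a pairwise local alignment (Smith-Waterman) that stores the entire dynamic programming matrix so the backtrace can read any cell directly. It does not implement the paper's optimal checkpointing, and the scoring values and linear gap penalty are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a full score matrix and a direct backtrace.

    Memory use is O(len(a) * len(b)) because every intermediate score is kept.
    Returns (best_score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Forward pass: fill the matrix and remember the highest-scoring cell.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Backtrace: walk back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical toy input sharing the local region "ACGT".
result = smith_waterman("TTACGTGG", "CCACGTAA")
print(result)
```

For two sequences of length 10 000, as in the abstract, this matrix holds about 10^8 scores, which illustrates why storing only selected stages and recomputing the rest becomes attractive.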

  13. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences

    PubMed Central

    Pratas, Diogo; Silva, Raquel M.; Pinho, Armando J.; Ferreira, Paulo J.S.G.

    2015-01-01

    Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail. PMID:25984837

  14. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences.

    PubMed

    Pratas, Diogo; Silva, Raquel M; Pinho, Armando J; Ferreira, Paulo J S G

    2015-01-01

    Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail.

  15. SGP-1: prediction and validation of homologous genes based on sequence alignments.

    PubMed

    Wiehe, T; Gebauer-Jung, S; Mitchell-Olds, T; Guigó, R

    2001-09-01

    Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors.

  16. The nucleotide sequences of 5S rRNAs from a sea-cucumber, a starfish and a sea-urchin.

    PubMed Central

    Ohama, T; Hori, H; Osawa, S

    1983-01-01

    The nucleotide sequences of 5S rRNA from three echinoderms, a sea-cucumber Stichopus oshimae, a starfish Asterina pectinifera and a sea-urchin Hemicentrotus pulcherrimus, have been determined. These 5S rRNAs are all 120 nucleotides long. The echinoderm sequences are more closely related to the sequences of protostome animals such as molluscs, annelids and some others (87% identity on average) than to those of vertebrates (82% identity on average). PMID:6878041

  17. A probabilistic coding based quantum genetic algorithm for multiple sequence alignment.

    PubMed

    Huo, Hongwei; Xie, Qiaoluan; Shen, Xubang; Stojkovic, Vojislav

    2008-01-01

    This paper presents an original Quantum Genetic algorithm for Multiple sequence ALIGNment (QGMALIGN) that combines a genetic algorithm and a quantum algorithm. A quantum probabilistic coding is designed for representing the multiple sequence alignment. A quantum rotation gate is used as a mutation operator to guide the quantum state evolution. Six genetic operators are designed on the coding basis to improve the solution during the evolutionary process. The features of implicit parallelism and state superposition in quantum mechanics and the global search capability of the genetic algorithm are exploited to get efficient computation. A set of well-known test cases from BAliBASE2.0 is used as a reference to evaluate the efficiency of the QGMALIGN optimization. The QGMALIGN results have been compared with the results of the most popular methods (CLUSTALX, SAGA, DIALIGN, SB_PIMA, and QGMALIGN). The results show that QGMALIGN performs well on the presented biological data. The addition of genetic operators to the quantum algorithm lowers the cost of overall running time.

  18. QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors

    PubMed Central

    Gudyś, Adam; Deorowicz, Sebastian

    2014-01-01

    Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, the variant of MSAProbs customised for graphics processors. We selected the two most time consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on quad-core PC equipped with high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than original CPU-parallel MSAProbs. Additional tests performed on several protein families from Pfam database give overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors. PMID:24586435

  19. Resolving the multiple sequence alignment problem using biogeography-based optimization with multiple populations.

    PubMed

    Zemali, El-Amine; Boukra, Abdelmadjid

    2015-08-01

    Multiple sequence alignment (MSA) is one of the most challenging problems in bioinformatics; it involves discovering similarity between a set of protein or DNA sequences. This paper introduces a new method for the MSA problem called biogeography-based optimization with multiple populations (BBOMP). It is based on a recent metaheuristic inspired by the mathematics of biogeography, named biogeography-based optimization (BBO). To improve the exploration ability of BBO, we have introduced a new concept allowing better exploration of the search space. It consists of manipulating multiple populations, each having its own parameters. These parameters are used to build up progressive alignments allowing more diversity. At each iteration, the best solution found is injected into each population. Moreover, to improve solution quality, six operators are defined. These operators are selected with a dynamic probability which changes according to the operators' efficiency. In order to test the performance of the proposed approach, we have considered a set of datasets from Balibase 2.0 and compared it with many recent algorithms such as GAPAM, MSA-GA, QEAMSA and RBT-GA. The results show that the proposed approach achieves a better average score than the previously cited methods.

  20. Palindrome analyser - A new web-based server for predicting and evaluating inverted repeats in nucleotide sequences.

    PubMed

    Brázda, Václav; Kolomazník, Jan; Lýsek, Jiří; Hároníková, Lucia; Coufal, Jan; Št'astný, Jiří

    2016-09-30

    DNA cruciform structures play an important role in the regulation of natural processes including gene replication and expression, as well as nucleosome structure and recombination. They have also been implicated in the evolution and development of diseases such as cancer and neurodegenerative disorders. Cruciform structures are formed by inverted repeats, and their stability is enhanced by DNA supercoiling and protein binding. They have received broad attention because of their important roles in biology. Computational approaches to study inverted repeats have allowed detailed analysis of genomes. However, currently there are no easily accessible and user-friendly tools that can analyse inverted repeats, especially among long nucleotide sequences. We have developed a web-based server, Palindrome analyser, which is a user-friendly application for analysing inverted repeats in various DNA (or RNA) sequences including genome sequences and oligonucleotides. It allows users to search and retrieve desired gene/nucleotide sequence entries from the NCBI databases, and provides data on length, sequence, locations and energy required for cruciform formation. Palindrome analyser also features an interactive graphical data representation of the distribution of the inverted repeats, with options for sorting according to the length of inverted repeat, length of loop, and number of mismatches. Palindrome analyser can be accessed at http://bioinformatics.ibp.cz.
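
    The kind of search described in this abstract (inverted repeats characterised by repeat length, loop length, and number of mismatches) can be sketched in Python as a naive scan. This is an illustrative assumption, not the server's algorithm: the function names and default parameters are hypothetical, only DNA letters A, C, G, T are handled, and no cruciform energy is computed.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq, min_stem=6, max_stem=10, max_loop=10,
                          max_mismatches=0):
    """Return (start, stem_length, loop_length, mismatches) tuples.

    A hit is a left arm seq[start:start+L], a loop of s bases, and a right
    arm whose reverse complement equals the left arm up to max_mismatches.
    """
    hits = []
    n = len(seq)
    for start in range(n):
        for stem in range(min_stem, max_stem + 1):
            left = seq[start:start + stem]
            for loop in range(max_loop + 1):
                right_start = start + stem + loop
                if right_start + stem > n:
                    break  # right arm would run past the end of seq
                right = seq[right_start:right_start + stem]
                mismatches = sum(
                    x != y for x, y in zip(left, reverse_complement(right))
                )
                if mismatches <= max_mismatches:
                    hits.append((start, stem, loop, mismatches))
    return hits
```

    The scan is quadratic in the window sizes, which is adequate for oligonucleotides; a genome-scale search would need an indexed approach.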

  1. MoD Tools: regulatory motif discovery in nucleotide sequences from co-regulated or homologous genes.

    PubMed

    Pavesi, Giulio; Mereghetti, Paolo; Zambelli, Federico; Stefani, Marco; Mauri, Giancarlo; Pesole, Graziano

    2006-07-01

    Understanding the complex mechanisms regulating gene expression at the transcriptional and post-transcriptional levels is one of the greatest challenges of the post-genomic era. The MoD (MOtif Discovery) Tools web server comprises a set of tools for the discovery of novel conserved sequence and structure motifs in nucleotide sequences, motifs that in turn are good candidates for regulatory activity. The server includes the following programs: Weeder, for the discovery of conserved transcription factor binding sites (TFBSs) in nucleotide sequences from co-regulated genes; WeederH, for the discovery of conserved TFBSs and distal regulatory modules in sequences from homologous genes; RNAProfile, for the discovery of conserved secondary structure motifs in unaligned RNA sequences whose secondary structure is not known. In this way, a given gene can be compared with other co-regulated genes or with its homologs, or its mRNA can be analyzed for conserved motifs regulating its post-transcriptional fate. The web server thus provides researchers with different strategies and methods to investigate the regulation of gene expression, at both the transcriptional and post-transcriptional levels. Available at http://www.pesolelab.it/modtools/ and http://www.beacon.unimi.it/modtools/.

  2. Palindrome analyser - A new web-based server for predicting and evaluating inverted repeats in nucleotide sequences.

    PubMed

    Brázda, Václav; Kolomazník, Jan; Lýsek, Jiří; Hároníková, Lucia; Coufal, Jan; Št'astný, Jiří

    2016-09-30

    DNA cruciform structures play an important role in the regulation of natural processes including gene replication and expression, as well as nucleosome structure and recombination. They have also been implicated in the evolution and development of diseases such as cancer and neurodegenerative disorders. Cruciform structures are formed by inverted repeats, and their stability is enhanced by DNA supercoiling and protein binding. They have received broad attention because of their important roles in biology. Computational approaches to study inverted repeats have allowed detailed analysis of genomes. However, currently there are no easily accessible and user-friendly tools that can analyse inverted repeats, especially among long nucleotide sequences. We have developed a web-based server, Palindrome analyser, which is a user-friendly application for analysing inverted repeats in various DNA (or RNA) sequences including genome sequences and oligonucleotides. It allows users to search and retrieve desired gene/nucleotide sequence entries from the NCBI databases, and provides data on length, sequence, locations and energy required for cruciform formation. Palindrome analyser also features an interactive graphical data representation of the distribution of the inverted repeats, with options for sorting according to the length of inverted repeat, length of loop, and number of mismatches. Palindrome analyser can be accessed at http://bioinformatics.ibp.cz. PMID:27603574

  3. Comparative nucleotide sequences encoding the immunity proteins and the carboxyl-terminal peptides of colicins E2 and E3.

    PubMed Central

    Lau, P C; Rowsome, R W; Zuker, M; Visentin, L P

    1984-01-01

    Using the M13 dideoxy sequencing technique, we have established the DNA sequences of colicins E2 and E3 which encompass the receptor-binding and the catalytic domains of each of the nucleases, and their immunity (imm) genes. The imm gene of plasmid ColE2-P9 is 255 bp long and is separated from the end of the col gene by a dinucleotide. This gene pair is arranged similarly in plasmid ColE3-CA38 except that the intergenic space is 9 bp and the E3 imm gene is one codon shorter than its E2 counterpart. Comparisons of the E2 and E3 imm sequences indicate considerable divergence whereas the receptor-binding domains of both colicins are highly conserved. The two nuclease domains appear to share some sequence homology. A possible evolutionary relationship between colicin E3 and other microbial extracellular ribonucleases is also suggested from the sequence alignment analysis. PMID:6095211

  4. SMRT Sequencing of Long Tandem Nucleotide Repeats in SCA10 Reveals Unique Insight of Repeat Expansion Structure

    PubMed Central

    Landrian, Ivette; Godiska, Ronald; Shanker, Savita; Yu, Fahong; Farmerie, William G.; Ashizawa, Tetsuo

    2015-01-01

    A large, non-coding ATTCT repeat expansion causes the neurodegenerative disorder, spinocerebellar ataxia type 10 (SCA10). In a subset of SCA10 patients, interruption motifs are present at the 5’ end of the expansion and strongly correlate with epileptic seizures. Thus, interruption motifs are a predictor of the epileptic phenotype and are hypothesized to act as a phenotypic modifier in SCA10. Yet, the exact internal sequence structure of SCA10 expansions remains unknown due to limitations in current technologies for sequencing across long extended tracts of tandem nucleotide repeats. We used the third generation sequencing technology, Single Molecule Real Time (SMRT) sequencing, to obtain full-length contiguous expansion sequences, ranging from 2.5 to 4.4 kb in length, from three SCA10 patients with different clinical presentations. We obtained sequence spanning the entire length of the expansion and identified the structure of known and novel interruption motifs within the SCA10 expansion. The exact interruption patterns in expanded SCA10 alleles will allow us to further investigate the potential contributions of these interrupting sequences to the pathogenic modification leading to the epilepsy phenotype in SCA10. Our results also demonstrate that SMRT sequencing is useful for deciphering long tandem repeats that pose as “gaps” in the human genome sequence. PMID:26295943

  5. Unifying bacteria from decaying wood with various ubiquitous Gibbsiella species as G. acetica sp. nov. based on nucleotide sequence similarities and their acetic acid secretion.

    PubMed

    Geider, Klaus; Gernold, Marina; Jock, Susanne; Wensing, Annette; Völksch, Beate; Gross, Jürgen; Spiteller, Dieter

    2015-12-01

    Bacteria were isolated from necrotic apple and pear tree tissue and from dead wood in Germany and Austria as well as from pear tree exudate in China. They were selected for growth at 37 °C, screened for levan production and then characterized as Gram-negative, facultatively anaerobic rods. Alignments of nucleotide sequences from 16S rRNA genes and the housekeeping genes dnaJ, gyrB, recA and rpoB, BLAST searches and phenotypic data confirmed by MALDI-TOF analysis showed that these bacteria belong to the genus Gibbsiella and resembled strains isolated from diseased oaks in Britain and Spain. Gibbsiella-specific PCR primers were designed from the proline isomerase and the levansucrase genes. Acid secretion was investigated by screening for halo formation on calcium carbonate agar and the compound identified by NMR as acetic acid. Its production by Gibbsiella spp. strains was also determined in culture supernatants by GC/MS analysis after derivatization with pentafluorobenzyl bromide. Some strains were differentiated by the PFGE patterns of SpeI digests and by sequence analyses of the lsc and the ppiD genes, and the Chinese Gibbsiella strain was most divergent. The newly investigated bacteria as well as Gibbsiella quercinecans, Gibbsiella dentisursi and Gibbsiella papilionis, isolated in Britain, Spain, Korea and Japan, are taxonomically related Enterobacteriaceae that tolerate and secrete acetic acid. We therefore propose to unify them in the species Gibbsiella acetica sp. nov.

  6. MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment.

    PubMed

    Kumar, Sudhir; Tamura, Koichiro; Nei, Masatoshi

    2004-06-01

    With its theoretical basis firmly established in molecular evolutionary and population genetics, the comparative DNA and protein sequence analysis plays a central role in reconstructing the evolutionary histories of species and multigene families, estimating rates of molecular evolution, and inferring the nature and extent of selective forces shaping the evolution of genes and genomes. The scope of these investigations has now expanded greatly owing to the development of high-throughput sequencing techniques and novel statistical and computational methods. These methods require easy-to-use computer programs. One such effort has been to produce Molecular Evolutionary Genetics Analysis (MEGA) software, with its focus on facilitating the exploration and analysis of the DNA and protein sequence variation from an evolutionary perspective. Currently in its third major release, MEGA3 contains facilities for automatic and manual sequence alignment, web-based mining of databases, inference of the phylogenetic trees, estimation of evolutionary distances and testing evolutionary hypotheses. This paper provides an overview of the statistical methods, computational tools, and visual exploration modules for data input and the results obtainable in MEGA.

  7. AREM: Aligning Short Reads from ChIP-Sequencing by Expectation Maximization

    NASA Astrophysics Data System (ADS)

    Newkirk, Daniel; Biesinger, Jacob; Chon, Alvin; Yokomori, Kyoko; Xie, Xiaohui

    High-throughput sequencing coupled to chromatin immunoprecipitation (ChIP-Seq) is widely used in characterizing genome-wide binding patterns of transcription factors, cofactors, chromatin modifiers, and other DNA binding proteins. A key step in ChIP-Seq data analysis is to map short reads from high-throughput sequencing to a reference genome and identify peak regions enriched with short reads. Although several methods have been proposed for ChIP-Seq analysis, most existing methods only consider reads that can be uniquely placed in the reference genome, and therefore have low power for detecting peaks located within repeat sequences. Here we introduce a probabilistic approach for ChIP-Seq data analysis which utilizes all reads, providing a truly genome-wide view of binding patterns. Reads are modeled using a mixture model corresponding to K enriched regions and a null genomic background. We use maximum likelihood to estimate the locations of the enriched regions, and implement an expectation-maximization (E-M) algorithm, called AREM (aligning reads by expectation maximization), to update the alignment probabilities of each read to different genomic locations. We apply the algorithm to identify genome-wide binding events of two proteins: Rad21, a component of cohesin and a key factor involved in chromatid cohesion, and Srebp-1, a transcription factor important for lipid/cholesterol homeostasis. Using AREM, we were able to identify 19,935 Rad21 peaks and 1,748 Srebp-1 peaks in the mouse genome with high confidence, including 1,517 (7.6%) Rad21 peaks and 227 (13%) Srebp-1 peaks that were missed using only uniquely mapped reads. The open source implementation of our algorithm is available at http://sourceforge.net/projects/arem
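
    The expectation-maximization update of read alignment probabilities described in this abstract can be illustrated with a deliberately simplified Python sketch. It is not the AREM model (which uses K enriched regions plus a genomic background): it assumes each read lists its candidate locations as labels, treats each location's weight as a multinomial parameter, and uses hypothetical names throughout.

```python
def em_multiread(candidates, n_iter=50):
    """Estimate per-location weights and per-read alignment probabilities.

    candidates: list of collections, one per read, holding the distinct
    locations (any hashable labels) the read can map to.
    Returns (theta, posteriors), where posteriors[r][loc] is the probability
    that read r originates from loc, computed in the final E-step.
    """
    locations = sorted({loc for c in candidates for loc in c})
    theta = {loc: 1.0 / len(locations) for loc in locations}
    posteriors = []
    for _ in range(n_iter):
        # E-step: split each read across its candidates in proportion to theta.
        counts = dict.fromkeys(locations, 0.0)
        posteriors = []
        for c in candidates:
            z = sum(theta[loc] for loc in c)
            p = {loc: theta[loc] / z for loc in c}
            posteriors.append(p)
            for loc, w in p.items():
                counts[loc] += w
        # M-step: re-estimate location weights from the fractional counts.
        total = sum(counts.values())
        theta = {loc: counts[loc] / total for loc in locations}
    return theta, posteriors
```

    In this toy setting, a read that maps equally well to two locations is pulled toward the one supported by more uniquely mapped reads, which is the intuition behind using all reads rather than only unique ones.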

  8. Nucleotide sequence variation of the VP7 gene of two G3-type rotaviruses isolated from dogs.

    PubMed

    Martella, V; Pratelli, A; Greco, G; Gentile, M; Fiorente, P; Tempesta, M; Buonavoglia, C

    2001-04-01

    The sequence of the VP7 gene of two rotaviruses isolated from dogs in southern Italy was determined and the inferred amino acid sequence was compared with that of other rotavirus strains. There was very high nucleotide and amino acid identity between canine strain RV198/95 and other canine strains, and to the human strain HCR3A. Strain RV52/96, however, was found to have about 95% identity to the G3 serotype canine strains K9, A79-10 and CU-1 and 96% identity to strain RV198/95 and to the simian strain RRV. Therefore both of the canine strains belong to the G3 serotype. Nevertheless, detailed analysis of the VP7 variable regions revealed that RV52/96 possesses amino acid substitutions uncommon to the other canine isolates. In addition, strain RV52/96 exhibited a nucleotide divergence greater than 16% from all the other canine strains studied; however, it revealed the closest identity (90.4%) to the simian strain RRV. With only a few exceptions, phylogenetic analysis allowed clear differentiation of the G3 rotaviruses on the basis of the species of origin. The nucleotide and amino acid variations observed in strain RV52/96 could account for the existence of a canine rotavirus G3 sub-type. PMID:11226570

  9. Nucleotide sequences and mutations of the 5'-nontranslated region (5'NTR) of natural isolates of an epidemic echovirus 11' (prime).

    PubMed

    Szendrői, A; El-Sageyer, M; Takács, M; Mezey, I; Berencsi, G

    2000-01-01

    An echovirus 11' (prime) virus caused an epidemic in Hungary in 1989. The leading clinical form of the disease was myocarditis. Hemorrhagic hepatitis syndromes were also caused, however, with lethal outcome in 13 newborn babies. Altogether 386 children suffered from registered clinical disease. No accumulation of serous meningitis cases and intrauterine death were observed during the epidemic, and the monovalent oral poliovirus vaccination campaign has prevented the further circulation of the virus. The 5'-nontranslated region (5'-NTR) of 12 natural isolates was sequenced (nucleotides: 260-577). The 5'-NTR was found to be different from that of the prototype Gregory strain (X80059) of EV11 (less than 90% identity), but related to the swine vesicular disease virus (D16364) SVDV and EV9 (X92886) as indicated by the best fitting dendrogram. The examination of the variable nucleotides in the internal ribosomal entry site (IRES) revealed that the nucleotide sequence of a region of the epidemic 5'-NTR was identical to that of coxsackievirus B2. Five of the epidemic isolates were found to carry mutations. Seven EV11' IRES elements possessed identical sequences, indicating that the virus had evolved before its arrival in Hungary. The comparative examination of the suboptimal secondary structures revealed that none of the mutations affected the secondary structure of stem-loop structures IV and V in the IRES elements. Although it has been shown previously that the echovirus group is genetically coherent and related to coxsackie B viruses, the sequence differences in the epidemic isolates resulted in profound modification of the central stem (residues 477-529) of stem-loop structure No. V, known to affect neurovirulence of polioviruses. Two alternate cloverleaf (stem-loop) structures were also recognised (nucleotides 376 to 460 and 540 to 565) which seem to mask both regions of the IRES element complementary to the 3'-end of the 18 S rRNA (460 to 466 and 561 to 570

  10. Molecular cloning, nucleotide sequence, and expression of a carboxypeptidase-encoding gene from the archaebacterium Sulfolobus solfataricus.

    PubMed

    Colombo, S; Toietta, G; Zecca, L; Vanoni, M; Tortora, P

    1995-10-01

    Mammalian metallocarboxypeptidases play key roles in major biological processes, such as digestive-protein degradation and specific proteolytic processing. A Sulfolobus solfataricus gene (cpsA) encoding a recently described zinc carboxypeptidase with an unusually broad substrate specificity was cloned, sequenced, and expressed in Escherichia coli. Despite the lack of overall sequence homology with known carboxypeptidases, seven homology blocks, including the Zn-coordinating and catalytic residues, were identified by multiple alignment with carboxypeptidases A, B, and T. S. solfataricus carboxypeptidase expressed in E. coli was found to be enzymatically active, and both its substrate specificity and thermostability were comparable to those of the purified S. solfataricus enzyme. PMID:7559343

  11. Molecular cloning, nucleotide sequence, and expression of a carboxypeptidase-encoding gene from the archaebacterium Sulfolobus solfataricus.

    PubMed Central

    Colombo, S; Toietta, G; Zecca, L; Vanoni, M; Tortora, P

    1995-01-01

    Mammalian metallocarboxypeptidases play key roles in major biological processes, such as digestive-protein degradation and specific proteolytic processing. A Sulfolobus solfataricus gene (cpsA) encoding a recently described zinc carboxypeptidase with an unusually broad substrate specificity was cloned, sequenced, and expressed in Escherichia coli. Despite the lack of overall sequence homology with known carboxypeptidases, seven homology blocks, including the Zn-coordinating and catalytic residues, were identified by multiple alignment with carboxypeptidases A, B, and T. S. solfataricus carboxypeptidase expressed in E. coli was found to be enzymatically active, and both its substrate specificity and thermostability were comparable to those of the purified S. solfataricus enzyme. PMID:7559343

  13. An Interpretation of the Ancestral Codon from Miller’s Amino Acids and Nucleotide Correlations in Modern Coding Sequences

    PubMed Central

    Carels, Nicolas; de Leon, Miguel Ponce

    2015-01-01

    Purine bias, which is usually referred to as an “ancestral codon”, is known to result in short-range correlations between nucleotides in coding sequences, and it is common in all species. We demonstrate that RWY is a more appropriate pattern than the classical RNY, and purine bias (Rrr) is the product of a network of nucleotide compensations induced by functional constraints on the physicochemical properties of proteins. Through deductions from universal correlation properties, we also demonstrate that amino acids from Miller’s spark discharge experiment are compatible with functional primeval proteins at the dawn of living cell radiation on earth. These amino acids match the hydropathy and secondary structures of modern proteins. PMID:25922573
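
    The codon patterns named in this abstract use IUPAC ambiguity codes (R = A or G, W = A or T, Y = C or T, N = any base). A minimal Python sketch, with hypothetical function names, of how the fraction of in-frame codons matching RNY or RWY could be counted in a coding sequence; it assumes the sequence starts in frame and ignores any trailing partial codon.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "W": "AT",    # weak
    "N": "ACGT",  # any base
}


def codon_matches(codon, pattern):
    """True if each base of codon is allowed by the IUPAC pattern."""
    return all(base in IUPAC[code] for base, code in zip(codon, pattern))


def codon_pattern_fraction(seq, pattern):
    """Fraction of complete in-frame codons of seq that match pattern."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    return sum(codon_matches(c, pattern) for c in codons) / len(codons)
```

    Because W is a subset of N, every codon that matches RWY also matches RNY, so the RWY fraction can never exceed the RNY fraction for the same sequence.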

  14. NanoOK: multi-reference alignment analysis of nanopore sequencing data, quality and error profiles

    PubMed Central

    Leggett, Richard M.; Heavens, Darren; Caccamo, Mario; Clark, Matthew D.; Davey, Robert P.

    2016-01-01

    Motivation: The Oxford Nanopore MinION sequencer, currently in pre-release testing through the MinION Access Programme (MAP), promises long reads in real-time from an inexpensive, compact, USB device. Tools have been released to extract FASTA/Q from the MinION base calling output and to provide basic yield statistics. However, no single tool yet exists to provide comprehensive alignment-based quality control and error profile analysis—something that is extremely important given the speed with which the platform is evolving. Results: NanoOK generates detailed tabular and graphical output plus an in-depth multi-page PDF report including error profile, quality and yield data. NanoOK is multi-reference, enabling detailed analysis of metagenomic or multiplexed samples. Four popular Nanopore aligners are supported and it is easily extensible to include others. Availability and implementation: NanoOK is an open-source software, implemented in Java with supporting R scripts. It has been tested on Linux and Mac OS X and can be downloaded from https://github.com/TGAC/NanoOK. A VirtualBox VM containing all dependencies and the DH10B read set used in this article is available from http://opendata.tgac.ac.uk/nanook/. A Docker image is also available from Docker Hub—see program documentation https://documentation.tgac.ac.uk/display/NANOOK. Contact: richard.leggett@tgac.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26382197

  15. elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling.

    PubMed

    Herzeel, Charlotte; Costanza, Pascal; Decap, Dries; Fostier, Jan; Reumers, Joke

    2015-01-01

    elPrep is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools and Picard for preparation steps such as filtering, sorting, marking duplicates, reordering contigs, and so on, while producing identical results. What sets elPrep apart is its software architecture that allows executing preparation pipelines by making only a single pass through the data, no matter how many preparation steps are used in the pipeline. elPrep is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly speed up the execution time. For example, for a preparation pipeline of five steps on a whole-exome BAM file (NA12878), we reduce the execution time from about 1:40 hours, when using a combination of SAMtools and Picard, to about 15 minutes when using elPrep, while utilising the same server resources, here 48 threads and 23GB of RAM. For the same pipeline on whole-genome data (NA12878), elPrep reduces the runtime from 24 hours to less than 5 hours. As a typical clinical study may contain sequencing data for hundreds of patients, elPrep can remove several hundreds of hours of computing time, and thus substantially reduce analysis time and cost. PMID:26182406

  16. Alignment of 3D Building Models and TIR Video Sequences with Line Tracking

    NASA Astrophysics Data System (ADS)

    Iwaszczuk, D.; Stilla, U.

    2014-11-01

    Thermal infrared (TIR) imagery of urban areas has become interesting for urban climate investigations and thermal building inspections. Using a flying platform such as a UAV or a helicopter for the acquisition and combining the thermal data with 3D building models via texturing delivers a valuable groundwork for large-area building inspections. However, such thermal textures are useful for further analysis only if they are geometrically correctly extracted. This can be achieved with a good coregistration between the 3D building models and thermal images, which cannot be achieved by direct georeferencing. Hence, this paper presents a methodology for alignment of 3D building models and oblique TIR image sequences taken from a flying platform. In a single image, line correspondences between model edges and image line segments are found using an accumulator approach, and based on these correspondences an optimal camera pose is calculated to ensure the best match between the projected model and the image structures. Along the sequence, the linear features are tracked based on visibility prediction. The results of the proposed methodology are presented using a TIR image sequence taken from a helicopter in a densely built-up urban area. The novelty of this work lies in employing the uncertainty of the 3D building models and in an innovative tracking strategy based on a priori knowledge from the 3D building model and the visibility checking.

  17. elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling.

    PubMed

    Herzeel, Charlotte; Costanza, Pascal; Decap, Dries; Fostier, Jan; Reumers, Joke

    2015-01-01

    elPrep is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools and Picard for preparation steps such as filtering, sorting, marking duplicates, reordering contigs, and so on, while producing identical results. What sets elPrep apart is its software architecture that allows executing preparation pipelines by making only a single pass through the data, no matter how many preparation steps are used in the pipeline. elPrep is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly speed up the execution time. For example, for a preparation pipeline of five steps on a whole-exome BAM file (NA12878), we reduce the execution time from about 1:40 hours, when using a combination of SAMtools and Picard, to about 15 minutes when using elPrep, while utilising the same server resources, here 48 threads and 23GB of RAM. For the same pipeline on whole-genome data (NA12878), elPrep reduces the runtime from 24 hours to less than 5 hours. As a typical clinical study may contain sequencing data for hundreds of patients, elPrep can remove several hundreds of hours of computing time, and thus substantially reduce analysis time and cost.
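
    The single-pass idea described in this abstract can be illustrated with a minimal Python sketch. It is not elPrep's implementation: the record layout (a dict with `flag` and `rname` keys), the two step functions, and `single_pass` are hypothetical names chosen for illustration. It covers only per-record steps; steps such as sorting or duplicate marking need to see more than one record at a time and are not modelled here.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Record = dict  # hypothetical stand-in for one SAM alignment record
Step = Callable[[Record], Optional[Record]]  # a step returns None to drop the record


def drop_unmapped(rec: Record) -> Optional[Record]:
    # SAM flag bit 0x4 marks an unmapped read.
    return None if rec["flag"] & 0x4 else rec


def rename_contig(rec: Record) -> Optional[Record]:
    # Illustrative per-record rewrite: normalise "chr1" to "1".
    out = dict(rec)
    name = rec["rname"]
    out["rname"] = name[3:] if name.startswith("chr") else name
    return out


def single_pass(records: Iterable[Record], steps: List[Step]) -> Iterator[Record]:
    """Apply every step to each record while reading the input only once."""
    for rec in records:
        cur: Optional[Record] = rec
        for step in steps:
            cur = step(cur)
            if cur is None:  # record was filtered out; skip remaining steps
                break
        if cur is not None:
            yield cur


reads = [
    {"flag": 0, "rname": "chr1", "pos": 10},
    {"flag": 4, "rname": "chr2", "pos": 5},
]
kept = list(single_pass(reads, [drop_unmapped, rename_contig]))
```

    Adding a further per-record step only lengthens the `steps` list; the input is still traversed once, which is the property the abstract highlights.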

  19. The identification of complete domains within protein sequences using accurate E-values for semi-global alignment

    PubMed Central

    Kann, Maricel G.; Sheetlin, Sergey L.; Park, Yonil; Bryant, Stephen H.; Spouge, John L.

    2007-01-01

    The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance. PMID:17596268
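
    To make the term concrete, the following is a minimal sketch of semi-global alignment scoring by dynamic programming, in which the whole domain must be aligned but unaligned residues at either end of the protein cost nothing. It is not the HMM-based method of GLOBAL or HMMer; the match, mismatch, and linear gap scores are assumed values chosen only for illustration.

```python
def semi_global_score(domain: str, protein: str,
                      match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best score for aligning all of `domain` to some subsequence of `protein`."""
    m, n = len(domain), len(protein)
    # Row 0: skipping any prefix of the protein is free.
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        # Column 0: domain residues aligned against nothing are penalised.
        cur = [i * gap] + [0] * n
        for j in range(1, n + 1):
            sub = match if domain[i - 1] == protein[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # align the two residues
                         prev[j] + gap,      # domain residue opposite a gap
                         cur[j - 1] + gap)   # protein residue opposite a gap
        prev = cur
    # Last row: skipping any suffix of the protein is free.
    return max(prev)
```

    By contrast, a local alignment would also allow the ends of the domain to be skipped for free, which is why it can report pieces of domains rather than complete ones.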

  20. Characterization and nucleotide sequence of a chicken gene encoding an opal suppressor tRNA and its flanking DNA segments.

    PubMed Central

    Hatfield, D L; Dudock, B S; Eden, F C

    1983-01-01

    A naturally occurring opal suppressor serine tRNA has been purified from chicken liver and used as a probe to isolate the corresponding gene from a library of chicken DNA in bacteriophage lambda. This minor tRNA is encoded by a single-copy gene that is not part of a tRNA gene cluster. DNA sequence analysis of the gene and its flanking DNA segments shows that the gene is encoded in an 87-base-pair segment without intervening sequences and specifies a tRNA that reads the termination codon UGA. This gene has additional nucleotides in the 5' internal promoter region but has a normal 3' internal promoter sequence and the usual termination signal. PMID:6308662

  1. The nucleotide sequences of several tRNA genes from rat mitochondria: common features and relatedness to homologous species.

    PubMed Central

    Cantatore, P; De Benedetto, C; Gadaleta, G; Gallerani, R; Kroon, A M; Holtrop, M; Lanave, C; Pepe, G; Quagliariello, C; Saccone, C; Sbisa, E

    1982-01-01

    We have determined the nucleotide sequences of thirteen rat mt tRNA genes. The features of the primary and secondary structures of these tRNAs show that those for Gln, Ser, and f-Met resemble, while those for Lys, Cys, and Trp depart strikingly from the universal type. The remainder are slightly abnormal. Among many mammalian mt DNA sequences, those of mt tRNA genes are highly conserved, thus suggesting for those genes an additional, perhaps regulatory, function. A simple evolutionary relationship between the tRNAs of animal mitochondria and those of eukaryotic cytoplasm, of lower eukaryotic mitochondria or of prokaryotes, is not evident owing to the extreme divergence of the tRNA sequences in the two groups. However, a slightly higher homology does exist between a few animal mt tRNAs and those from prokaryotes or from lower eukaryotic mitochondria. PMID:7099963

  2. Nucleotide sequences of fic and fic-1 genes involved in cell filamentation induced by cyclic AMP in Escherichia coli.

    PubMed Central

    Kawamukai, M; Matsuda, H; Fujii, W; Utsumi, R; Komano, T

    1989-01-01

    The nucleotide sequences of fic-1 involved in the cell filamentation induced by cyclic AMP in Escherichia coli and its normal counterpart fic were analyzed. The open reading frame of both fic-1 and fic coded for 200 amino acids. The Gly at position 55 in the Fic protein was changed to Arg in the Fic-1 protein. The promoter activity of fic was confirmed by fusing fic and lacZ. The gene downstream from fic was found to be pabA (p-aminobenzoate). There is an open reading frame (ORF190) coding for 190 amino acids upstream from the fic gene. Computer-assisted analysis showed that Fic has sequence similarity with part of CDC28 of Saccharomyces cerevisiae, CDC2 of Schizosaccharomyces pombe, and FtsA of E. coli. In addition, ORF190 has sequence similarity with the cyclosporin A-binding protein cyclophilin. PMID:2546924

  3. IBBOMSA: An Improved Biogeography-based Approach for Multiple Sequence Alignment

    PubMed Central

    Yadav, Rohit Kumar; Banka, Haider

    2016-01-01

    In bioinformatics, multiple sequence alignment (MSA) is an NP-hard problem, so nature-inspired techniques are often used to approximate the solution. In the current study, a novel biogeography-based optimization (NBBO) is proposed to solve the MSA problem. Biogeography-based optimization (BBO) is a relatively new optimization paradigm, but it has deficiencies in solving complicated problems, such as low population diversity and a slow convergence rate. NBBO is an enhanced version of BBO in which a new migration operation is proposed to overcome these limitations. The new migration adopts more information from other habitats, maintains population diversity, and preserves exploitation ability. In the performance analysis, the proposed technique and existing techniques such as VDGA, MOMSA, and GAPAM are tested on publicly available benchmark datasets (i.e., BAliBASE). The proposed method is observed to be superior to, or competitive with, the existing techniques. PMID:27812276

  4. Are sites with multiple single nucleotide variants in cancer genomes a consequence of drivers, hypermutable sites or sequencing errors?

    PubMed Central

    Carr, Antony M.

    2016-01-01

    Across independent cancer genomes it has been observed that some sites have been recurrently hit by single nucleotide variants (SNVs). Such recurrently hit sites might be either (i) drivers of cancer that are positively selected during oncogenesis, (ii) due to mutation rate variation, or (iii) due to sequencing and assembly errors. We have investigated the cause of recurrently hit sites in a dataset of >3 million SNVs from 507 complete cancer genome sequences. We find evidence that many sites have been hit significantly more often than one would expect by chance, even taking into account the effect of the adjacent nucleotides on the rate of mutation. We find that the density of these recurrently hit sites is higher in non-coding than coding DNA and hence conclude that most of them are unlikely to be drivers. We also find that most of them are found in parts of the genome that are not uniquely mappable and hence are likely to be due to mapping errors. In support of the error hypothesis, we find that recurrently hit sites are not randomly distributed across sequences from different laboratories. We fit a model to the data in which the rate of mutation is constant across sites but the rate of error varies. This model suggests that ∼4% of all SNVs are errors in this dataset, but that the rate of error varies by thousands-of-fold between sites. PMID:27688957

  6. Nucleotide sequence of dengue 2 RNA and comparison of the encoded proteins with those of other flaviviruses.

    PubMed

    Hahn, Y S; Galler, R; Hunkapiller, T; Dalrymple, J M; Strauss, J H; Strauss, E G

    1988-01-01

    We have determined the complete sequence of the RNA of dengue 2 virus (S1 candidate vaccine strain derived from the PR-159 isolate) with the exception of about 15 nucleotides at the 5' end. The genome organization is the same as that deduced earlier for other flaviviruses and the amino acid sequences of the encoded dengue 2 proteins show striking homology to those of other flaviviruses. The overall amino acid sequence similarity between dengue 2 and yellow fever virus is 44.7%, whereas that between dengue 2 and West Nile virus is 50.7%. These viruses represent three different serological subgroups of mosquito-borne flaviviruses. Comparison of the amino acid sequences shows that amino acid sequence homology is not uniformly distributed among the proteins; highest homology is found in some domains of nonstructural protein NS5 and lowest homology in the hydrophobic polypeptides ns2a and 2b. In general the structural proteins are less well conserved than the nonstructural proteins. Hydrophobicity profiles, however, are remarkably similar throughout the translated region. Comparison of the dengue 2 PR-159 sequence to partial sequence data from dengue 4 and another strain of dengue 2 virus reveals amino acid sequence homologies of about 64 and 96%, respectively, in the structural protein region. Thus as a general rule for flaviviruses examined to date, members of different serological subgroups demonstrate 50% or less amino acid sequence homology, members of the same subgroup average 65-75% homology, and strains of the same virus demonstrate greater than 95% amino acid sequence similarity.

  7. SP-Designer: a user-friendly program for designing species-specific primer pairs from DNA sequence alignments.

    PubMed

    Villard, Pierre; Malausa, Thibaut

    2013-07-01

    SP-Designer is an open-source program providing a user-friendly tool for the design of specific PCR primer pairs from a DNA sequence alignment containing sequences from various taxa. SP-Designer selects PCR primer pairs for the amplification of DNA from a target species on the basis of several criteria: (i) primer specificity, as assessed by interspecific sequence polymorphism in the annealing regions, (ii) the biochemical characteristics of the primers and (iii) the intended PCR conditions. SP-Designer generates tables, detailing the primer pair and PCR characteristics, and a FASTA file locating the primer sequences in the original sequence alignment. SP-Designer is Windows-compatible and freely available from http://www2.sophia.inra.fr/urih/sophia_mart/sp_designer/info_sp_designer.php.
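
    Criterion (i) can be illustrated with a small Python sketch that counts, for a candidate annealing window, how many aligned positions distinguish the target species from its closest non-target sequence. This is not SP-Designer's algorithm; the function name, the dict-based alignment layout, and the species labels are hypothetical, and the sketch ignores gaps, primer chemistry, and PCR conditions.

```python
def min_mismatches(alignment: dict, target: str, start: int, length: int) -> int:
    """Fewest mismatches between the target window and any non-target sequence.

    `alignment` maps a sequence name to an aligned sequence of equal length.
    A larger return value means the window discriminates the target better.
    """
    others = [seq for name, seq in alignment.items() if name != target]
    if not others:
        raise ValueError("alignment needs at least one non-target sequence")
    window = alignment[target][start:start + length]
    return min(
        sum(a != b for a, b in zip(window, seq[start:start + length]))
        for seq in others
    )


# Hypothetical three-species alignment used only to exercise the function.
aln = {
    "sp_A": "ACGTACGTAC",
    "sp_B": "ACGTTCGTAC",
    "sp_C": "ACCTACGAAC",
}
```

    With this toy alignment, the full ten-column window separates `sp_A` from its nearest neighbour by one position, while the first four columns do not separate it from `sp_B` at all.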

  8. DNA sequencing by a single molecule detection of labeled nucleotides sequentially cleaved from a single strand of DNA

    SciTech Connect

    Goodwin, P.M.; Schecker, J.A.; Wilkerson, C.W.; Hammond, M.L.; Ambrose, W.P.; Jett, J.H.; Martin, J.C.; Marrone, B.L.; Keller, R.A.; Haces, A.; Shih, P.J.; Harding, J.D.

    1993-01-01

    We are developing a laser-based technique for the rapid sequencing of large DNA fragments (several kb in size) at a rate of 100 to 1000 bases per second. Our approach relies on fluorescent labeling of the bases in a single fragment of DNA, attachment of this labeled DNA fragment to a support, movement of the supported DNA into a flowing sample stream, sequential cleavage of the end nucleotide from the DNA fragment with an exonuclease, and detection of the individual fluorescently labeled bases by laser-induced fluorescence.

  9. DNA sequencing by a single molecule detection of labeled nucleotides sequentially cleaved from a single strand of DNA

    SciTech Connect

    Goodwin, P.M.; Schecker, J.A.; Wilkerson, C.W.; Hammond, M.L.; Ambrose, W.P.; Jett, J.H.; Martin, J.C.; Marrone, B.L.; Keller, R.A.; Haces, A.; Shih, P.J.; Harding, J.D.

    1993-02-01

    We are developing a laser-based technique for the rapid sequencing of large DNA fragments (several kb in size) at a rate of 100 to 1000 bases per second. Our approach relies on fluorescent labeling of the bases in a single fragment of DNA, attachment of this labeled DNA fragment to a support, movement of the supported DNA into a flowing sample stream, sequential cleavage of the end nucleotide from the DNA fragment with an exonuclease, and detection of the individual fluorescently labeled bases by laser-induced fluorescence.

  10. Analysis of a nucleotide-binding site of 5-lipoxygenase by affinity labelling: binding characteristics and amino acid sequences.

    PubMed Central

    Zhang, Y Y; Hammarberg, T; Radmark, O; Samuelsson, B; Ng, C F; Funk, C D; Loscalzo, J

    2000-01-01

    5-Lipoxygenase (5LO) catalyses the first two steps in the biosynthesis of leukotrienes, which are inflammatory mediators derived from arachidonic acid. 5LO activity is stimulated by ATP; however, a consensus ATP-binding site or nucleotide-binding site has not been found in its protein sequence. In the present study, affinity and photoaffinity labelling of 5LO with 5'-p-fluorosulphonylbenzoyladenosine (FSBA) and 2-azido-ATP showed that 5LO bound to the ATP analogues quantitatively and specifically and that the incorporation of either analogue inhibited ATP stimulation of 5LO activity. The stoichiometry of the labelling was 1.4 mol of FSBA/mol of 5LO (of which ATP competed with 1 mol/mol) or 0.94 mol of 2-azido-ATP/mol of 5LO (of which ATP competed with 0.77 mol/mol). Labelling with FSBA prevented further labelling with 2-azido-ATP, indicating that the same binding site was occupied by both analogues. Other nucleotides (ADP, AMP, GTP, CTP and UTP) also competed with 2-azido-ATP labelling, suggesting that the site was a general nucleotide-binding site rather than a strict ATP-binding site. Ca(2+), which also stimulates 5LO activity, had no effect on the labelling of the nucleotide-binding site. Digestion with trypsin and peptide sequencing showed that two fragments of 5LO were labelled by 2-azido-ATP. These fragments correspond to residues 73-83 (KYWLNDDWYLK, in single-letter amino acid code) and 193-209 (FMHMFQSSWNDFADFEK) in the 5LO sequence. Trp-75 and Trp-201 in these peptides were modified by the labelling, suggesting that they were immediately adjacent to the C-2 position of the adenine ring of ATP. Given the stoichiometry of the labelling, the two peptide sequences of 5LO were probably near each other in the enzyme's tertiary structure, composing or surrounding the ATP-binding site of 5LO. PMID:11042125

  11. PerPlot & PerScan: tools for analysis of DNA curvature-related periodicity in genomic nucleotide sequences

    PubMed Central

    2011-01-01

    Background: Periodic spacing of short adenine or thymine runs phased with DNA helical period of ~10.5 bp is associated with intrinsic DNA curvature and deformability, which play important roles in DNA-protein interactions and in the organization of chromosomes in both eukaryotes and prokaryotes. Local differences in DNA sequence periodicity have been linked to differences in gene expression in some organisms. Despite the significance of these periodic patterns, there are virtually no publicly accessible tools for their analysis. Results: We present novel tools suitable for assessments of DNA curvature-related sequence periodicity in nucleotide sequences at the genome scale. Utility of the present software is demonstrated on a comparison of sequence periodicities in the genomes of Haemophilus influenzae, Methanocaldococcus jannaschii, Saccharomyces cerevisiae, and Arabidopsis thaliana. The software can be accessed through a web interface and the programs are also available for download. Conclusions: The present software is suitable for comparing DNA curvature-related sequence periodicity among different genomes as well as for analysis of intrachromosomal heterogeneity of the sequence periodicity. It provides a quick and convenient way to detect anomalous regions of chromosomes that could have unusual structural and functional properties and/or distinct evolutionary history. PMID:22587738
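
    One simple way to look for the kind of periodic spacing described here is to count how often short A or T runs recur at each distance. The Python sketch below is an illustration only and is not the PerPlot or PerScan algorithm: it assumes that an "AA" or "TT" dinucleotide is an adequate proxy for a short run, and the function name and `max_dist` parameter are hypothetical.

```python
def spacing_counts(seq: str, max_dist: int = 30) -> list:
    """counts[d] = number of AA/TT start positions that recur d bases downstream."""
    seq = seq.upper()
    starts = [i for i in range(len(seq) - 1) if seq[i:i + 2] in ("AA", "TT")]
    present = set(starts)
    counts = [0] * (max_dist + 1)
    for i in starts:
        for d in range(1, max_dist + 1):
            if i + d in present:
                counts[d] += 1
    return counts


# Synthetic sequence with an exact 10 bp repeat, used only to exercise the code.
toy = ("AA" + "GCGCGCGC") * 5
toy_counts = spacing_counts(toy)
```

    On real genomic sequence, an excess of counts near multiples of 10 to 11 bp relative to neighbouring distances would be the signal of interest; the synthetic example simply places all counts at multiples of 10.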

  12. Phylogeny of Populus (Salicaceae) based on nucleotide sequences of chloroplast TRNT-TRNF region and nuclear rDNA.

    PubMed

    Hamzeh, Mona; Dayanandan, Selvadurai

    2004-09-01

    The species of the genus Populus, collectively known as poplars, are widely distributed over the northern hemisphere and well known for their ecological, economical, and evolutionary importance. The extensive interspecific hybridization and high morphological diversity in this group pose difficulties in identifying taxonomic units for comparative evolutionary studies and systematics. To understand the evolutionary relationships among poplars and to provide a framework for biosystematic classification, we reconstructed a phylogeny of the genus Populus based on nucleotide sequences of three noncoding regions of the chloroplast DNA (intron of trnL and intergenic regions of trnT-trnL and trnL-trnF) and ITS1 and ITS2 of the nuclear rDNA. The resulting phylogenetic trees showed polyphyletic relationships among species in the sections Tacamahaca and Aigeiros. Based on chloroplast DNA sequence data, P. nigra had a close affinity to species of section Populus, whereas nuclear DNA sequence data suggested a close relationship between P. nigra and species of the section Aigeiros, suggesting a possible hybrid origin for P. nigra. Similarly, the chloroplast DNA sequences of P. tristis and P. szechuanica were similar to that of the species of section Aigeiros, while the nuclear sequences revealed a close affinity to species of the section Tacamahaca, suggesting a hybrid origin for these two Asiatic balsam poplars. The incongruence between phylogenetic trees based on nuclear- and chloroplast-DNA sequence data suggests a reticulate evolution in the genus Populus.

  13. Infectivity and complete nucleotide sequence of the genome of a genetically distinct strain of maize streak virus from Reunion Island.

    PubMed

    Peterschmitt, M; Granier, M; Frutos, R; Reynaud, B

    1996-01-01

    A complete infectious genome of an isolate of maize streak subgroup 1 geminivirus from Reunion Island (MSV-R) was cloned and sequenced. Using an Agrobacterium tumefaciens Ti plasmid delivery system, the cloned 2.7 kb circular DNA was shown to be infectious in maize. The agroinfected virus could be transmitted by Cicadulina mbila, the most common vector species of MSV in Reunion. Analysis of open reading frames (ORFs) revealed seven potential coding regions including the 4 ORFs conserved in all geminiviruses infecting monocotyledonous plants, the 2 on the viral "+" strand (MP, CP), and the 2 on the complementary "-" strand (RepA, RepB). The nucleotide sequence of MSV-R was compared to previously determined sequence of three African clones from Nigeria (MSV-N), Kenya (MSV-K), and South Africa (MSV-S). More similarity was found between the African clones (97.0-97.3%) than between these and MSV-R (94.4-95.3%). Nucleotide substitutions were frequent in the large intergenic region, particularly in and around the most likely TATA box for the complementary sense genes, and in the 5' end of ORF V1. The comparison of the predicted peptide sequences of the proteins encoded by ORFs MP, RepA and RepB confirmed the higher similarity between the African clones (97.8-99.3%) than between these and MSV-R (95.1-97.1%). However the amino acid sequences of the protein encoded by ORF CP (capsid protein) were very conserved among all the 4 clones, suggesting a high selection pressure on this ORF. PMID:8893787

  14. Molecular cloning, nucleotide sequence, and expression in Escherichia coli of a hemolytic toxin (aerolysin) gene from Aeromonas trota

    SciTech Connect

    Khan, A.A.; Kim, E.; Cerniglia, C.E.

    1998-07-01

    Aeromonas trota AK2, which was derived from ATCC 49659 and produces the extracellular pore-forming hemolytic toxin aerolysin, was mutagenized with the transposon mini-Tn5Km1 to generate a hemolysin-deficient mutant, designated strain AK253. Southern blotting data indicated that an 8.7-kb NotI fragment of the genomic DNA of strain AK253 contained the kanamycin resistance gene of mini-Tn5Km1. The 8.7-kb NotI DNA fragment was cloned into the vector pGEM5Zf(-) by selecting for kanamycin resistance, and the resultant clone, pAK71, showed aerolysin activity in Escherichia coli JM109. The nucleotide sequence of the aerA gene, located on the 1.8-kb ApaI-EcoRI fragment, was determined to consist of 1,479 bp and to have an ATG initiation codon and a TAA termination codon. An in vitro coupled transcription-translation analysis of the 1.8-kb region suggested that the aerA gene codes for a 54-kDa protein, in agreement with nucleotide sequence data. The deduced amino acid sequence of the aerA gene product of A. trota exhibited 99% homology with the amino acid sequence of the aerA product of Aeromonas sobria AB3 and 57% homology with the amino acid sequences of the products of the aerA genes of Aeromonas salmonicida 17-2 and A. sobria 33.

  15. Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting.

    PubMed

    Nguyen, Thuy-Diem; Schmidt, Bertil; Zheng, Zejun; Kwoh, Chee-Keong

    2015-01-01

    De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clustering approach that is computed on a genetic distance matrix derived from an all-against-all read comparison by pairwise sequence alignment. However, most existing dendrogram-based tools have difficulty processing datasets larger than 10,000 unique reads due to high computational complexity. We address this difficulty by developing two efficient algorithms for CRiSPy: a compute-efficient GPU-accelerated parallel algorithm for pairwise distance matrix computation and a memory-efficient hierarchical clustering algorithm. Our experiments on various datasets with distinct attributes show that CRiSPy is able to produce more accurate OTU groupings than most OTU clustering applications. PMID:26451819
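
    The idea of replacing a fixed similarity threshold with a cutoff derived from the dendrogram can be sketched as follows. This is not CRiSPy's anomaly detection technique: the sketch assumes that the "abrupt change" can be approximated by the largest gap between consecutive merge heights, and both function names are hypothetical.

```python
def dynamic_cutoff(merge_heights) -> float:
    """Place the cut in the middle of the largest jump between sorted merge heights."""
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    gaps = [b - a for a, b in zip(heights, heights[1:])]
    k = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest jump
    return (heights[k] + heights[k + 1]) / 2


def cluster_count(merge_heights, cutoff: float) -> int:
    """Clusters left after cutting: every merge above the cutoff is undone."""
    return sum(h > cutoff for h in merge_heights) + 1


# Hypothetical merge heights (genetic distances) from a six-read dendrogram.
heights = [0.01, 0.02, 0.03, 0.20, 0.25]
cut = dynamic_cutoff(heights)
```

    Here the cut falls between 0.03 and 0.20, so the two highest merges are undone and three clusters remain, whereas a fixed cutoff of 0.03 (97 percent similarity) would have been chosen without reference to the data.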

  16. Rapid DNA Sequencing by Direct Nanoscale Reading of Nucleotide Bases on Individual DNA Chains

    SciTech Connect

    Lee, James Weifu; Meller, Amit

    2007-01-01

    Since the independent invention of DNA sequencing by Sanger and by Gilbert 30 years ago, it has grown from a small scale technique capable of reading several kilobase-pair of sequence per day into today's multibillion dollar industry. This growth has spurred the development of new sequencing technologies that do not involve either electrophoresis or Sanger sequencing chemistries. Sequencing by Synthesis (SBS) involves multiple parallel micro-sequencing addition events occurring on a surface, where data from each round is detected by imaging. New High Throughput Technologies for DNA Sequencing and Genomics is the second volume in the Perspectives in Bioanalysis series, which looks at the electroanalytical chemistry of nucleic acids and proteins, development of electrochemical sensors and their application in biomedicine and in the new fields of genomics and proteomics. The authors have expertly formatted the information for a wide variety of readers, including new developments that will inspire students and young scientists to create new tools for science and medicine in the 21st century. Reviews of complementary developments in Sanger and SBS sequencing chemistries, capillary electrophoresis and microdevice integration, MS sequencing and applications set the framework for the book.

  17. Domain structures and molecular evolution of class I and class II major histocompatibility gene complex (MHC) products deduced from amino acid and nucleotide sequence homologies

    NASA Astrophysics Data System (ADS)

    Ohnishi, Koji

    1984-12-01

    Domain structures of class I and class II MHC products were analyzed from the viewpoint of amino acid and nucleotide sequence homologies. Alignment statistics revealed that class I (transplantation) antigen H chains consist of four mutually homologous domains, and that class II (HLA-DR) antigen β and α chains are both composed of three mutually homologous ones. The N-terminal three and two domains of class I and class II (both β and α) gene products, respectively, all of which are ~90 residues long, were concluded to be homologous to β2-microglobulin (β2M). The membrane-embedded C-terminal shorter domains of these MHC products were also found to be homologous to one another and to the third domain of class I H chains. Class I H chains were found to be more closely related to class II α chains than to class II β chains. Based on these findings, an exon duplication history from a common ancestral gene encoding a β2M-like primordial protein of one-domain length up to the contemporary MHC products was proposed.

  18. [Classification of nucleotide sequences over their frequency dictionaries reveals a relation between the structure of sequences and taxonomy of their bearers].

    PubMed

    Gorban', A N; Popova, T G; Sadovskiĭ, M G

    2003-01-01

    Classification of 16S RNA sequences by their frequency dictionaries, both real and transformed, was studied. Two entities were considered structurally close to each other if their frequency dictionaries were close in the Euclidean metric. A transformation procedure for a frequency dictionary was implemented that reveals the peculiarities of the information structure of a nucleotide sequence. A comparative study was carried out of two classifications, one developed over the real frequency dictionary and one developed over the transformed frequency dictionary. A strong correlation was revealed between the classification and the taxonomy of the 16S RNA bearer. For the classes isolated, the information-valuable words were identified. These words are the main factors of difference between the classes. The frequency dictionaries containing words of length 3 exhibit the best correlation between a class and a genus. A genus, as a rule, is included in a single class, and the exceptions are sporadic. Development of a hierarchical classification over the transformed frequency dictionaries separated one or two taxonomic groups at each stage of classification. Unexpectedly frequent or, conversely, unexpectedly rare occurrences of words (of length 3) in the entities under consideration account for the structural difference between the classes of nucleotide sequences.
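
    A frequency dictionary of words of length 3 and the Euclidean distance between two such dictionaries can be computed as in the Python sketch below. It covers only the "real" frequency dictionary; the transformation procedure mentioned in the abstract is not described there and is not reproduced. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def frequency_dictionary(seq: str, k: int = 3) -> dict:
    """Relative frequency of every overlapping word of length k in seq."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: c / total for word, c in counts.items()}


def euclidean_distance(f: dict, g: dict) -> float:
    """Euclidean distance between two dictionaries; absent words count as 0."""
    words = set(f) | set(g)
    return sqrt(sum((f.get(w, 0.0) - g.get(w, 0.0)) ** 2 for w in words))


fd = frequency_dictionary("ACGACG")
```

    For the toy sequence `ACGACG` there are four overlapping triplets, so `ACG` has frequency 0.5 and `CGA` and `GAC` each have 0.25. Sequences can then be grouped by any distance-based clustering method applied to these distances.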

  19. Nucleotide sequence of the pnd gene in plasmid R483 and role of the pnd gene product in plasmolysis.

    PubMed

    Ono, K; Akimoto, S; Ohnishi, Y

    1987-01-01

    The pnd gene of R plasmid R483, like the srnB gene of the F plasmid, increases the degradation of stable RNA in Escherichia coli. The nucleotide sequence of the pnd locus was determined and compared with that of the srnB locus. The genes have open reading frames that are 54% homologous, and both have an upstream inverted repeat sequence. The pnd gene expression seems to decrease the osmotic barrier of the cytoplasmic membrane, since no plasmolytic vacuoles were formed in the cells carrying the gene when the cells were exposed to hypertonic sucrose solution. This result suggests that RNase I in the periplasm passes through the altered membrane to degrade stable RNA in the cytoplasm.

  20. Purification of the gam gene-product of bacteriophage Mu and determination of the nucleotide sequence of the gam gene.

    PubMed Central

    Akroyd, J E; Clayson, E; Higgins, N P

    1986-01-01

    The gam gene of bacteriophage Mu encodes a protein which protects linear double stranded DNA from exonuclease degradation in vitro and in vivo. We purified the Mu gam gene product to apparent homogeneity from cells in which it is over-produced from a plasmid clone. The purified protein is a dimer of identical subunits of 18.9 kd. It can aggregate DNA into large, rapidly sedimenting complexes and is a potent exonuclease inhibitor when bound to DNA. The N-terminal amino acid sequence of the purified protein was determined by automated degradation and the nucleotide sequence of the Mu gam gene is presented to accurately map its position in the Mu genome. PMID:2945162