Science.gov

Sample records for aligned nucleotide sequences

  1. Nucleotide sequence alignment of hdcA from Gram-positive bacteria.

    PubMed

    Diaz, Maria; Ladero, Victor; Redruello, Begoña; Sanchez-Llana, Esther; Del Rio, Beatriz; Fernandez, Maria; Martin, Maria Cruz; Alvarez, Miguel A

    2016-03-01

    The decarboxylation of histidine -carried out mainly by some gram-positive bacteria- yields the toxic dietary biogenic amine histamine (Ladero et al. 2010 〈10.2174/157340110791233256〉 [1], Linares et al. 2016 〈http://dx.doi.org/10.1016/j.foodchem.2015.11.013〉〉 [2]). The reaction is catalyzed by a pyruvoyl-dependent histidine decarboxylase (Linares et al. 2011 〈10.1080/10408398.2011.582813〉 [3]), which is encoded by the gene hdcA. In order to locate conserved regions in the hdcA gene of Gram-positive bacteria, this article provides a nucleotide sequence alignment of all the hdcA sequences from Gram-positive bacteria present in databases. For further utility and discussion, see 〈http://dx.doi.org/ 10.1016/j.foodcont.2015.11.035〉〉 [4]. PMID:26958625

  2. Nucleotide sequence alignment of hdcA from Gram-positive bacteria

    PubMed Central

    Diaz, Maria; Ladero, Victor; Redruello, Begoña; Sanchez-Llana, Esther; del Rio, Beatriz; Fernandez, Maria; Martin, Maria Cruz; Alvarez, Miguel A.

    2016-01-01

    The decarboxylation of histidine -carried out mainly by some gram-positive bacteria- yields the toxic dietary biogenic amine histamine (Ladero et al. 2010 〈10.2174/157340110791233256〉 [1], Linares et al. 2016 〈http://dx.doi.org/10.1016/j.foodchem.2015.11.013〉〉 [2]). The reaction is catalyzed by a pyruvoyl-dependent histidine decarboxylase (Linares et al. 2011 〈10.1080/10408398.2011.582813〉 [3]), which is encoded by the gene hdcA. In order to locate conserved regions in the hdcA gene of Gram-positive bacteria, this article provides a nucleotide sequence alignment of all the hdcA sequences from Gram-positive bacteria present in databases. For further utility and discussion, see 〈http://dx.doi.org/ 10.1016/j.foodcont.2015.11.035〉〉 [4]. PMID:26958625

  3. ANTICALIgN: visualizing, editing and analyzing combined nucleotide and amino acid sequence alignments for combinatorial protein engineering.

    PubMed

    Jarasch, Alexander; Kopp, Melanie; Eggenstein, Evelyn; Richter, Antonia; Gebauer, Michaela; Skerra, Arne

    2016-07-01

    ANTIC ALIGN: is an interactive software developed to simultaneously visualize, analyze and modify alignments of DNA and/or protein sequences that arise during combinatorial protein engineering, design and selection. ANTIC ALIGN: combines powerful functions known from currently available sequence analysis tools with unique features for protein engineering, in particular the possibility to display and manipulate nucleotide sequences and their translated amino acid sequences at the same time. ANTIC ALIGN: offers both template-based multiple sequence alignment (MSA), using the unmutated protein as reference, and conventional global alignment, to compare sequences that share an evolutionary relationship. The application of similarity-based clustering algorithms facilitates the identification of duplicates or of conserved sequence features among a set of selected clones. Imported nucleotide sequences from DNA sequence analysis are automatically translated into the corresponding amino acid sequences and displayed, offering numerous options for selecting reading frames, highlighting of sequence features and graphical layout of the MSA. The MSA complexity can be reduced by hiding the conserved nucleotide and/or amino acid residues, thus putting emphasis on the relevant mutated positions. ANTIC ALIGN: is also able to handle suppressed stop codons or even to incorporate non-natural amino acids into a coding sequence. We demonstrate crucial functions of ANTIC ALIGN: in an example of Anticalins selected from a lipocalin random library against the fibronectin extradomain B (ED-B), an established marker of tumor vasculature. Apart from engineered protein scaffolds, ANTIC ALIGN: provides a powerful tool in the area of antibody engineering and for directed enzyme evolution. PMID:27261456

  4. SNUFER: A software for localization and presentation of single nucleotide polymorphisms using a Clustal multiple sequence alignment output file

    PubMed Central

    Mansur, Marco A B; Cardozo, Giovana P; Santos, Elaine V; Marins, Mozart

    2008-01-01

    SNUFER is a software for the automatic localization and generation of tables used for the presentation of single nucleotide polymorphisms (SNPs). After input of a fasta file containing the sequences to be analyzed, a multiple sequence alignment is generated using ClustalW ran inside SNUFER. The ClustalW output file is then used to generate a table which displays the SNPs detected in the aligned sequences and their degree of similarity. This table can be exported to Microsoft Word, Microsoft Excel or as a single text file, permitting further editing for publication. The software was written using Delphi 7 for programming and FireBird 2.0 for sequence database management. It is freely available for noncommercial use and can be downloaded from http://www.heranza.com.br/bioinformatica2.htm. PMID:19238196

  5. Data in support of the discovery of alternative splicing variants of quail LEPR and the evolutionary conservation of qLEPRl by nucleotide and amino acid sequences alignment.

    PubMed

    Wang, Dandan; Xu, Chunlin; Wang, Taian; Li, Hong; Li, Yanmin; Ren, Junxiao; Tian, Yadong; Li, Zhuanjian; Jiao, Yuping; Kang, Xiangtao; Liu, Xiaojun

    2016-03-01

    Leptin receptor (LEPR) belongs to the class I cytokine receptor superfamily which share common structural features and signal transduction pathways. Although multiple LEPR isoforms, which are derived from one gene, were identified in mammals, they were rarely found in avian except the long LEPR. Four alternative splicing variants of quail LEPR (qLEPR) had been cloned and sequenced for the first time (Wang et al., 2015 [1]). To define patterns of the four splicing variants (qLEPRl, qLEPR-a, qLEPR-b and qLEPR-c) and locate the conserved regions of qLEPRl, this data article provides nucleotide sequence alignment of qLEPR and amino acid sequence alignment of representative vertebrate LEPR. The detailed analysis was shown in [1]. PMID:26759819

  6. Data in support of the discovery of alternative splicing variants of quail LEPR and the evolutionary conservation of qLEPRl by nucleotide and amino acid sequences alignment

    PubMed Central

    Wang, Dandan; Xu, Chunlin; Wang, Taian; Li, Hong; Li, Yanmin; Ren, Junxiao; Tian, Yadong; Li, Zhuanjian; Jiao, Yuping; Kang, Xiangtao; Liu, Xiaojun

    2015-01-01

    Leptin receptor (LEPR) belongs to the class I cytokine receptor superfamily which share common structural features and signal transduction pathways. Although multiple LEPR isoforms, which are derived from one gene, were identified in mammals, they were rarely found in avian except the long LEPR. Four alternative splicing variants of quail LEPR (qLEPR) had been cloned and sequenced for the first time (Wang et al., 2015 [1]). To define patterns of the four splicing variants (qLEPRl, qLEPR-a, qLEPR-b and qLEPR-c) and locate the conserved regions of qLEPRl, this data article provides nucleotide sequence alignment of qLEPR and amino acid sequence alignment of representative vertebrate LEPR. The detailed analysis was shown in [1]. PMID:26759819

  7. Pairwise Sequence Alignment Library

    2015-05-20

    Vector extensions, such as SSE, have been part of the x86 CPU since the 1990s, with applications in graphics, signal processing, and scientific applications. Although many algorithms and applications can naturally benefit from automatic vectorization techniques, there are still many that are difficult to vectorize due to their dependence on irregular data structures, dense branch operations, or data dependencies. Sequence alignment, one of the most widely used operations in bioinformatics workflows, has a computational footprintmore » that features complex data dependencies. The trend of widening vector registers adversely affects the state-of-the-art sequence alignment algorithm based on striped data layouts. Therefore, a novel SIMD implementation of a parallel scan-based sequence alignment algorithm that can better exploit wider SIMD units was implemented as part of the Parallel Sequence Alignment Library (parasail). Parasail features: Reference implementations of all known vectorized sequence alignment approaches. Implementations of Smith Waterman (SW), semi-global (SG), and Needleman Wunsch (NW) sequence alignment algorithms. Implementations across all modern CPU instruction sets including AVX2 and KNC. Language interfaces for C/C++ and Python.« less

  8. Pairwise Sequence Alignment Library

    SciTech Connect

    Jeff Daily, PNNL

    2015-05-20

    Vector extensions, such as SSE, have been part of the x86 CPU since the 1990s, with applications in graphics, signal processing, and scientific applications. Although many algorithms and applications can naturally benefit from automatic vectorization techniques, there are still many that are difficult to vectorize due to their dependence on irregular data structures, dense branch operations, or data dependencies. Sequence alignment, one of the most widely used operations in bioinformatics workflows, has a computational footprint that features complex data dependencies. The trend of widening vector registers adversely affects the state-of-the-art sequence alignment algorithm based on striped data layouts. Therefore, a novel SIMD implementation of a parallel scan-based sequence alignment algorithm that can better exploit wider SIMD units was implemented as part of the Parallel Sequence Alignment Library (parasail). Parasail features: Reference implementations of all known vectorized sequence alignment approaches. Implementations of Smith Waterman (SW), semi-global (SG), and Needleman Wunsch (NW) sequence alignment algorithms. Implementations across all modern CPU instruction sets including AVX2 and KNC. Language interfaces for C/C++ and Python.

  9. Large-scale nucleotide sequence alignment and sequence variability assessment to identify the evolutionarily highly conserved regions for universal screening PCR assay design: an example of influenza A virus.

    PubMed

    Nagy, Alexander; Jiřinec, Tomáš; Černíková, Lenka; Jiřincová, Helena; Havlíčková, Martina

    2015-01-01

    The development of a diagnostic polymerase chain reaction (PCR) or quantitative PCR (qPCR) assay for universal detection of highly variable viral genomes is always a difficult task. The purpose of this chapter is to provide a guideline on how to align, process, and evaluate a huge set of homologous nucleotide sequences in order to reveal the evolutionarily most conserved positions suitable for universal qPCR primer and hybridization probe design. Attention is paid to the quantification and clear graphical visualization of the sequence variability at each position of the alignment. In addition, specific problems related to the processing of the extremely large sequence pool are highlighted. All of these steps are performed using an ordinary desktop computer without the need for extensive mathematical or computational skills. PMID:25697651

  10. The EMBL Nucleotide Sequence Database.

    PubMed

    Stoesser, G; Tuli, M A; Lopez, R; Sterk, P

    1999-01-01

    The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl.html) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. While automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO), the preferred submission tool for individual submitters is Webin (WWW). Through all stages, dataflow is monitored by EBI biologists communicating with the sequencing groups. In collaboration with DDBJ and GenBank the database is produced, maintained and distributed at the European Bioinformatics Institute (EBI). Database releases are produced quarterly and are distributed on CD-ROM. Network services allow access to the most up-to-date data collection via Internet and World Wide Web interface. EBI's Sequence Retrieval System (SRS) is a Network Browser for Databanks in Molecular Biology, integrating and linking the main nucleotide and protein databases, plus many specialised databases. For sequence similarity searching a variety of tools (e.g. Blitz, Fasta, Blast etc) are available for external users to compare their own sequences against the most currently available data in the EMBL Nucleotide Sequence Database and SWISS-PROT. PMID:9847133

  11. Nucleotide sequences 1986/1987

    SciTech Connect

    Not Available

    1987-01-01

    These eight volumes are the third annual published compendium of nucleic acid sequences included in the European Molecular Biology Laboratory Nucleotide Sequence Data Library and the GenBank Genetic Sequences Data Bank. Each volume surveys one or more subdivisions of the database. The volume subtitles are: Primates; Rodents; Other Vertebrates and Invertebrates, Plants and Organelles, Bacteria and Bacteriophage, Viruses, Structural RNA, Synthetic and Unannotated Sequences, and Database Directory and Master Indices.

  12. Automated Identification of Nucleotide Sequences

    NASA Technical Reports Server (NTRS)

    Osman, Shariff; Venkateswaran, Kasthuri; Fox, George; Zhu, Dian-Hui

    2007-01-01

    STITCH is a computer program that processes raw nucleotide-sequence data to automatically remove unwanted vector information, perform reverse-complement comparison, stitch shorter sequences together to make longer ones to which the shorter ones presumably belong, and search against the user s choice of private and Internet-accessible public 16S rRNA databases. ["16S rRNA" denotes a ribosomal ribonucleic acid (rRNA) sequence that is common to all organisms.] In STITCH, a template 16S rRNA sequence is used to position forward and reverse reads. STITCH then automatically searches known 16S rRNA sequences in the user s chosen database(s) to find the sequence most similar to (the sequence that lies at the smallest edit distance from) each spliced sequence. The result of processing by STITCH is the identification of the most similar well-described bacterium. Whereas previously commercially available software for analyzing genetic sequences operates on one sequence at a time, STITCH can manipulate multiple sequences simultaneously to perform the aforementioned operations. A typical analysis of several dozen sequences (length of the order of 103 base pairs) by use of STITCH is completed in a few minutes, whereas such an analysis performed by use of prior software takes hours or days.

  13. Alignment-Annotator web server: rendering and annotating sequence alignments

    PubMed Central

    Gille, Christoph; Fähling, Michael; Weyand, Birgit; Wieland, Thomas; Gille, Andreas

    2014-01-01

    Alignment-Annotator is a novel web service designed to generate interactive views of annotated nucleotide and amino acid sequence alignments (i) de novo and (ii) embedded in other software. All computations are performed at server side. Interactivity is implemented in HTML5, a language native to web browsers. The alignment is initially displayed using default settings and can be modified with the graphical user interfaces. For example, individual sequences can be reordered or deleted using drag and drop, amino acid color code schemes can be applied and annotations can be added. Annotations can be made manually or imported (BioDAS servers, the UniProt, the Catalytic Site Atlas and the PDB). Some edits take immediate effect while others require server interaction and may take a few seconds to execute. The final alignment document can be downloaded as a zip-archive containing the HTML files. Because of the use of HTML the resulting interactive alignment can be viewed on any platform including Windows, Mac OS X, Linux, Android and iOS in any standard web browser. Importantly, no plugins nor Java are required and therefore Alignment-Anotator represents the first interactive browser-based alignment visualization. Availability: http://www.bioinformatics.org/strap/aa/ and http://strap.charite.de/aa/. PMID:24813445

  14. Simplified computer programs for search of homology within nucleotide sequences.

    PubMed Central

    Kröger, M; Kröger-Block, A

    1984-01-01

    Four new computer programs for search of homology within nucleotide sequences are presented. The main scope of the program design is flexibility, independence of sequence length and the capability to be used by any molecular biologist without any prior computer experience. The programs offer a linear search, a search for maximal identity, an alignment along a given sequence and a search based on homology within the amino acid coding capacity of nucleotide sequences. The language is Fortran V. Copies are available on request. PMID:6546417

  15. Nucleotide sequences encoding a thermostable alkaline protease

    DOEpatents

    Wilson, David B.; Lao, Guifang

    1998-01-01

    Nucleotide sequences, derived from a thermophilic actinomycete microorganism, which encode a thermostable alkaline protease are disclosed. Also disclosed are variants of the nucleotide sequences which encode a polypeptide having thermostable alkaline proteolytic activity. Recombinant thermostable alkaline protease or recombinant polypeptide may be obtained by culturing in a medium a host cell genetically engineered to contain and express a nucleotide sequence according to the present invention, and recovering the recombinant thermostable alkaline protease or recombinant polypeptide from the culture medium.

  16. Nucleotide sequences encoding a thermostable alkaline protease

    DOEpatents

    Wilson, D.B.; Lao, G.

    1998-01-06

    Nucleotide sequences, derived from a thermophilic actinomycete microorganism, which encode a thermostable alkaline protease are disclosed. Also disclosed are variants of the nucleotide sequences which encode a polypeptide having thermostable alkaline proteolytic activity. Recombinant thermostable alkaline protease or recombinant polypeptide may be obtained by culturing in a medium a host cell genetically engineered to contain and express a nucleotide sequence according to the present invention, and recovering the recombinant thermostable alkaline protease or recombinant polypeptide from the culture medium. 3 figs.

  17. Long-range correlations in nucleotide sequences

    NASA Astrophysics Data System (ADS)

    Peng, C.-K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Sciortino, F.; Simons, M.; Stanley, H. E.

    1992-03-01

    DNA SEQUENCES have been analysed using models, such as an it-step Markov chain, that incorporate the possibility of short-range nucleotide correlations1. We propose here a method for studying the stochastic properties of nucleotide sequences by constructing a 1:1 map of the nucleotide sequence onto a walk, which we term a 'DNA walk'. We then use the mapping to provide a quantitative measure of the correlation between nucleotides over long distances along the DNA chain. Thus we uncover in the nucleotide sequence a remarkably long-range power law correlation that implies a new scale-invariant property of DNA. We find such long-range correlations in intron-containing genes and in nontranscribed regulatory DNA sequences, but not in complementary DNA sequences or intron-less genes.

  18. Sequence alignment with tandem duplication

    SciTech Connect

    Benson, G.

    1997-12-01

    Algorithm development for comparing and aligning biological sequences has, until recently, been based on the SI model of mutational events which assumes that modification of sequences proceeds through any of the operations of substitution, insertion or deletion (the latter two collectively termed indels). While this model has worked farily well, it has long been apparent that other mutational events occur. In this paper, we introduce a new model, the DSI model which includes another common mutational event, tandem duplication. Tandem duplication produces tandem repeats which are common in DNA, making up perhaps 10% of the human genome. They are responsible for some human diseases and may serve a multitude of functions in DNA regulation and evolution. Using the DSI model, we develop new exact and heuristic algorithms for comparing and aligning DNA sequences when they contain tandem repeats. 30 refs., 3 figs.

  19. Moss Phylogeny Reconstruction Using Nucleotide Pangenome of Complete Mitogenome Sequences.

    PubMed

    Goryunov, D V; Nagaev, B E; Nikolaev, M Yu; Alexeevski, A V; Troitsky, A V

    2015-11-01

    Stability of composition and sequence of genes was shown earlier in 13 mitochondrial genomes of mosses (Rensing, S. A., et al. (2008) Science, 319, 64-69). It is of interest to study the evolution of mitochondrial genomes not only at the gene level, but also on the level of nucleotide sequences. To do this, we have constructed a "nucleotide pangenome" for mitochondrial genomes of 24 moss species. The nucleotide pangenome is a set of aligned nucleotide sequences of orthologous genome fragments covering the totality of all genomes. The nucleotide pangenome was constructed using specially developed new software, NPG-explorer (NPGe). The stable part of the mitochondrial genome (232 stable blocks) is shown to be, on average, 45% of its length. In the joint alignment of stable blocks, 82% of positions are conserved. The phylogenetic tree constructed with the NPGe program is in good correlation with other phylogenetic reconstructions. With the NPGe program, 30 blocks have been identified with repeats no shorter than 50 bp. The maximal length of a block with repeats is 140 bp. Duplications in the mitochondrial genomes of mosses are rare. On average, the genome contains about 500 bp in large duplications. The total length of insertions and deletions was determined in each genome. The losses and gains of DNA regions are rather active in mitochondrial genomes of mosses, and such rearrangements presumably can be used as additional markers in the reconstruction of phylogeny. PMID:26615445

  20. Global Alignment System for Large Genomic Sequencing

    2002-03-01

    AVID is a global alignment system tailored for the alignment of large genomic sequences up to megabases in length. Features include the possibility of one sequence being in draft form, fast alignment, robustness and accuracy. The method is an anchor based alignment using maximal matches derived from suffix trees.

  1. Single Nucleotide Polymorphism Mapping Using Genome-Wide Unique Sequences

    PubMed Central

    Chen, Leslie Y.Y.; Lu, Szu-Hsien; Shih, Edward S.C.; Hwang, Ming-Jing

    2002-01-01

    As more and more genomic DNAs are sequenced to characterize human genetic variations, the demand for a very fast and accurate method to genomically position these DNA sequences is high. We have developed a new mapping method that does not require sequence alignment. In this method, we first identified DNA fragments of 15 bp in length that are unique in the human genome and then used them to position single nucleotide polymorphism (SNP) sequences. By use of four desktop personal computers with AMD K7 (1 GHz) processors, our new method mapped more than 1.6 million SNP sequences in 20 hr and achieved a very good agreement with mapping results from alignment-based methods. PMID:12097348

  2. Statistical analysis of nucleotide sequences.

    PubMed Central

    Stückle, E E; Emmrich, C; Grob, U; Nielsen, P J

    1990-01-01

    In order to scan nucleic acid databases for potentially relevant but as yet unknown signals, we have developed an improved statistical model for pattern analysis of nucleic acid sequences by modifying previous methods based on Markov chains. We demonstrate the importance of selecting the appropriate parameters in order for the method to function at all. The model allows the simultaneous analysis of several short sequences with unequal base frequencies and Markov order k not equal to 0 as is usually the case in databases. As a test of these modifications, we show that in E. coli sequences there is a bias against palindromic hexamers which correspond to known restriction enzyme recognition sites. PMID:2251125

  3. Nucleotide Sequencing and Identification of Some Wild Mushrooms

    PubMed Central

    Das, Sudip Kumar; Mandal, Aninda; Datta, Animesh K.; Gupta, Sudha; Paul, Rita; Saha, Aditi; Sengupta, Sonali; Dubey, Priyanka Kumari

    2013-01-01

    The rDNA-ITS (Ribosomal DNA Internal Transcribed Spacers) fragment of the genomic DNA of 8 wild edible mushrooms (collected from Eastern Chota Nagpur Plateau of West Bengal, India) was amplified using ITS1 (Internal Transcribed Spacers 1) and ITS2 primers and subjected to nucleotide sequence determination for identification of mushrooms as mentioned. The sequences were aligned using ClustalW software program. The aligned sequences revealed identity (homology percentage from GenBank data base) of Amanita hemibapha [CN (Chota Nagpur) 1, % identity 99 (JX844716.1)], Amanita sp. [CN 2, % identity 98 (JX844763.1)], Astraeus hygrometricus [CN 3, % identity 87 (FJ536664.1)], Termitomyces sp. [CN 4, % identity 90 (JF746992.1)], Termitomyces sp. [CN 5, % identity 99 (GU001667.1)], T. microcarpus [CN 6, % identity 82 (EF421077.1)], Termitomyces sp. [CN 7, % identity 76 (JF746993.1)], and Volvariella volvacea [CN 8, % identity 100 (JN086680.1)]. Although out of 8 mushrooms 4 could be identified up to species level, the nucleotide sequences of the rest may be relevant to further characterization. A phylogenetic tree is constructed using Neighbor-Joining method showing interrelationship between/among the mushrooms. The determined nucleotide sequences of the mushrooms may provide additional information enriching GenBank database aiding to molecular taxonomy and facilitating its domestication and characterization for human benefits. PMID:24489501

  4. Nucleotide sequencing and identification of some wild mushrooms.

    PubMed

    Das, Sudip Kumar; Mandal, Aninda; Datta, Animesh K; Gupta, Sudha; Paul, Rita; Saha, Aditi; Sengupta, Sonali; Dubey, Priyanka Kumari

    2013-01-01

    The rDNA-ITS (Ribosomal DNA Internal Transcribed Spacers) fragment of the genomic DNA of 8 wild edible mushrooms (collected from Eastern Chota Nagpur Plateau of West Bengal, India) was amplified using ITS1 (Internal Transcribed Spacers 1) and ITS2 primers and subjected to nucleotide sequence determination for identification of mushrooms as mentioned. The sequences were aligned using ClustalW software program. The aligned sequences revealed identity (homology percentage from GenBank data base) of Amanita hemibapha [CN (Chota Nagpur) 1, % identity 99 (JX844716.1)], Amanita sp. [CN 2, % identity 98 (JX844763.1)], Astraeus hygrometricus [CN 3, % identity 87 (FJ536664.1)], Termitomyces sp. [CN 4, % identity 90 (JF746992.1)], Termitomyces sp. [CN 5, % identity 99 (GU001667.1)], T. microcarpus [CN 6, % identity 82 (EF421077.1)], Termitomyces sp. [CN 7, % identity 76 (JF746993.1)], and Volvariella volvacea [CN 8, % identity 100 (JN086680.1)]. Although out of 8 mushrooms 4 could be identified up to species level, the nucleotide sequences of the rest may be relevant to further characterization. A phylogenetic tree is constructed using Neighbor-Joining method showing interrelationship between/among the mushrooms. The determined nucleotide sequences of the mushrooms may provide additional information enriching GenBank database aiding to molecular taxonomy and facilitating its domestication and characterization for human benefits. PMID:24489501

  5. R3D Align web server for global nucleotide to nucleotide alignments of RNA 3D structures.

    PubMed

    Rahrig, Ryan R; Petrov, Anton I; Leontis, Neocles B; Zirbel, Craig L

    2013-07-01

    The R3D Align web server provides online access to 'RNA 3D Align' (R3D Align), a method for producing accurate nucleotide-level structural alignments of RNA 3D structures. The web server provides a streamlined and intuitive interface, input data validation and output that is more extensive and easier to read and interpret than related servers. The R3D Align web server offers a unique Gallery of Featured Alignments, providing immediate access to pre-computed alignments of large RNA 3D structures, including all ribosomal RNAs, as well as guidance on effective use of the server and interpretation of the output. By accessing the non-redundant lists of RNA 3D structures provided by the Bowling Green State University RNA group, R3D Align connects users to structure files in the same equivalence class and the best-modeled representative structure from each group. The R3D Align web server is freely accessible at http://rna.bgsu.edu/r3dalign/. PMID:23716643

  6. Conditional alignment random fields for multiple motion sequence alignment.

    PubMed

    Kim, Minyoung

    2013-11-01

    We consider the multiple time-series alignment problem, typically focusing on the task of synchronizing multiple motion videos of the same kind of human activity. Finding an optimal global alignment of multiple sequences is infeasible, while there have been several approximate solutions, including iterative pairwise warping algorithms and variants of hidden Markov models. In this paper, we propose a novel probabilistic model that represents the conditional densities of the latent target sequences which are aligned with the given observed sequences through the hidden alignment variables. By imposing certain constraints on the target sequences at the learning stage, we have a sensible model for multiple alignments that can be learned very efficiently by the EM algorithm. Compared to existing methods, our approach yields more accurate alignment while being more robust to local optima and initial configurations. We demonstrate its efficacy on both synthetic and real-world motion videos including facial emotions and human activities. PMID:24051737

  7. R3D Align web server for global nucleotide to nucleotide alignments of RNA 3D structures

    PubMed Central

    Rahrig, Ryan R.; Petrov, Anton I.; Leontis, Neocles B.; Zirbel, Craig L.

    2013-01-01

    The R3D Align web server provides online access to ‘RNA 3D Align’ (R3D Align), a method for producing accurate nucleotide-level structural alignments of RNA 3D structures. The web server provides a streamlined and intuitive interface, input data validation and output that is more extensive and easier to read and interpret than related servers. The R3D Align web server offers a unique Gallery of Featured Alignments, providing immediate access to pre-computed alignments of large RNA 3D structures, including all ribosomal RNAs, as well as guidance on effective use of the server and interpretation of the output. By accessing the non-redundant lists of RNA 3D structures provided by the Bowling Green State University RNA group, R3D Align connects users to structure files in the same equivalence class and the best-modeled representative structure from each group. The R3D Align web server is freely accessible at http://rna.bgsu.edu/r3dalign/. PMID:23716643

  8. The International Nucleotide Sequence Database Collaboration.

    PubMed

    Cochrane, Guy; Karsch-Mizrachi, Ilene; Takagi, Toshihisa

    2016-01-01

    The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org) comprises three global partners committed to capturing, preserving and providing comprehensive public-domain nucleotide sequence information. The INSDC establishes standards, formats and protocols for data and metadata to make it easier for individuals and organisations to submit their nucleotide data reliably to public archives. This work enables the continuous, global exchange of information about living things. Here we present an update of the INSDC in 2015, including data growth and diversification, new standards and requirements by publishers for authors to submit their data to the public archives. The INSDC serves as a model for data sharing in the life sciences. PMID:26657633

  9. Nucleotide sequence of bacteriophage fd DNA.

    PubMed Central

    Beck, E; Sommer, R; Auerswald, E A; Kurz, C; Zink, B; Osterburg, G; Schaller, H; Sugimoto, K; Sugisaki, H; Okamoto, T; Takanami, M

    1978-01-01

    The sequence of the 6,408 nucleotides of bacteriophage fd DNA has been determined. This allows to deduce the exact organisation of the filamentous phage genome and provides easy access to DNA segments of known structure and function. PMID:745987

  10. Complete Nucleotide Sequence of Tn10

    PubMed Central

    Chalmers, Ronald; Sewitz, Sven; Lipkow, Karen; Crellin, Paul

    2000-01-01

    The complete nucleotide sequence of Tn10 has been determined. The dinucleotide signature and percent G+C of the sequence had no discontinuities, indicating that Tn10 constitutes a homogeneous unit. The new sequence contained three new open reading frames corresponding to a glutamate permease, repressors of heavy metal resistance operons, and a hypothetical protein in Bacillus subtilis. The glutamate permease was fully functional when expressed, but Tn10 did not protect Escherichia coli from the toxic effects of various metals. PMID:10781570

  11. The multiple codes of nucleotide sequences.

    PubMed

    Trifonov, E N

    1989-01-01

    Nucleotide sequences carry genetic information of many different kinds, not just instructions for protein synthesis (triplet code). Several codes of nucleotide sequences are discussed including: (1) the translation framing code, responsible for correct triplet counting by the ribosome during protein synthesis; (2) the chromatin code, which provides instructions on appropriate placement of nucleosomes along the DNA molecules and their spatial arrangement; (3) a putative loop code for single-stranded RNA-protein interactions. The codes are degenerate and corresponding messages are not only interspersed but actually overlap, so that some nucleotides belong to several messages simultaneously. Tandemly repeated sequences frequently considered as functionless "junk" are found to be grouped into certain classes of repeat unit lengths. This indicates some functional involvement of these sequences. A hypothesis is formulated according to which the tandem repeats are given the role of weak enhancer-silencers that modulate, in a copy number-dependent way, the expression of proximal genes. Fast amplification and elimination of the repeats provides an attractive mechanism of species adaptation to a rapidly changing environment. PMID:2673451

  12. Comparing compressed sequences for faster nucleotide BLAST searches.

    PubMed

    Cameron, Michael; Williams, Hugh E

    2007-01-01

    Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: BLASTP for searching protein collections and BLASTN for nucleotide collections. Surprisingly, BLASTN has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 BLAST paper and no exact description has been published. It is important that BLASTN is state-of-the-art: Nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and they take many minutes to search on modern general purpose workstations. This paper proposes significant improvements to the BLASTN algorithms. Each of our schemes is based on compressed bytepacked formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of BLASTN with no effect on accuracy and have been integrated into our new version of BLAST that is freely available for download from http://www.fsa-blast.org/. PMID:17666756

  13. Update of AMmtDB: a database of multi-aligned metazoa mitochondrial DNA sequences.

    PubMed

    Lanave, C; Liuni, S; Licciulli, F; Attimonelli, M

    2000-01-01

    The AMmtDB database (http://bio-www.ba.cnr.it:8000/srs6/ ) has been updated by collecting the multi-aligned sequences of Chordata mitochondrial genes coding for proteins and tRNAs. The genes coding for proteins are multi-aligned based on the translated sequences and both the nucleotide and amino acid multi-alignments are provided. AMmtDB data selected through SRS can be viewed and managed using GeneDoc or other programs for the management of multi-aligned data depending on the user's operative system. The multiple alignments have been produced with CLUSTALW and PILEUP programs and then carefully optimized manually. PMID:10592208

  14. Robust temporal alignment of multimodal cardiac sequences

    NASA Astrophysics Data System (ADS)

    Perissinotto, Andrea; Queirós, Sandro; Morais, Pedro; Baptista, Maria J.; Monaghan, Mark; Rodrigues, Nuno F.; D'hooge, Jan; Vilaça, João. L.; Barbosa, Daniel

    2015-03-01

    Given the dynamic nature of cardiac function, correct temporal alignment of pre-operative models and intraoperative images is crucial for augmented reality in cardiac image-guided interventions. As such, the current study focuses on the development of an image-based strategy for temporal alignment of multimodal cardiac imaging sequences, such as cine Magnetic Resonance Imaging (MRI) or 3D Ultrasound (US). First, we derive a robust, modality-independent signal from the image sequences, estimated by computing the normalized cross-correlation between each frame in the temporal sequence and the end-diastolic frame. This signal is a resembler for the left-ventricle (LV) volume curve over time, whose variation indicates different temporal landmarks of the cardiac cycle. We then perform the temporal alignment of these surrogate signals derived from MRI and US sequences of the same patient through Dynamic Time Warping (DTW), allowing to synchronize both sequences. The proposed framework was evaluated in 98 patients, which have undergone both 3D+t MRI and US scans. The end-systolic frame could be accurately estimated as the minimum of the image-derived surrogate signal, presenting a relative error of 1.6 +/- 1.9% and 4.0 +/- 4.2% for the MRI and US sequences, respectively, thus supporting its association with key temporal instants of the cardiac cycle. The use of DTW reduces the desynchronization of the cardiac events in MRI and US sequences, allowing to temporally align multimodal cardiac imaging sequences. Overall, a generic, fast and accurate method for temporal synchronization of MRI and US sequences of the same patient was introduced. This approach could be straightforwardly used for the correct temporal alignment of pre-operative MRI information and intra-operative US images.

  15. Image correlation method for DNA sequence alignment.

    PubMed

    Curilem Saldías, Millaray; Villarroel Sassarini, Felipe; Muñoz Poblete, Carlos; Vargas Vásquez, Asticio; Maureira Butler, Iván

    2012-01-01

    The complexity of searches and the volume of genomic data make sequence alignment one of bioinformatics most active research areas. New alignment approaches have incorporated digital signal processing techniques. Among these, correlation methods are highly sensitive. This paper proposes a novel sequence alignment method based on 2-dimensional images, where each nucleic acid base is represented as a fixed gray intensity pixel. Query and known database sequences are coded to their pixel representation and sequence alignment is handled as object recognition in a scene problem. Query and database become object and scene, respectively. An image correlation process is carried out in order to search for the best match between them. Given that this procedure can be implemented in an optical correlator, the correlation could eventually be accomplished at light speed. This paper shows an initial research stage where results were "digitally" obtained by simulating an optical correlation of DNA sequences represented as images. A total of 303 queries (variable lengths from 50 to 4500 base pairs) and 100 scenes represented by 100 x 100 images each (in total, one million base pair database) were considered for the image correlation analysis. The results showed that correlations reached very high sensitivity (99.01%), specificity (98.99%) and outperformed BLAST when mutation numbers increased. However, digital correlation processes were hundred times slower than BLAST. We are currently starting an initiative to evaluate the correlation speed process of a real experimental optical correlator. By doing this, we expect to fully exploit optical correlation light properties. As the optical correlator works jointly with the computer, digital algorithms should also be optimized. The results presented in this paper are encouraging and support the study of image correlation methods on sequence alignment. PMID:22761742

  16. DNA sequence alignment by microhomology sampling during homologous recombination

    PubMed Central

    Qi, Zhi; Redding, Sy; Lee, Ja Yil; Gibb, Bryan; Kwon, YoungHo; Niu, Hengyao; Gaines, William A.; Sung, Patrick

    2015-01-01

    Summary Homologous recombination (HR) mediates the exchange of genetic information between sister or homologous chromatids. During HR, members of the RecA/Rad51 family of recombinases must somehow search through vast quantities of DNA sequence to align and pair ssDNA with a homologous dsDNA template. Here we use single-molecule imaging to visualize Rad51 as it aligns and pairs homologous DNA sequences in real-time. We show that Rad51 uses a length-based recognition mechanism while interrogating dsDNA, enabling robust kinetic selection of 8-nucleotide (nt) tracts of microhomology, which kinetically confines the search to sites with a high probability of being a homologous target. Successful pairing with a 9th nucleotide coincides with an additional reduction in binding free energy and subsequent strand exchange occurs in precise 3-nt steps, reflecting the base triplet organization of the presynaptic complex. These findings provide crucial new insights into the physical and evolutionary underpinnings of DNA recombination. PMID:25684365

  17. Regular language constrained sequence alignment revisited.

    PubMed

    Kucherov, Gregory; Pinhas, Tamar; Ziv-Ukelson, Michal

    2011-05-01

    Imposing constraints in the form of a finite automaton or a regular expression is an effective way to incorporate additional a priori knowledge into sequence alignment procedures. With this motivation, the Regular Expression Constrained Sequence Alignment Problem was introduced, which proposed an O(n²t⁴) time and O(n²t²) space algorithm for solving it, where n is the length of the input strings and t is the number of states in the input non-deterministic automaton. A faster O(n²t³) time algorithm for the same problem was subsequently proposed. In this article, we further speed up the algorithms for Regular Language Constrained Sequence Alignment by reducing their worst case time complexity bound to O(n²t³)/log t). This is done by establishing an optimal bound on the size of Straight-Line Programs solving the maxima computation subproblem of the basic dynamic programming algorithm. We also study another solution based on a Steiner Tree computation. While it does not improve the worst case, our simulations show that both approaches are efficient in practice, especially when the input automata are dense. PMID:21554020

  18. Multiple Sequence Alignment Based on Chaotic PSO

    NASA Astrophysics Data System (ADS)

    Lei, Xiu-Juan; Sun, Jing-Jing; Ma, Qian-Zhi

    This paper introduces a new improved algorithm called chaotic PSO (CPSO) based on the thought of chaos optimization to solve multiple sequence alignment. For one thing, the chaotic variables are generated between 0 and 1 when initializing the population so that the particles are distributed uniformly in the solution space. For another thing, the chaotic sequences are generated using the Logistic mapping function in order to make chaotic search and strengthen the diversity of the population. The simulation results of several benchmark data sets of BAliBase show that the improved algorithm is effective and has good performances for the data sets with different similarity.

  19. The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment

    PubMed Central

    Altschul, Stephen F.; Wootton, John C.; Zaslavsky, Elena; Yu, Yi-Kuo

    2010-01-01

    Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is selecting a set of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct “BILD” (“Bayesian Integral Log-odds”) substitution scores from prior distributions describing columns of related letters. This approach has been used previously only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform the decision of whether to include a sequence in a multiple alignment, and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies, and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD score based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum. PMID:20657661

  20. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification

    PubMed Central

    Borozan, Ivan; Watt, Stuart; Ferretti, Vincent

    2015-01-01

    Motivation: Alignment-based sequence similarity searches, while accurate for some type of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25573913

  1. Multiple sequence alignment: a major challenge to large-scale phylogenetics

    PubMed Central

    Liu, Kevin; Linder, C. Randal; Warnow, Tandy

    2011-01-01

    Over the last decade, dramatic advances have been made in developing methods for large-scale phylogeny estimation, so that it is now feasible for investigators with moderate computational resources to obtain reasonable solutions to maximum likelihood and maximum parsimony, even for datasets with a few thousand sequences. There has also been progress on developing methods for multiple sequence alignment, so that greater alignment accuracy (and subsequent improvement in phylogenetic accuracy) is now possible through automated methods. However, these methods have not been tested under conditions that reflect properties of datasets confronted by large-scale phylogenetic estimation projects. In this paper we report on a study that compares several alignment methods on a benchmark collection of nucleotide sequence datasets of up to 78,132 sequences. We show that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases. Furthermore, the most accurate alignment methods are unable to analyze the very largest datasets we studied, so that only moderately accurate alignment methods can be used on the largest datasets. As a result, alignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates. Therefore, the estimation of highly accurate multiple sequence alignments is a major challenge for Tree of Life projects, and more generally for large-scale systematics studies. PMID:21113338

  2. DNA Sequence Alignment during Homologous Recombination.

    PubMed

    Greene, Eric C

    2016-05-27

    Homologous recombination allows for the regulated exchange of genetic information between two different DNA molecules of identical or nearly identical sequence composition, and is a major pathway for the repair of double-stranded DNA breaks. A key facet of homologous recombination is the ability of recombination proteins to perfectly align the damaged DNA with homologous sequence located elsewhere in the genome. This reaction is referred to as the homology search and is akin to the target searches conducted by many different DNA-binding proteins. Here I briefly highlight early investigations into the homology search mechanism, and then describe more recent research. Based on these studies, I summarize a model that includes a combination of intersegmental transfer, short-distance one-dimensional sliding, and length-specific microhomology recognition to efficiently align DNA sequences during the homology search. I also suggest some future directions to help further our understanding of the homology search. Where appropriate, I direct the reader to other recent reviews describing various issues related to homologous recombination. PMID:27129270

  3. Nucleotide sequence of SHV-2 beta-lactamase gene

    SciTech Connect

    Garbarg-Chenon, A.; Godard, V.; Labia, R.; Nicolas, J.C. )

    1990-07-01

    The nucleotide sequence of plasmid-mediated beta-lactamase SHV-2 from Salmonella typhimurium (SHV-2pHT1) was determined. The gene was very similar to chromosomally encoded beta-lactamase LEN-1 of Klebsiella pneumoniae. Compared with the sequence of the Escherichia coli SHV-2 enzyme (SHV-2E.coli) obtained by protein sequencing, the deduced amino acid sequence of SHV-2pHT1 differed by three amino acid substitutions.

  4. Viewing multiple sequence alignments with the JavaScript Sequence Alignment Viewer (JSAV).

    PubMed

    Martin, Andrew C R

    2014-01-01

    The JavaScript Sequence Alignment Viewer (JSAV) is designed as a simple-to-use JavaScript component for displaying sequence alignments on web pages. The display of sequences is highly configurable with options to allow alternative coloring schemes, sorting of sequences and 'dotifying' repeated amino acids. An option is also available to submit selected sequences to another web site, or to other JavaScript code. JSAV is implemented purely in JavaScript making use of the JQuery and JQuery-UI libraries. It does not use any HTML5-specific options to help with browser compatibility. The code is documented using JSDOC and is available from http://www.bioinf.org.uk/software/jsav/. PMID:25653836

  5. Viewing multiple sequence alignments with the JavaScript Sequence Alignment Viewer (JSAV)

    PubMed Central

    Martin, Andrew C. R.

    2014-01-01

    The JavaScript Sequence Alignment Viewer (JSAV) is designed as a simple-to-use JavaScript component for displaying sequence alignments on web pages. The display of sequences is highly configurable with options to allow alternative coloring schemes, sorting of sequences and ’dotifying’ repeated amino acids. An option is also available to submit selected sequences to another web site, or to other JavaScript code. JSAV is implemented purely in JavaScript making use of the JQuery and JQuery-UI libraries. It does not use any HTML5-specific options to help with browser compatibility. The code is documented using JSDOC and is available from http://www.bioinf.org.uk/software/jsav/. PMID:25653836

  6. Reading biological processes from nucleotide sequences

    NASA Astrophysics Data System (ADS)

    Murugan, Anand

    Cellular processes have traditionally been investigated by techniques of imaging and biochemical analysis of the molecules involved. The recent rapid progress in our ability to manipulate and read nucleic acid sequences gives us direct access to the genetic information that directs and constrains biological processes. While sequence data is being used widely to investigate genotype-phenotype relationships and population structure, here we use sequencing to understand biophysical mechanisms. We present work on two different systems. First, in chapter 2, we characterize the stochastic genetic editing mechanism that produces diverse T-cell receptors in the human immune system. We do this by inferring statistical distributions of the underlying biochemical events that generate T-cell receptor coding sequences from the statistics of the observed sequences. This inferred model quantitatively describes the potential repertoire of T-cell receptors that can be produced by an individual, providing insight into its potential diversity and the probability of generation of any specific T-cell receptor. Then in chapter 3, we present work on understanding the functioning of regulatory DNA sequences in both prokaryotes and eukaryotes. Here we use experiments that measure the transcriptional activity of large libraries of mutagenized promoters and enhancers and infer models of the sequence-function relationship from this data. For the bacterial promoter, we infer a physically motivated 'thermodynamic' model of the interaction of DNA-binding proteins and RNA polymerase determining the transcription rate of the downstream gene. For the eukaryotic enhancers, we infer heuristic models of the sequence-function relationship and use these models to find synthetic enhancer sequences that optimize inducibility of expression. Both projects demonstrate the utility of sequence information in conjunction with sophisticated statistical inference techniques for dissecting underlying biophysical

  7. Update of AMmtDB: a database of multi-aligned Metazoa mitochondrial DNA sequences.

    PubMed

    Lanave, Cecilia; Licciulli, Flavio; De Robertis, Mariateresa; Marolla, Alessandra; Attimonelli, Marcella

    2002-01-01

    The AMmtDB database (http://bighost.area.ba.cnr.it/mitochondriome) has been updated by collecting the multi-aligned sequences of Chordata and Invertebrata mitochondrial genes coding for proteins and tRNAs. Links to the multi-aligned mtDNA intraspecies variants, collected in VarMmtDB at the Mitochondriome web site, have been introduced. The genes coding for proteins are multi-aligned based on the translated sequences and both the nucleotide and amino acid multi-alignments are provided. AMmtDB data selected through SRS can be viewed and managed using GeneDoc or other programs for the management of multi-aligned data depending on the user's operative system. The multiple alignments have been produced with CLUSTALW and PILEUP programs and then carefully optimized manually. PMID:11752285

  8. Update of AMmtDB: a database of multi-aligned Metazoa mitochondrial DNA sequences

    PubMed Central

    Lanave, Cecilia; Licciulli, Flavio; De Robertis, Mariateresa; Marolla, Alessandra; Attimonelli, Marcella

    2002-01-01

    The AMmtDB database (http://bighost.area.ba.cnr.it/mitochondriome) has been updated by collecting the multi-aligned sequences of Chordata and Invertebrata mitochondrial genes coding for proteins and tRNAs. Links to the multi-aligned mtDNA intraspecies variants, collected in VarMmtDB at the Mitochondriome web site, have been introduced. The genes coding for proteins are multi-aligned based on the translated sequences and both the nucleotide and amino acid multi-alignments are provided. AMmtDB data selected through SRS can be viewed and managed using GeneDoc or other programs for the management of multi-aligned data depending on the user’s operative system. The multiple alignments have been produced with CLUSTALW and PILEUP programs and then carefully optimized manually. PMID:11752285

  9. DUC-Curve, a highly compact 2D graphical representation of DNA sequences and its application in sequence alignment

    NASA Astrophysics Data System (ADS)

    Li, Yushuang; Liu, Qian; Zheng, Xiaoqi

    2016-08-01

    A highly compact and simple 2D graphical representation of DNA sequences, named DUC-Curve, is constructed through mapping four nucleotides to a unit circle with a cyclic order. DUC-Curve could directly detect nucleotide, di-nucleotide compositions and microsatellite structure from DNA sequences. Moreover, it also could be used for DNA sequence alignment. Taking geometric center vectors of DUC-Curves as sequence descriptor, we perform similarity analysis on the first exons of β-globin genes of 11 species, oncogene TP53 of 27 species and twenty-four Influenza A viruses, respectively. The obtained reasonable results illustrate that the proposed method is very effective in sequence comparison problems, and will at least play a complementary role in classification and clustering problems.

  10. PSAR: measuring multiple sequence alignment reliability by probabilistic sampling

    PubMed Central

    Kim, Jaebum; Ma, Jian

    2011-01-01

    Multiple sequence alignment, which is of fundamental importance for comparative genomics, is a difficult problem and error-prone. Therefore, it is essential to measure the reliability of the alignments and incorporate it into downstream analyses. We propose a new probabilistic sampling-based alignment reliability (PSAR) score. Instead of relying on heuristic assumptions, such as the correlation between alignment quality and guide tree uncertainty in progressive alignment methods, we directly generate suboptimal alignments from an input multiple sequence alignment by a probabilistic sampling method, and compute the agreement of the input alignment with the suboptimal alignments as the alignment reliability score. We construct the suboptimal alignments by an approximate method that is based on pairwise comparisons between each single sequence and the sub-alignment of the input alignment where the chosen sequence is left out. By using simulation-based benchmarks, we find that our approach is superior to existing ones, supporting that the suboptimal alignments are highly informative source for assessing alignment reliability. We apply the PSAR method to the alignments in the UCSC Genome Browser to measure the reliability of alignments in different types of regions, such as coding exons and conserved non-coding regions, and use it to guide cross-species conservation study. PMID:21576232

  11. The nucleotide sequence of cowpea mosaic virus B RNA

    PubMed Central

    Lomonossoff, G.P.; Shanks, M.

    1983-01-01

    The complete sequence of the bottom component RNA (B RNA) of cowpea mosaic virus (CPMV) has been determined. Restriction enzyme fragments of double-stranded cDNA were cloned in M13 and the sequence of the inserts was determined by a combination of enzymatic and chemical sequencing techniques. Additional sequence information was obtained by primed synthesis on first strand cDNA. The complete sequence deduced is 5889 nucleotides long excluding the 3' poly(A), and contains an open reading frame sufficient to code for a polypeptide of mol. wt. 207 760. The coding region is flanked by a 5' leader sequence of 206 nucleotides and a 3' non-coding region of 82 residues which does not contain a polyadenylation signal. PMID:16453487

  12. PROMALS web server for accurate multiple protein sequence alignments.

    PubMed

    Pei, Jimin; Kim, Bong-Hyun; Tang, Ming; Grishin, Nick V

    2007-07-01

    Multiple sequence alignments are essential in homology inference, structure modeling, functional prediction and phylogenetic analysis. We developed a web server that constructs multiple protein sequence alignments using PROMALS, a progressive method that improves alignment quality by using additional homologs from PSI-BLAST searches and secondary structure predictions from PSIPRED. PROMALS shows higher alignment accuracy than other advanced methods, such as MUMMALS, ProbCons, MAFFT and SPEM. The PROMALS web server takes FASTA format protein sequences as input. The output includes a colored alignment augmented with information about sequence grouping, predicted secondary structures and positional conservation. The PROMALS web server is available at: http://prodata.swmed.edu/promals/ PMID:17452345

  13. Blasting and Zipping: Sequence Alignment and Mutual Information

    NASA Astrophysics Data System (ADS)

    Penner, Orion; Grassberger, Peter; Paczuski, Maya

    2009-03-01

    Alignment of biological sequences such as DNA, RNA or proteins is one of the most widely used tools in computational bioscience. While the accomplishments of sequence alignment algorithms are undeniable the fact remains that these algorithms are based upon heuristic scoring schemes. Therefore, these algorithms do not provide model independent and objective measures for how similar two (or more) sequences actually are. Although information theory provides such a similarity measure - the mutual information (MI) - numerous previous attempts to connect sequence alignment and information have not produced realistic estimates for the MI from a given alignment. We report on a simple and flexible approach to get robust estimates of MI from global alignments. The presented results may help establish MI as a reliable tool for evaluating the quality of global alignments, judging the relative merits of different alignment algorithms, and estimating the significance of specific alignments.

  14. Cover song identification by sequence alignment algorithms

    NASA Astrophysics Data System (ADS)

    Wang, Chih-Li; Zhong, Qian; Wang, Szu-Ying; Roychowdhury, Vwani

    2011-10-01

    Content-based music analysis has drawn much attention due to the rapidly growing digital music market. This paper describes a method that can be used to effectively identify cover songs. A cover song is a song that preserves only the crucial melody of its reference song but different in some other acoustic properties. Hence, the beat/chroma-synchronous chromagram, which is insensitive to the variation of the timber or rhythm of songs but sensitive to the melody, is chosen. The key transposition is achieved by cyclically shifting the chromatic domain of the chromagram. By using the Hidden Markov Model (HMM) to obtain the time sequences of songs, the system is made even more robust. Similar structure or length between the cover songs and its reference are not necessary by the Smith-Waterman Alignment Algorithm.

  15. Update of AMmtDB: a database of multi-aligned Metazoa mitochondrial DNA sequences

    PubMed Central

    Lanave, Cecilia; Liuni, Sabino; Licciulli, Flavio; Attimonelli, Marcella

    2000-01-01

    The AMmtDB database (http://bio-www.ba.cnr.it:8000/srs6/ ) has been updated by collecting the multi-aligned sequences of Chordata mitochondrial genes coding for proteins and tRNAs. The genes coding for proteins are multi-aligned based on the translated sequences and both the nucleotide and amino acid multi-alignments are provided. AMmtDB data selected through SRS can be viewed and managed using GeneDoc or other programs for the management of multi-aligned data depending on the user’s operative system. The multiple alignments have been produced with CLUSTALW and PILEUP programs and then carefully optimized manually. PMID:10592208

  16. DNA sequence representation by trianders and determinative degree of nucleotides

    PubMed Central

    Duplij, Diana; Duplij, Steven

    2005-01-01

    A new version of DNA walks, where nucleotides are regarded unequal in their contribution to a walk is introduced, which allows us to study thoroughly the “fine structure” of nucleotide sequences. The approach is based on the assumption that nucleotides have an inner abstract characteristic, the determinative degree, which reflects genetic code phenomenological properties and is adjusted to nucleotides physical properties. We consider each codon position independently, which gives three separate walks characterized by different angles and lengths, and that such an object is called triander which reflects the “strength” of branch. A general method for identifying DNA sequence “by triander” which can be treated as a unique “genogram” (or “gene passport”) is proposed. The two- and three-dimensional trianders are considered. The difference of sequences fine structure in genes and the intergenic space is shown. A clear triplet signal in coding sequences was found which is absent in the intergenic space and is independent from the sequence length. This paper presents the topological classification of trianders which can allow us to provide a detailed working out signatures of functionally different genomic regions. PMID:16052707

  17. On Quantum Algorithm for Multiple Alignment of Amino Acid Sequences

    NASA Astrophysics Data System (ADS)

    Iriyama, Satoshi; Ohya, Masanori

    2009-02-01

    The alignment of genome sequences or amino acid sequences is one of fundamental operations for the study of life. Usual computational complexity for the multiple alignment of N sequences with common length L by dynamic programming is O(LN). This alignment is considered as one of the NP problems, so that it is desirable to find a nice algorithm of the multiple alignment. Thus in this paper we propose the quantum algorithm for the multiple alignment based on the works12,1,2 in which the NP complete problem was shown to be the P problem by means of quantum algorithm and chaos information dynamics.

  18. MULTAN: a program to align multiple DNA sequences.

    PubMed Central

    Bains, W

    1986-01-01

    I describe a computer program which can align a large number of nucleic acid sequences with one another. The program uses an heuristic, iterative algorithm which has been tested extensively, and is found to produce useful alignments of a variety of sequence families. The algorithm is fast enough to be practical for the analysis of large number of sequences, and is implemented in a program which contains a variety of other functions to facilitate the analysis of the aligned result. PMID:3003672

  19. Complete nucleotide sequence of Nootka lupine vein-clearing virus

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The complete genome sequence of Nootka lupine vein-clearing virus (NLVCV) was determined to be 4,172 nucleotides in length containing four open reading frames ORFs with a similar genetic organization and conceptual translations of virus species in the genus Carmovirus, family Tombusviridae. The orde...

  20. Method for the detection of specific nucleic acid sequences by polymerase nucleotide incorporation

    DOEpatents

    Castro, Alonso

    2004-06-01

    A method for rapid and efficient detection of a target DNA or RNA sequence is provided. A primer having a 3'-hydroxyl group at one end and having a sequence of nucleotides sufficiently homologous with an identifying sequence of nucleotides in the target DNA is selected. The primer is hybridized to the identifying sequence of nucleotides on the DNA or RNA sequence and a reporter molecule is synthesized on the target sequence by progressively binding complementary nucleotides to the primer, where the complementary nucleotides include nucleotides labeled with a fluorophore. Fluorescence emitted by fluorophores on single reporter molecules is detected to identify the target DNA or RNA sequence.

  1. Nucleotide sequence and genome organization of tomato leaf curl geminivirus.

    PubMed

    Dry, I B; Rigden, J E; Krake, L R; Mullineaux, P M; Rezaian, M A

    1993-01-01

    The genome of tomato leaf curl virus (TLCV) from Australia was cloned and its complete nucleotide sequence determined. It is a single circular ssDNA of 2766 nucleotides containing the consensus nonanucleotide sequence present in all geminiviruses. It has six open reading frames with an organization resembling that of certain other dicotyledonous plant-infecting monopartite geminiviruses, i.e. tomato yellow leaf curl and beet curly top viruses. The regulatory sequences present indicate a bidirectional mode of transcription. A dimeric TLCV DNA clone was constructed in a binary vector and used to agroinoculate three different host species. Typical virus infections were produced, confirming that the single DNA component is sufficient for infectivity. PMID:8423446

  2. R3D-2-MSA: the RNA 3D structure-to-multiple sequence alignment server.

    PubMed

    Cannone, Jamie J; Sweeney, Blake A; Petrov, Anton I; Gutell, Robin R; Zirbel, Craig L; Leontis, Neocles

    2015-07-01

    The RNA 3D Structure-to-Multiple Sequence Alignment Server (R3D-2-MSA) is a new web service that seamlessly links RNA three-dimensional (3D) structures to high-quality RNA multiple sequence alignments (MSAs) from diverse biological sources. In this first release, R3D-2-MSA provides manual and programmatic access to curated, representative ribosomal RNA sequence alignments from bacterial, archaeal, eukaryal and organellar ribosomes, using nucleotide numbers from representative atomic-resolution 3D structures. A web-based front end is available for manual entry and an Application Program Interface for programmatic access. Users can specify up to five ranges of nucleotides and 50 nucleotide positions per range. The R3D-2-MSA server maps these ranges to the appropriate columns of the corresponding MSA and returns the contents of the columns, either for display in a web browser or in JSON format for subsequent programmatic use. The browser output page provides a 3D interactive display of the query, a full list of sequence variants with taxonomic information and a statistical summary of distinct sequence variants found. The output can be filtered and sorted in the browser. Previous user queries can be viewed at any time by resubmitting the output URL, which encodes the search and re-generates the results. The service is freely available with no login requirement at http://rna.bgsu.edu/r3d-2-msa. PMID:26048960

  3. R3D-2-MSA: the RNA 3D structure-to-multiple sequence alignment server

    PubMed Central

    Cannone, Jamie J.; Sweeney, Blake A.; Petrov, Anton I.; Gutell, Robin R.; Zirbel, Craig L.; Leontis, Neocles

    2015-01-01

    The RNA 3D Structure-to-Multiple Sequence Alignment Server (R3D-2-MSA) is a new web service that seamlessly links RNA three-dimensional (3D) structures to high-quality RNA multiple sequence alignments (MSAs) from diverse biological sources. In this first release, R3D-2-MSA provides manual and programmatic access to curated, representative ribosomal RNA sequence alignments from bacterial, archaeal, eukaryal and organellar ribosomes, using nucleotide numbers from representative atomic-resolution 3D structures. A web-based front end is available for manual entry and an Application Program Interface for programmatic access. Users can specify up to five ranges of nucleotides and 50 nucleotide positions per range. The R3D-2-MSA server maps these ranges to the appropriate columns of the corresponding MSA and returns the contents of the columns, either for display in a web browser or in JSON format for subsequent programmatic use. The browser output page provides a 3D interactive display of the query, a full list of sequence variants with taxonomic information and a statistical summary of distinct sequence variants found. The output can be filtered and sorted in the browser. Previous user queries can be viewed at any time by resubmitting the output URL, which encodes the search and re-generates the results. The service is freely available with no login requirement at http://rna.bgsu.edu/r3d-2-msa. PMID:26048960

  4. Local alignment of two-base encoded DNA sequence

    PubMed Central

    Homer, Nils; Merriman, Barry; Nelson, Stanley F

    2009-01-01

    Background DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity. Results We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions. Conclusion The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, and facilitating genome re-sequencing efforts based on this form of sequence data. PMID:19508732

  5. Complete nucleotide sequence of the nucleoprotein gene of influenza B virus.

    PubMed Central

    Londo, D R; Davis, A R; Nayak, D P

    1983-01-01

    A DNA copy of influenza B/Singapore/222/79 viral RNA segment 5, containing the gene coding for the nucleoprotein (NP), has been cloned in Escherichia coli plasmid pBR322, and its nucleotide sequence has been determined. The influenza B NP gene contains 1,839 nucleotides and codes for a protein of 560 amino acids with a molecular weight of 61,593. Comparison of the influenza B NP amino acid sequence with that of influenza A NP (A/PR/8/34) reveals 37% direct homology in the aligned regions, indicating a common ancestor. However, influenza B NP has an additional 50 amino acids at its N-terminal end. As is the case with influenza A NP, influenza B NP is a basic protein, with its charged residues relatively evenly distributed rather than clustered. The structural homology suggests functional similarity between the NP of influenza A and B viruses. PMID:6688639

  6. Update of AMmtDB: a database of multi-aligned metazoa mitochondrial DNA sequences.

    PubMed

    Lanave, C; Attimonelli, M; De Robertis, M; Licciulli, F; Liuni, S; Sbisá, E; Saccone, C

    1999-01-01

    The present paper describes AMmtDB, a database collecting the multi-aligned sequences of vertebrate mitochondrial genes coding for proteins and tRNAs, as well as the multiple alignment of the mammalian mtDNA main regulatory region (D-loop) sequences. The genes coding for proteins are multi-aligned based on the translated sequences and both the nucleotide and amino acid multi-alignments are provided. As far as the genes coding for tRNAs are concerned, the multi-alignments based on the primary and the secondary structures are both provided; for the mammalian D-loop multi-alignments we report the conserved regions of the entire D-loop (CSB1, CSB2, CSB3, the central region, ETAS1 and ETAS2) as defined by Sbisà et al. [ Gene (1997), 205, 125-140). A flatfile format for AMmtDB has been designed allowing its implementation in SRS (http://bio-www.ba.cnr.it:8000/BioWWW/#AMMTDB ). Data selected through SRS can be managed using GeneDoc or other programs for the management of multi-aligned data depending on the user's operative system. The multiple alignments have been produced with CLUSTALV and PILEUP programs and then carefully optimized manually. PMID:9847158

  7. A Phylogenetic Analysis of the Brassicales Clade Based on an Alignment-Free Sequence Comparison Method

    PubMed Central

    Hatje, Klas; Kollmar, Martin

    2012-01-01

    Phylogenetic analyses reveal the evolutionary derivation of species. A phylogenetic tree can be inferred from multiple sequence alignments of proteins or genes. The alignment of whole genome sequences of higher eukaryotes is a computational intensive and ambitious task as is the computation of phylogenetic trees based on these alignments. To overcome these limitations, we here used an alignment-free method to compare genomes of the Brassicales clade. For each nucleotide sequence a Chaos Game Representation (CGR) can be computed, which represents each nucleotide of the sequence as a point in a square defined by the four nucleotides as vertices. Each CGR is therefore a unique fingerprint of the underlying sequence. If the CGRs are divided by grid lines each grid square denotes the occurrence of oligonucleotides of a specific length in the sequence (Frequency Chaos Game Representation, FCGR). Here, we used distance measures between FCGRs to infer phylogenetic trees of Brassicales species. Three types of data were analyzed because of their different characteristics: (A) Whole genome assemblies as far as available for species belonging to the Malvidae taxon. (B) EST data of species of the Brassicales clade. (C) Mitochondrial genomes of the Rosids branch, a supergroup of the Malvidae. The trees reconstructed based on the Euclidean distance method are in general agreement with single gene trees. The Fitch–Margoliash and Neighbor joining algorithms resulted in similar to identical trees. Here, for the first time we have applied the bootstrap re-sampling concept to trees based on FCGRs to determine the support of the branchings. FCGRs have the advantage that they are fast to calculate, and can be used as additional information to alignment based data and morphological characteristics to improve the phylogenetic classification of species in ambiguous cases. PMID:22952468

  8. DNA Align Editor: DNA Alignment Editor Tool

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The SNPAlignEditor is a DNA sequence alignment editor that runs on Windows platforms. The purpose of the program is to provide an intuitive, user-friendly tool for manual editing of multiple sequence alignments by providing functions for input, editing, and output of nucleotide sequence alignments....

  9. WebGMAP: a web service for mapping and aligning cDNA sequences to genomes

    PubMed Central

    Liang, Chun; Liu, Lin; Ji, Guoli

    2009-01-01

    The genomes of thousands of organisms are being sequenced, often with accompanying sequences of cDNAs or ESTs. One of the great challenges in bioinformatics is to make these genomic sequences and genome annotations accessible in a user-friendly manner to general biologists to address interesting biological questions. We have created an open-access web service called WebGMAP (http://www.bioinfolab.org/software/webgmap) that seamlessly integrates cDNA-genome alignment tools, such as GMAP, with easy-to-use data visualization and mining tools. This web service is intended to facilitate community efforts in improving genome annotation, determining accurate gene structures and their variations, and exploring important biological processes such as alternative splicing and alternative polyadenylation. For routine sequence analysis, WebGMAP provides a web-based sequence viewer with many useful functions, including nucleotide positioning, six-frame translations, sequence reverse complementation, and imperfect motif detection and alignment. WebGMAP also provides users with the ability to sort, filter and search for individual cDNA sequences and cDNA-genome alignments. Our EST-Genome-Browser can display annotated gene structures and cDNA-genome alignments at scales from 100 to 50 000 nt. With its ability to highlight base differences between query cDNAs and the genome, our EST-Genome-Browser allows biologists to discover potential point or insertion-deletion variations from cDNA-genome alignments. PMID:19465381

  10. The primary nucleotide sequence of U4 RNA.

    PubMed

    Reddy, R; Henning, D; Busch, H

    1981-04-10

    U4 RNA is one of the "capped" nuclear snRNAs recently found to be precipitable by anti-Sm antibodies as ribonucleoprotein particles. U4 RNA, along with other snRNAs, has been implicated in hnRNA processing, mRNA transport, or both (Lerner, M. R., Boyle, J., Mount, S., Wolin, S., and Steitz, J. A. (1980) Nature 283, 220-224). Since the proteins bound to different snRNAs appear to be the same, the functions of different snRNPs might be dependent on the RNA components. To help understand the function of U4 RNP, the nucleotide sequence of U4 RNA was determined. The sequence is (formula see text) In addition to the modified nucleotides in the "cap," U4 RNA contains Am at position 63 and m6A at position 98. It also exhibited A-C microheterogeneity at position 97. PMID:6162848

  11. Nucleotide-Specific Contrast for DNA Sequencing by Electron Spectroscopy

    PubMed Central

    Schmid, Andreas K.; Davis, Ronald W.

    2016-01-01

    DNA sequencing by imaging in an electron microscope is an approach that holds promise to deliver long reads with low error rates and without the need for amplification. Earlier work using transmission electron microscopes, which use high electron energies on the order of 100 keV, has shown that low contrast and radiation damage necessitates the use of heavy atom labeling of individual nucleotides, which increases the read error rates. Other prior work using scattering electrons with much lower energy has shown to suppress beam damage on DNA. Here we explore possibilities to increase contrast by employing two methods, X-ray photoelectron and Auger electron spectroscopy. Using bulk DNA samples with monomers of each base, both methods are shown to provide contrast mechanisms that can distinguish individual nucleotides without labels. Both spectroscopic techniques can be readily implemented in a low energy electron microscope, which may enable label-free DNA sequencing by direct imaging. PMID:27149617

  12. Probabilistic sequence alignment of stratigraphic records

    NASA Astrophysics Data System (ADS)

    Lin, Luan; Khider, Deborah; Lisiecki, Lorraine E.; Lawrence, Charles E.

    2014-10-01

    The assessment of age uncertainty in stratigraphically aligned records is a pressing need in paleoceanographic research. The alignment of ocean sediment cores is used to develop mutually consistent age models for climate proxies and is often based on the δ18O of calcite from benthic foraminifera, which records a global ice volume and deep water temperature signal. To date, δ18O alignment has been performed by manual, qualitative comparison or by deterministic algorithms. Here we present a hidden Markov model (HMM) probabilistic algorithm to find 95% confidence bands for δ18O alignment. This model considers the probability of every possible alignment based on its fit to the δ18O data and transition probabilities for sedimentation rate changes obtained from radiocarbon-based estimates for 37 cores. Uncertainty is assessed using a stochastic back trace recursion to sample alignments in exact proportion to their probability. We applied the algorithm to align 35 late Pleistocene records to a global benthic δ18O stack and found that the mean width of 95% confidence intervals varies between 3 and 23 kyr depending on the resolution and noisiness of the record's δ18O signal. Confidence bands within individual cores also vary greatly, ranging from ~0 to >40 kyr. These alignment uncertainty estimates will allow researchers to examine the robustness of their conclusions, including the statistical evaluation of lead-lag relationships between events observed in different cores.

  13. 77 FR 65537 - Requirements for Patent Applications Containing Nucleotide Sequence and/or Amino Acid Sequence...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-10-29

    ... Amino Acid Sequence Disclosures ACTION: Proposed collection; comment request. SUMMARY: The United States....'' SUPPLEMENTARY INFORMATION: I. Abstract Patent applications that contain nucleotide and/or amino acid sequence disclosures must include a copy of the sequence listing in accordance with the requirements in 37 CFR...

  14. Quantifying the Displacement of Mismatches in Multiple Sequence Alignment Benchmarks

    PubMed Central

    Bawono, Punto; van der Velde, Arjan; Abeln, Sanne; Heringa, Jaap

    2015-01-01

    Multiple Sequence Alignment (MSA) methods are typically benchmarked on sets of reference alignments. The quality of the alignment can then be represented by the sum-of-pairs (SP) or column (CS) scores, which measure the agreement between a reference and corresponding query alignment. Both the SP and CS scores treat mismatches between a query and reference alignment as equally bad, and do not take the separation into account between two amino acids in the query alignment, that should have been matched according to the reference alignment. This is significant since the magnitude of alignment shifts is often of relevance in biological analyses, including homology modeling and MSA refinement/manual alignment editing. In this study we develop a new alignment benchmark scoring scheme, SPdist, that takes the degree of discordance of mismatches into account by measuring the sequence distance between mismatched residue pairs in the query alignment. Using this new score along with the standard SP score, we investigate the discriminatory behavior of the new score by assessing how well six different MSA methods perform with respect to BAliBASE reference alignments. The SP score and the SPdist score yield very similar outcomes when the reference and query alignments are close. However, for more divergent reference alignments the SPdist score is able to distinguish between methods that keep alignments approximately close to the reference and those exhibiting larger shifts. We observed that by using SPdist together with SP scoring we were able to better delineate the alignment quality difference between alternative MSA methods. With a case study we exemplify why it is important, from a biological perspective, to consider the separation of mismatches. The SPdist scoring scheme has been implemented in the VerAlign web server (http://www.ibi.vu.nl/programs/veralignwww/). The code for calculating SPdist score is also available upon request. PMID:25993129

  15. Complete nucleotide sequence of Saccharomyces cerevisiae chromosome X.

    PubMed Central

    Galibert, F; Alexandraki, D; Baur, A; Boles, E; Chalwatzis, N; Chuat, J C; Coster, F; Cziepluch, C; De Haan, M; Domdey, H; Durand, P; Entian, K D; Gatius, M; Goffeau, A; Grivell, L A; Hennemann, A; Herbert, C J; Heumann, K; Hilger, F; Hollenberg, C P; Huang, M E; Jacq, C; Jauniaux, J C; Katsoulou, C; Karpfinger-Hartl, L

    1996-01-01

    The complete nucleotide sequence of Saccharomyces cerevisiae chromosome X (745 442 bp) reveals a total of 379 open reading frames (ORFs), the coding region covering approximately 75% of the entire sequence. One hundred and eighteen ORFs (31%) correspond to genes previously identified in S. cerevisiae. All other ORFs represent novel putative yeast genes, whose function will have to be determined experimentally. However, 57 of the latter subset (another 15% of the total) encode proteins that show significant analogy to proteins of known function from yeast or other organisms. The remaining ORFs, exhibiting no significant similarity to any known sequence, amount to 54% of the total. General features of chromosome X are also reported, with emphasis on the nucleotide frequency distribution in the environment of the ATG and stop codons, the possible coding capacity of at least some of the small ORFs (<100 codons) and the significance of 46 non-canonical or unpaired nucleotides in the stems of some of the 24 tRNA genes recognized on this chromosome. Images PMID:8641269

  16. Large-scale detection and application of expressed sequence tag single nucleotide polymorphisms in Nicotiana.

    PubMed

    Wang, Y; Zhou, D; Wang, S; Yang, L

    2015-01-01

    Single nucleotide polymorphisms (SNPs) are widespread in the Nicotiana genome. Using an alignment and variation detection method, we developed 20,607,973 SNPs, based on the expressed sequence tag sequences of 10 Nicotiana species. The replacement rate was much higher than the transversion rate in the SNPs, and SNPs widely exist in the Nicotiana. In vitro verification indicated that all of the SNPs were high quality and accurate. Evolutionary relationships between 15 varieties were investigated by polymerase chain reaction with a special primer; the specific 302 locus of these sequence results clearly indicated the origin of Zhongyan 100. A database of Nicotiana SNPs (NSNP) was developed to store and search for SNPs in Nicotiana. NSNP is a tool for researchers to develop SNP markers of sequence data. PMID:26214460

  17. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm

    PubMed Central

    Kumar, Manish

    2015-01-01

    One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic of multiple sequence alignment problems is to determine the most biologically plausible alignments of protein or DNA sequences. In this paper, an alignment method using genetic algorithm for multiple sequence alignment has been proposed. Two different genetic operators mainly crossover and mutation were defined and implemented with the proposed method in order to know the population evolution and quality of the sequence aligned. The proposed method is assessed with protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8 etc. Experiments on a wide range of data have shown that the proposed algorithm is much better (it terms of score) than previously proposed algorithms in its ability to achieve high alignment quality. PMID:27065770

  18. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm.

    PubMed

    Kumar, Manish

    2015-01-01

    One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic of multiple sequence alignment problems is to determine the most biologically plausible alignments of protein or DNA sequences. In this paper, an alignment method using genetic algorithm for multiple sequence alignment has been proposed. Two different genetic operators mainly crossover and mutation were defined and implemented with the proposed method in order to know the population evolution and quality of the sequence aligned. The proposed method is assessed with protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8 etc. Experiments on a wide range of data have shown that the proposed algorithm is much better (it terms of score) than previously proposed algorithms in its ability to achieve high alignment quality. PMID:27065770

  19. The complete nucleotide sequence of pelargonium leaf curl virus.

    PubMed

    McGavin, Wendy J; MacFarlane, Stuart A

    2016-05-01

    Investigation of a tombusvirus isolated from tulip plants in Scotland revealed that it was pelargonium leaf curl virus (PLCV) rather than the originally suggested tomato bushy stunt virus. The complete sequence of the PLCV genome was determined for the first time, revealing it to be 4789 nucleotides in size and to have an organization similar to that of the other, previously described tombusviruses. Primers derived from the sequence were used to construct a full-length infectious clone of PLCV that recapitulates the disease symptoms of leaf curling in systemically infected pelargonium plants. PMID:26906694

  20. PhyPA: Phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences.

    PubMed

    Xia, Xuhua

    2016-09-01

    While pairwise sequence alignment (PSA) by dynamic programming is guaranteed to generate one of the optimal alignments, multiple sequence alignment (MSA) of highly divergent sequences often results in poorly aligned sequences, plaguing all subsequent phylogenetic analysis. One way to avoid this problem is to use only PSA to reconstruct phylogenetic trees, which can only be done with distance-based methods. I compared the accuracy of this new computational approach (named PhyPA for phylogenetics by pairwise alignment) against the maximum likelihood method using MSA (the ML+MSA approach), based on nucleotide, amino acid and codon sequences simulated with different topologies and tree lengths. I present a surprising discovery that the fast PhyPA method consistently outperforms the slow ML+MSA approach for highly diverged sequences even when all optimization options were turned on for the ML+MSA approach. Only when sequences are not highly diverged (i.e., when a reliable MSA can be obtained) does the ML+MSA approach outperforms PhyPA. The true topologies are always recovered by ML with the true alignment from the simulation. However, with MSA derived from alignment programs such as MAFFT or MUSCLE, the recovered topology consistently has higher likelihood than that for the true topology. Thus, the failure to recover the true topology by the ML+MSA is not because of insufficient search of tree space, but by the distortion of phylogenetic signal by MSA methods. I have implemented in DAMBE PhyPA and two approaches making use of multi-gene data sets to derive phylogenetic support for subtrees equivalent to resampling techniques such as bootstrapping and jackknifing. PMID:27377322

  1. A simple method to control over-alignment in the MAFFT multiple sequence alignment program

    PubMed Central

    Katoh, Kazutaka; Standley, Daron M.

    2016-01-01

    Motivation: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment is recently becoming greater, as low-quality or noisy sequences are increasing in protein sequence databases, due, for example, to sequencing errors and difficulty in gene prediction. Results: The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method significantly increases the correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. Availability and implementation: The new feature is available in MAFFT versions 7.263 and higher. http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27153688

  2. Nucleotide sequences of five anti-lysozyme monoclonal antibodies.

    PubMed Central

    Darsley, M J; Rees, A R

    1985-01-01

    The nucleotide sequences of the heavy and light chain immunoglobulin mRNAs derived from five hybridomas (Gloop 1-5) secreting IgGs specific for the loop region of hen egg lysozyme were determined. These monoclonal antibodies recognise three distinct but overlapping epitopes within the loop region. The sequences of two pairs of antibodies with indistinguishable fine specificities were similar in both chains whereas the sequences of antibodies of non-identical specificities were very different. It is proposed that the D-segments expressed in two of the antibodies (Gloop3 and Gloop4) are the products of one, or perhaps two, previously unidentified germ line D-genes. Gloop1 and Gloop2 use a D-segment previously identified in antibodies specific for the hapten 2-phenyloxazolone; however it is recombined in a different reading frame in the anti-lysozyme antibodies, producing a different amino acid sequence. PMID:2410256

  3. The role of surface energy in guanosine nucleotide alignment: an intriguing scenario.

    PubMed

    Tone, Caterina M; De Santo, Maria P; Ciuchi, Federica

    2014-07-01

    In this paper we report how the confining surfaces and the ionic effects of different concentration of guanosine solution can be used to vary the alignment of liquid crystal phases of guanosine nucleotides. Liquid crystal phases of guanosine 5'-monophosphate ammonium salt and guanosine 5'-monophosphate free acid in pure water, with and without silver sulphate, were studied by polarized optical microscope. A periodic modulation of the texture was observed. This modulation depends on both on the concentration and on the presence of silver ions in the liquid crystal phase. We demonstrate that, according to the surface energy of the alignment layers, it is possible to homeotropically align the guanosine chromonic phase without applying any external magnetic field. Finally, we report the formation of spherical, vesicle-like guanosine 5'-monophosphate aggregates, when the solution was confined between two hydrophobic surfaces containing exposed Si groups. PMID:24832053

  4. An information theoretic approach to macromolecular modeling: I. Sequence alignments.

    PubMed

    Aynechi, Tiba; Kuntz, Irwin D

    2005-11-01

    We are interested in applying the principles of information theory to structural biology calculations. In this article, we explore the information content of an important computational procedure: sequence alignment. Using a reference state developed from exhaustive sequences, we measure alignment statistics and evaluate gap penalties based on first-principle considerations and gap distributions. We show that there are different gap penalties for different alphabet sizes and that the gap penalties can depend on the length of the sequences being aligned. In a companion article, we examine the information content of molecular force fields. PMID:16254389

  5. Bioinformatics comparison of sulfate-reducing metabolism nucleotide sequences

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Nguyen, A.; Cheung, E.; Sullivan, R.; Holden, T.; Lieberman, D.; Cheung, T.

    2015-09-01

    The sulfate-reducing bacteria can be traced back to 3.5 billion years ago. The thermodynamics details of the sulfur cycle have been well documented. A recent sulfate-reducing bacteria report (Robator, Jungbluth, et al , 2015 Jan, Front. Microbiol) with Genbank nucleotide data has been analyzed in terms of the sulfite reductase (dsrAB) via fractal dimension and entropy values. Comparison to oil field sulfate-reducing sequences was included. The AUCG translational mass fractal dimension versus ATCG transcriptional mass fractal dimension for the low temperature dsrB and dsrA sequences reported in Reference Thirteen shows correlation R-sq ~ 0.79 , with a probably of about 3% in simulation. A recent report of using Cystathionine gamma-lyase sequence to produce CdS quantum dot in a biological method, where the sulfur is reduced just like in the H2S production process, was included for comparison. The AUCG mass fractal dimension versus ATCG mass fractal dimension for the Cystathionine gamma-lyase sequences was found to have R-sq of 0.72, similar to the low temperature dissimilatory sulfite reductase dsr group with 3% probability, in contrary to the oil field group having R-sq ~ 0.94, a high probable outcome in the simulation. The other two simulation histograms, namely, fractal dimension versus entropy R-sq outcome values, and di-nucleotide entropy versus mono-nucleotide entropy R-sq outcome values are also discussed in the data analysis focusing on low probability outcomes.

  6. Improving multiple sequence alignment by using better guide trees

    PubMed Central

    2015-01-01

    Progressive sequence alignment is one of the most commonly used method for multiple sequence alignment. Roughly speaking, the method first builds a guide tree, and then aligns the sequences progressively according to the topology of the tree. It is believed that guide trees are very important to progressive alignment; a better guide tree will give an alignment with higher accuracy. Recently, we have proposed an adaptive method for constructing guide trees. This paper studies the quality of the guide trees constructed by such method. Our study showed that our adaptive method can be used to improve the accuracy of many different progressive MSA tools. In fact, we give evidences showing that the guide trees constructed by the adaptive method are among the best. PMID:25859903

  7. Cytochrome b nucleotide sequence variation among the Atlantic Alcidae.

    PubMed

    Friesen, V L; Montevecchi, W A; Davidson, W S

    1993-01-01

    Analysis of cytochrome b nucleotide sequences of the six extant species of Atlantic alcids and a gull revealed an excess of adenines and cytosines and a deficit of guanines at silent sites on the coding strand. Phylogenetic analyses grouped the sequences of the common (Uria aalge) and Brünnich's (U. lomvia) guillemots, followed by the razorbill (Alca torda) and little auk (Alle alle). The black guillemot (Cepphus grylle) sequence formed a sister taxon, and the puffin (Fratercula arctica) fell outside the other alcids. Phylogenetic comparisons of substitutions indicated that mutabilities of bases did not differ, but that C was much more likely to be incorporated than was G. Imbalances in base composition appear to result from a strand bias in replication errors, which may result from selection on secondary RNA structure and/or the energetics of codon-anticodon interactions. PMID:7916741

  8. The nucleotide sequence of the bacteriophage T5 ltf gene.

    PubMed

    Kaliman, A V; Kulshin, V E; Shlyapnikov, M G; Ksenzenko, V N; Kryukov, V M

    1995-06-01

    The nucleotide sequence of the bacteriophage T5 Bg/II-BamHI fragment (4,835 bp in length) known to carry a gene encoding the LTF protein which forms the phage L-shaped tail fibers was determined. It was shown to contain an open reading frame for 1,396 amino acid residues that corresponds to a protein of 147.8 kDa. The coding region of ltf gene is preceded by a typical Shine-Dalgarno sequence. Downstream from the ltf gene there is a strong transcription terminator. Data bank analysis of the LTF protein sequence reveals 55.1% identity to the hypothetical protein ORF 401 of bacteriophage lambda in a segment of 118 amino acids overlap. PMID:7789514

  9. Look-Align: an interactive web-based multiple sequence alignment viewer with polymorphism analysis support

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We have developed Look-Align, an interactive web-based viewer to display pre-computed multiple sequence alignments. Although initially developed to support the visualization needs of the maize diversity website Panzea (http://www.panzea.org), the viewer is a generic stand-alone tool that can be easi...

  10. Refinement by shifting secondary structure elements improves sequence alignments.

    PubMed

    Tong, Jing; Pei, Jimin; Otwinowski, Zbyszek; Grishin, Nick V

    2015-03-01

    Constructing a model of a query protein based on its alignment to a homolog with experimentally determined spatial structure (the template) is still the most reliable approach to structure prediction. Alignment errors are the main bottleneck for homology modeling when the query is distantly related to the template. Alignment methods often misalign secondary structural elements by a few residues. Therefore, better alignment solutions can be found within a limited set of local shifts of secondary structures. We present a refinement method to improve pairwise sequence alignments by evaluating alignment variants generated by local shifts of template-defined secondary structures. Our method SFESA is based on a novel scoring function that combines the profile-based sequence score and the structure score derived from residue contacts in a template. Such a combined score frequently selects a better alignment variant among a set of candidate alignments generated by local shifts and leads to overall increase in alignment accuracy. Evaluation of several benchmarks shows that our refinement method significantly improves alignments made by automatic methods such as PROMALS, HHpred and CNFpred. The web server is available at http://prodata.swmed.edu/sfesa. PMID:25546158

  11. Refinement by shifting secondary structure elements improves sequence alignments

    PubMed Central

    Tong, Jing; Pei, Jimin; Otwinowski, Zbyszek; Grishin, Nick V.

    2015-01-01

    Constructing a model of a query protein based on its alignment to a homolog with experimentally determined spatial structure (the template) is still the most reliable approach to structure prediction. Alignment errors are the main bottleneck for homology modeling when the query is distantly related to the template. Alignment methods often misalign secondary structural elements by a few residues. Therefore, better alignment solutions can be found within a limited set of local shifts of secondary structures. We present a refinement method to improve pairwise sequence alignments by evaluating alignment variants generated by local shifts of template-defined secondary structures. Our method SFESA is based on a novel scoring function that combines the profile-based sequence score and the structure score derived from residue contacts in a template. Such a combined score frequently selects a better alignment variant among a set of candidate alignments generated by local shifts and leads to overall increase in alignment accuracy. Evaluation of several benchmarks shows that our refinement method significantly improves alignments made by automatic methods such as PROMALS, HHpred and CNFpred. The web server is available at http://prodata.swmed.edu/sfesa. PMID:25546158

  12. Evidence of Statistical Inconsistency of Phylogenetic Methods in the Presence of Multiple Sequence Alignment Uncertainty

    PubMed Central

    Md Mukarram Hossain, A.S.; Blackburne, Benjamin P.; Shah, Abhijeet; Whelan, Simon

    2015-01-01

    Evolutionary studies usually use a two-step process to investigate sequence data. Step one estimates a multiple sequence alignment (MSA) and step two applies phylogenetic methods to ask evolutionary questions of that MSA. Modern phylogenetic methods infer evolutionary parameters using maximum likelihood or Bayesian inference, mediated by a probabilistic substitution model that describes sequence change over a tree. The statistical properties of these methods mean that more data directly translates to an increased confidence in downstream results, providing the substitution model is adequate and the MSA is correct. Many studies have investigated the robustness of phylogenetic methods in the presence of substitution model misspecification, but few have examined the statistical properties of those methods when the MSA is unknown. This simulation study examines the statistical properties of the complete two-step process when inferring sequence divergence and the phylogenetic tree topology. Both nucleotide and amino acid analyses are negatively affected by the alignment step, both through inaccurate guide tree estimates and through overfitting to that guide tree. For many alignment tools these effects become more pronounced when additional sequences are added to the analysis. Nucleotide sequences are particularly susceptible, with MSA errors leading to statistical support for long-branch attraction artifacts, which are usually associated with gross substitution model misspecification. Amino acid MSAs are more robust, but do tend to arbitrarily resolve multifurcations in favor of the guide tree. No inference strategies produce consistently accurate estimates of divergence between sequences, although amino acid MSAs are again more accurate than their nucleotide counterparts. We conclude with some practical suggestions about how to limit the effect of MSA uncertainty on evolutionary inference. PMID:26139831

  13. Nucleotide sequence and expression of a Drosophila metallothionein.

    PubMed

    Lastowski-Perry, D; Otto, E; Maroni, G

    1985-02-10

    A Drosophila melanogaster cDNA clone was isolated based on its more intense hybridization to RNA sequences from copper-fed larvae than from control larval RNA. This clone showed strong hybridization to mouse metallothionein I cDNA at reduced stringency. Its nucleotide sequence includes an open reading segment which codes for a 40-amino acid protein; this protein is identified as metallothionein based on its similarity to the amino-terminal portion of mammalian and crab metalloproteins. The 10 cysteine residues present occur in five pairs of near vicinal cysteines (Cys-X-Cys). This cDNA sequence hybridized to a 400-nucleotide polyadenylated RNA whose presence in the cells of the alimentary canal of larvae was stimulated by ingestion of cadmium or copper; in other tissues this RNA was present at much lower levels. Mercury, silver, and zinc induced metallothionein to a lesser extent. The level of metallothionein RNA increased very soon after the initiation of metal treatment and reached a maximum after approximately 36 h. PMID:2578462

  14. Nucleotide sequence corresponding to five chemotaxis genes in Escherichia coli.

    PubMed Central

    Mutoh, N; Simon, M I

    1986-01-01

    The nucleotide sequence of DNA which contains five chemotaxis-related genes of Escherichia coli, cheW, cheR, cheB, cheY, and cheZ, and part of the cheA gene was determined. Molecular weights of the polypeptides encoded by these genes were calculated from translated amino acid sequences, and they were 18,100 for cheW, 32,700 for cheR, 37,500 for cheB, 14,100 for cheY, and 24,000 for cheZ. Nucleotide sequences which could act as ribosome-binding sites were found in the upstream region of each gene. After the termination codon of the cheW gene, a typical rho-independent transcription termination signal was observed. There are no other open reading frames long enough to encode polypeptides in this region except those which code for the two previously reported genes tar and tap. PMID:3510184

  15. Protein multiple sequence alignment by hybrid bio-inspired algorithms.

    PubMed

    Cutello, Vincenzo; Nicosia, Giuseppe; Pavone, Mario; Prizzi, Igor

    2011-03-01

    This article presents an immune inspired algorithm to tackle the Multiple Sequence Alignment (MSA) problem. MSA is one of the most important tasks in biological sequence analysis. Although this paper focuses on protein alignments, most of the discussion and methodology may also be applied to DNA alignments. The problem of finding the multiple alignment was investigated in the study by Bonizzoni and Vedova and Wang and Jiang, and proved to be a NP-hard (non-deterministic polynomial-time hard) problem. The presented algorithm, called Immunological Multiple Sequence Alignment Algorithm (IMSA), incorporates two new strategies to create the initial population and specific ad hoc mutation operators. It is based on the 'weighted sum of pairs' as objective function, to evaluate a given candidate alignment. IMSA was tested using both classical benchmarks of BAliBASE (versions 1.0, 2.0 and 3.0), and experimental results indicate that it is comparable with state-of-the-art multiple alignment algorithms, in terms of quality of alignments, weighted Sums-of-Pairs (SP) and Column Score (CS) values. The main novelty of IMSA is its ability to generate more than a single suboptimal alignment, for every MSA instance; this behaviour is due to the stochastic nature of the algorithm and of the populations evolved during the convergence process. This feature will help the decision maker to assess and select a biologically relevant multiple sequence alignment. Finally, the designed algorithm can be used as a local search procedure to properly explore promising alignments of the search space. PMID:21071394

  16. Nucleotide sequence of Bacillus phage Nf terminal protein gene.

    PubMed Central

    Leavitt, M C; Ito, J

    1987-01-01

    The nucleotide sequence of Bacillus phage Nf gene E has been determined. Gene E codes for phage terminal protein which is the primer necessary for the initiation of DNA replication. The deduced amino acid sequence of Nf terminal protein is approximately 66% homologous with the terminal proteins of Bacillus phages PZA and luminal diameter 29, and shows similar hydropathy and secondary structure predictions. A serine which has been identified as the residue which covalently links the protein to the 5' end of the genome in luminal diameter 29, is conserved in all three phages. The hydropathic and secondary structural environment of this serine is similar in these phage terminal proteins and also similar to the linking serine of adenovirus terminal protein. PMID:3601672

  17. Nucleotide sequences specific to Brucella and methods for the detection of Brucella

    DOEpatents

    McCready, Paula M.; Radnedge, Lyndsay; Andersen, Gary L.; Ott, Linda L.; Slezak, Thomas R.; Kuczmarski, Thomas A.

    2009-02-24

    Nucleotide sequences specific to Brucella that serves as a marker or signature for identification of this bacterium were identified. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.

  18. Nucleotide sequences specific to Francisella tularensis and methods for the detection of Francisella tularensis

    DOEpatents

    McCready, Paula M.; Radnedge, Lyndsay; Andersen, Gary L.; Ott, Linda L.; Slezak, Thomas R.; Kuczmarski, Thomas A.; Vitalis, Elizabeth A

    2007-02-06

    Described herein is the identification of nucleotide sequences specific to Francisella tularensis that serves as a marker or signature for identification of this bacterium. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.

  19. Nucleotide sequences specific to Francisella tularensis and methods for the detection of Francisella tularensis

    DOEpatents

    McCready, Paula M.; Radnedge, Lyndsay; Andersen, Gary L.; Ott, Linda L.; Slezak, Thomas R.; Kuczmarski, Thomas A.; Vitalis, Elizabeth A

    2009-02-24

    Described herein is the identification of nucleotide sequences specific to Francisella tularensis that serves as a marker or signature for identification of this bacterium. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.

  20. Nucleotide sequences specific to Yersinia pestis and methods for the detection of Yersinia pestis

    DOEpatents

    McCready, Paula M.; Radnedge, Lyndsay; Andersen, Gary L.; Ott, Linda L.; Slezak, Thomas R.; Kuczmarski, Thomas A.; Motin, Vladinir L.

    2009-02-24

    Nucleotide sequences specific to Yersinia pestis that serve as markers or signatures for identification of this bacterium were identified. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.

  1. Scoring consensus of multiple ECG annotators by optimal sequence alignment.

    PubMed

    Haghpanahi, Masoumeh; Sameni, Reza; Borkholder, David A

    2014-01-01

    Development of ECG delineation algorithms has been an area of intense research in the field of computational cardiology for the past few decades. However, devising evaluation techniques for scoring and/or merging the results of such algorithms, both in the presence or absence of gold standards, still remains as a challenge. This is mainly due to existence of missed or erroneous determination of fiducial points in the results of different annotation algorithms. The discrepancy between different annotators increases when the reference signal includes arrhythmias or significant noise and its morphology deviates from a clean ECG signal. In this work, we propose a new approach to evaluate and compare the results of different annotators under such conditions. Specifically, we use sequence alignment techniques similar to those used in bioinformatics for the alignment of gene sequences. Our approach is based on dynamic programming where adequate mismatch penalties, depending on the type of the fiducial point and the underlying signal, are defined to optimally align the annotation sequences. We also discuss how to extend the algorithm for more than two sequences by using suitable data structures to align multiple annotation sequences with each other. Once the sequences are aligned, different heuristics are devised to evaluate the performance against a gold standard annotation, or to merge the results of multiple annotations when no gold standard exists. PMID:25570339

  2. Generalized Levy-walk model for DNA nucleotide sequences

    NASA Technical Reports Server (NTRS)

    Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Simons, M.; Stanley, H. E.

    1993-01-01

    We propose a generalized Levy walk to model fractal landscapes observed in noncoding DNA sequences. We find that this model provides a very close approximation to the empirical data and explains a number of statistical properties of genomic DNA sequences such as the distribution of strand-biased regions (those with an excess of one type of nucleotide) as well as local changes in the slope of the correlation exponent alpha. The generalized Levy-walk model simultaneously accounts for the long-range correlations in noncoding DNA sequences and for the apparently paradoxical finding of long subregions of biased random walks (length lj) within these correlated sequences. In the generalized Levy-walk model, the lj are chosen from a power-law distribution P(lj) varies as lj(-mu). The correlation exponent alpha is related to mu through alpha = 2-mu/2 if 2 < mu < 3. The model is consistent with the finding of "repetitive elements" of variable length interspersed within noncoding DNA.

  3. Mercury BLASTP: Accelerating Protein Sequence Alignment.

    PubMed

    Jacob, Arpith; Lancaster, Joseph; Buhler, Jeremy; Harris, Brandon; Chamberlain, Roger D

    2008-06-01

    Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more running time or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we describe the architecture of the portions of the application that are accelerated in the FPGA, and we also describe the integration of these FPGA-accelerated portions with the existing BLASTP software. We have implemented Mercury BLASTP on a commodity workstation with two Xilinx Virtex-II 6000 FPGAs. We show that the new design runs 11-15 times faster than software BLASTP on a modern CPU while delivering close to 99% identical results. PMID:19492068

  4. Mercury BLASTP: Accelerating Protein Sequence Alignment

    PubMed Central

    Jacob, Arpith; Lancaster, Joseph; Buhler, Jeremy; Harris, Brandon; Chamberlain, Roger D.

    2008-01-01

    Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more running time or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we describe the architecture of the portions of the application that are accelerated in the FPGA, and we also describe the integration of these FPGA-accelerated portions with the existing BLASTP software. We have implemented Mercury BLASTP on a commodity workstation with two Xilinx Virtex-II 6000 FPGAs. We show that the new design runs 11-15 times faster than software BLASTP on a modern CPU while delivering close to 99% identical results. PMID:19492068

  5. Evaluating the Accuracy and Efficiency of Multiple Sequence Alignment Methods

    PubMed Central

    Pervez, Muhammad Tariq; Babar, Masroor Ellahi; Nadeem, Asif; Aslam, Muhammad; Awan, Ali Raza; Aslam, Naeem; Hussain, Tanveer; Naveed, Nasir; Qadri, Salman; Waheed, Usman; Shoaib, Muhammad

    2014-01-01

    A comparison of 10 most popular Multiple Sequence Alignment (MSA) tools, namely, MUSCLE, MAFFT(L-INS-i), MAFFT (FFT-NS-2), T-Coffee, ProbCons, SATe, Clustal Omega, Kalign, Multalin, and Dialign-TX is presented. We also focused on the significance of some implementations embedded in algorithm of each tool. Based on 10 simulated trees of different number of taxa generated by R, 400 known alignments and sequence files were constructed using indel-Seq-Gen. A total of 4000 test alignments were generated to study the effect of sequence length, indel size, deletion rate, and insertion rate. Results showed that alignment quality was highly dependent on the number of deletions and insertions in the sequences and that the sequence length and indel size had a weaker effect. Overall, ProbCons was consistently on the top of list of the evaluated MSA tools. SATe, being little less accurate, was 529.10% faster than ProbCons and 236.72% faster than MAFFT(L-INS-i). Among other tools, Kalign and MUSCLE achieved the highest sum of pairs. We also considered BALiBASE benchmark datasets and the results relative to BAliBASE- and indel-Seq-Gen-generated alignments were consistent in the most cases. PMID:25574120

  6. Empirical Bayes Estimation of Coalescence Times from Nucleotide Sequence Data.

    PubMed

    King, Leandra; Wakeley, John

    2016-09-01

    We demonstrate the advantages of using information at many unlinked loci to better calibrate estimates of the time to the most recent common ancestor (TMRCA) at a given locus. To this end, we apply a simple empirical Bayes method to estimate the TMRCA. This method is both asymptotically optimal, in the sense that the estimator converges to the true value when the number of unlinked loci for which we have information is large, and has the advantage of not making any assumptions about demographic history. The algorithm works as follows: we first split the sample at each locus into inferred left and right clades to obtain many estimates of the TMRCA, which we can average to obtain an initial estimate of the TMRCA. We then use nucleotide sequence data from other unlinked loci to form an empirical distribution that we can use to improve this initial estimate. PMID:27440864

  7. Recursive dynamic programming for adaptive sequence and structure alignment

    SciTech Connect

    Thiele, R.; Zimmer, R.; Lengauer, T.

    1995-12-31

    We propose a new alignment procedure that is capable of aligning protein sequences and structures in a unified manner. Recursive dynamic programming (RDP) is a hierarchical method which, on each level of the hierarchy, identifies locally optimal solutions and assembles them into partial alignments of sequences and/or structures. In contrast to classical dynamic programming, RDP can also handle alignment problems that use objective functions not obeying the principle of prefix optimality, e.g. scoring schemes derived from energy potentials of mean force. For such alignment problems, RDP aims at computing solutions that are near-optimal with respect to the involved cost function and biologically meaningful at the same time. Towards this goal, RDP maintains a dynamic balance between different factors governing alignment fitness such as evolutionary relationships and structural preferences. As in the RDP method gaps are not scored explicitly, the problematic assignment of gap cost parameters is circumvented. In order to evaluate the RDP approach we analyse whether known and accepted multiple alignments based on structural information can be reproduced with the RDP method.

  8. Complete nucleotide sequence of Nootka lupine vein-clearing virus.

    PubMed

    Robertson, Nancy L; Côté, Fabien; Paré, Christine; Leblanc, Eric; Bergeron, Michel G; Leclerc, Denis

    2007-12-01

    The complete genome sequence of Nootka lupine vein-clearing virus (NLVCV) was determined to be 4,172 nucleotides in length containing four open reading frames (ORFs) with a similar genetic organization of virus species in the genus Carmovirus, family Tombusviridae. The order and gene product size, starting from the 5'-proximal ORF consisted of: (1) polymerase/replicase gene, ORF1 (p27) and ORF1RT (readthrough) (p87), (2) movement proteins ORF2 (p7) and ORF3 (p9), and, (3) the 3'-proximal coat protein ORF4, (p37). The genomic 5'- and 3'-proximal termini contained a short (59 nt) and a relatively longer 405 nt untranslated region, respectively. The longer replicase gene product contained the GDD motif common to RNA-dependent RNA polymerases. Phylogenetically, NLVCV formed a subgroup with the following four carmoviruses when separately comparing the amino acids of the coat protein or replicase protein: Angelonia flower break virus (AnFBV), Carnation mottle virus (CarMV), Pelargonium flower break virus (PFBV), and Saguaro cactus virus (SgCV). Whole genome nucleotide analysis (percent identities) among the carmoviruses with NLVCV suggested a similar pattern. The species demarcation criteria in the genus Carmovirus for the amino acid sequence identity of the polymerase (<52%) and coat (<41%) protein genes restricted NLVCV as a distinct species, and instead, placed it as a tentative strain of CarMV, PFBV, or SgCV when both the polymerase and CP were used as the determining factors. In contrast, the species criteria that included different host ranges with no overlap and lack of serology relatedness between NLVCV and the carmoviruses, suggested that NLVCV was a distinct species. The relatively low cutoff percentages allowed for the polymerase and CP genes to dictate the inclusion/exclusion of a distinct carmovirus species should be reevaluated. Therefore, at this time we have concluded that NLVCV should be classified as a tentative new species in the genus Carmovirus

  9. Image-based temporal alignment of echocardiographic sequences

    NASA Astrophysics Data System (ADS)

    Danudibroto, Adriyana; Bersvendsen, Jørn; Mirea, Oana; Gerard, Olivier; D'hooge, Jan; Samset, Eigil

    2016-04-01

    Temporal alignment of echocardiographic sequences enables fair comparisons of multiple cardiac sequences by showing corresponding frames at given time points in the cardiac cycle. It is also essential for spatial registration of echo volumes where several acquisitions are combined for enhancement of image quality or forming larger field of view. In this study, three different image-based temporal alignment methods were investigated. First, a method based on dynamic time warping (DTW). Second, a spline-based method that optimized the similarity between temporal characteristic curves of the cardiac cycle using 1D cubic B-spline interpolation. Third, a method based on the spline-based method with piecewise modification. These methods were tested on in-vivo data sets of 19 echo sequences. For each sequence, the mitral valve opening (MVO) time was manually annotated. The results showed that the average MVO timing error for all methods are well under the time resolution of the sequences.

  10. A max-margin model for efficient simultaneous alignment and folding of RNA sequences

    PubMed Central

    Do, Chuong B.; Foo, Chuan-Sheng; Batzoglou, Serafim

    2008-01-01

    Motivation: The need for accurate and efficient tools for computational RNA structure analysis has become increasingly apparent over the last several years: RNA folding algorithms underlie numerous applications in bioinformatics, ranging from microarray probe selection to de novo non-coding RNA gene prediction. In this work, we present RAF (RNA Alignment and Folding), an efficient algorithm for simultaneous alignment and consensus folding of unaligned RNA sequences. Algorithmically, RAF exploits sparsity in the set of likely pairing and alignment candidates for each nucleotide (as identified by the CONTRAfold or CONTRAlign programs) to achieve an effectively quadratic running time for simultaneous pairwise alignment and folding. RAF's fast sparse dynamic programming, in turn, serves as the inference engine within a discriminative machine learning algorithm for parameter estimation. Results: In cross-validated benchmark tests, RAF achieves accuracies equaling or surpassing the current best approaches for RNA multiple sequence secondary structure prediction. However, RAF requires nearly an order of magnitude less time than other simultaneous folding and alignment methods, thus making it especially appropriate for high-throughput studies. Availability: Source code for RAF is available at:http://contra.stanford.edu/contrafold/ Contact: chuongdo@cs.stanford.edu PMID:18586747

  11. Heuristic reusable dynamic programming: efficient updates of local sequence alignment.

    PubMed

    Hong, Changjin; Tewfik, Ahmed H

    2009-01-01

    Recomputation of the previously evaluated similarity results between biological sequences becomes inevitable when researchers realize errors in their sequenced data or when the researchers have to compare nearly similar sequences, e.g., in a family of proteins. We present an efficient scheme for updating local sequence alignments with an affine gap model. In principle, using the previous matching result between two amino acid sequences, we perform a forward-backward alignment to generate heuristic searching bands which are bounded by a set of suboptimal paths. Given a correctly updated sequence, we initially predict a new score of the alignment path for each contour to select the best candidates among them. Then, we run the Smith-Waterman algorithm in this confined space. Furthermore, our heuristic alignment for an updated sequence shows that it can be further accelerated by using reusable dynamic programming (rDP), our prior work. In this study, we successfully validate "relative node tolerance bound" (RNTB) in the pruned searching space. Furthermore, we improve the computational performance by quantifying the successful RNTB tolerance probability and switch to rDP on perturbation-resilient columns only. In our searching space derived by a threshold value of 90 percent of the optimal alignment score, we find that 98.3 percent of contours contain correctly updated paths. We also find that our method consumes only 25.36 percent of the runtime cost of sparse dynamic programming (sDP) method, and to only 2.55 percent of that of a normal dynamic programming with the Smith-Waterman algorithm. PMID:19875856

  12. The nucleotide sequence of the uvrD gene of E. coli.

    PubMed Central

    Finch, P W; Emmerson, P T

    1984-01-01

    The nucleotide sequence of a cloned section of the E. coli chromosome containing the uvrD gene has been determined. The coding region for the UvrD protein consists of 2,160 nucleotides which would direct the synthesis of a polypeptide 720 amino acids long with a calculated molecular weight of 82 kd. The predicted amino acid sequence of the UvrD protein has been compared with the amino acid sequences of other known adenine nucleotide binding proteins and a common sequence has been identified, thought to contribute towards adenine nucleotide binding. PMID:6379604

  13. A novel approach to multiple sequence alignment using hadoop data grids.

    PubMed

    Sudha Sadasivam, G; Baktavatchalam, G

    2010-01-01

    Multiple alignment of protein sequences helps to determine evolutionary linkage and to predict molecular structures. The factors to be considered while aligning multiple sequences are speed and accuracy of alignment. Although dynamic programming algorithms produce accurate alignments, they are computation intensive. In this paper we propose a time efficient approach to sequence alignment that also produces quality alignment. The dynamic nature of the algorithm coupled with data and computational parallelism of hadoop data grids improves the accuracy and speed of sequence alignment. The principle of block splitting in hadoop coupled with its scalability facilitates alignment of very large sequences. PMID:21224205

  14. Spatially localized generation of nucleotide sequence-specific DNA damage

    PubMed Central

    Oh, Dennis H.; King, Brett A.; Boxer, Steven G.; Hanawalt, Philip C.

    2001-01-01

    Psoralens linked to triplex-forming oligonucleotides (psoTFOs) have been used in conjunction with laser-induced two-photon excitation (TPE) to damage a specific DNA target sequence. To demonstrate that TPE can initiate photochemistry resulting in psoralen–DNA photoadducts, target DNA sequences were incubated with psoTFOs to form triple-helical complexes and then irradiated in liquid solution with pulsed 765-nm laser light, which is half the quantum energy required for conventional one-photon excitation, as used in psoralen + UV A radiation (320–400 nm) therapy. Target DNA acquired strand-specific psoralen monoadducts in a light dose-dependent fashion. To localize DNA damage in a model tissue-like medium, a DNA–psoTFO mixture was prepared in a polyacrylamide gel and then irradiated with a converging laser beam targeting the rear of the gel. The highest number of photoadducts formed at the rear while relatively sparing DNA at the front of the gel, demonstrating spatial localization of sequence-specific DNA damage by TPE. To assess whether TPE treatment could be extended to cells without significant toxicity, cultured monolayers of normal human dermal fibroblasts were incubated with tritium-labeled psoralen without TFO to maximize detectable damage and irradiated by TPE. DNA from irradiated cells treated with psoralen exhibited a 4- to 7-fold increase in tritium activity relative to untreated controls. Functional survival assays indicated that the psoralen–TPE treatment was not toxic to cells. These results demonstrate that DNA damage can be simultaneously manipulated at the nucleotide level and in three dimensions. This approach for targeting photochemical DNA damage may have photochemotherapeutic applications in skin and other optically accessible tissues. PMID:11572980

  15. Spatially localized generation of nucleotide sequence-specific DNA damage.

    PubMed

    Oh, D H; King, B A; Boxer, S G; Hanawalt, P C

    2001-09-25

    Psoralens linked to triplex-forming oligonucleotides (psoTFOs) have been used in conjunction with laser-induced two-photon excitation (TPE) to damage a specific DNA target sequence. To demonstrate that TPE can initiate photochemistry resulting in psoralen-DNA photoadducts, target DNA sequences were incubated with psoTFOs to form triple-helical complexes and then irradiated in liquid solution with pulsed 765-nm laser light, which is half the quantum energy required for conventional one-photon excitation, as used in psoralen + UV A radiation (320-400 nm) therapy. Target DNA acquired strand-specific psoralen monoadducts in a light dose-dependent fashion. To localize DNA damage in a model tissue-like medium, a DNA-psoTFO mixture was prepared in a polyacrylamide gel and then irradiated with a converging laser beam targeting the rear of the gel. The highest number of photoadducts formed at the rear while relatively sparing DNA at the front of the gel, demonstrating spatial localization of sequence-specific DNA damage by TPE. To assess whether TPE treatment could be extended to cells without significant toxicity, cultured monolayers of normal human dermal fibroblasts were incubated with tritium-labeled psoralen without TFO to maximize detectable damage and irradiated by TPE. DNA from irradiated cells treated with psoralen exhibited a 4- to 7-fold increase in tritium activity relative to untreated controls. Functional survival assays indicated that the psoralen-TPE treatment was not toxic to cells. These results demonstrate that DNA damage can be simultaneously manipulated at the nucleotide level and in three dimensions. This approach for targeting photochemical DNA damage may have photochemotherapeutic applications in skin and other optically accessible tissues. PMID:11572980

  16. SRP-RNA sequence alignment and secondary structure.

    PubMed Central

    Larsen, N; Zwieb, C

    1991-01-01

    The secondary structures of the RNAs from the signal recognition particle, termed SRP-RNA, were derived buy comparative analyses of an alignment of 39 sequences. The models are minimal in that only base pairs are included for which there is comparative evidence. The structures represent refinements of earlier versions and include a new short helix. PMID:1707519

  17. IVisTMSA: Interactive Visual Tools for Multiple Sequence Alignments.

    PubMed

    Pervez, Muhammad Tariq; Babar, Masroor Ellahi; Nadeem, Asif; Aslam, Naeem; Naveed, Nasir; Ahmad, Sarfraz; Muhammad, Shah; Qadri, Salman; Shahid, Muhammad; Hussain, Tanveer; Javed, Maryam

    2015-01-01

    IVisTMSA is a software package of seven graphical tools for multiple sequence alignments. MSApad is an editing and analysis tool. It can load 409% more data than Jalview, STRAP, CINEMA, and Base-by-Base. MSA comparator allows the user to visualize consistent and inconsistent regions of reference and test alignments of more than 21-MB size in less than 12 seconds. MSA comparator is 5,200% efficient and more than 40% efficient as compared to BALiBASE c program and FastSP, respectively. MSA reconstruction tool provides graphical user interfaces for four popular aligners and allows the user to load several sequence files at a time. FASTA generator converts seven formats of alignments of unlimited size into FASTA format in a few seconds. MSA ID calculator calculates identity matrix of more than 11,000 sequences with a sequence length of 2,696 base pairs in less than 100 seconds. Tree and Distance Matrix calculation tools generate phylogenetic tree and distance matrix, respectively, using neighbor joining% identity and BLOSUM 62 matrix. PMID:25861209

  18. IVisTMSA: Interactive Visual Tools for Multiple Sequence Alignments

    PubMed Central

    Pervez, Muhammad Tariq; Babar, Masroor Ellahi; Nadeem, Asif; Aslam, Naeem; Naveed, Nasir; Ahmad, Sarfraz; Muhammad, Shah; Qadri, Salman; Shahid, Muhammad; Hussain, Tanveer; Javed, Maryam

    2015-01-01

    IVisTMSA is a software package of seven graphical tools for multiple sequence alignments. MSApad is an editing and analysis tool. It can load 409% more data than Jalview, STRAP, CINEMA, and Base-by-Base. MSA comparator allows the user to visualize consistent and inconsistent regions of reference and test alignments of more than 21-MB size in less than 12 seconds. MSA comparator is 5,200% efficient and more than 40% efficient as compared to BALiBASE c program and FastSP, respectively. MSA reconstruction tool provides graphical user interfaces for four popular aligners and allows the user to load several sequence files at a time. FASTA generator converts seven formats of alignments of unlimited size into FASTA format in a few seconds. MSA ID calculator calculates identity matrix of more than 11,000 sequences with a sequence length of 2,696 base pairs in less than 100 seconds. Tree and Distance Matrix calculation tools generate phylogenetic tree and distance matrix, respectively, using neighbor joining% identity and BLOSUM 62 matrix. PMID:25861209

  19. SDT: A Virus Classification Tool Based on Pairwise Sequence Alignment and Identity Calculation

    PubMed Central

    Muhire, Brejnev Muhizi; Varsani, Arvind; Martin, Darren Patrick

    2014-01-01

    The perpetually increasing rate at which viral full-genome sequences are being determined is creating a pressing demand for computational tools that will aid the objective classification of these genome sequences. Taxonomic classification approaches that are based on pairwise genetic identity measures are potentially highly automatable and are progressively gaining favour with the International Committee on Taxonomy of Viruses (ICTV). There are, however, various issues with the calculation of such measures that could potentially undermine the accuracy and consistency with which they can be applied to virus classification. Firstly, pairwise sequence identities computed based on multiple sequence alignments rather than on multiple independent pairwise alignments can lead to the deflation of identity scores with increasing dataset sizes. Also, when gap-characters need to be introduced during sequence alignments to account for insertions and deletions, methodological variations in the way that these characters are introduced and handled during pairwise genetic identity calculations can cause high degrees of inconsistency in the way that different methods classify the same sets of sequences. Here we present Sequence Demarcation Tool (SDT), a free user-friendly computer program that aims to provide a robust and highly reproducible means of objectively using pairwise genetic identity calculations to classify any set of nucleotide or amino acid sequences. SDT can produce publication quality pairwise identity plots and colour-coded distance matrices to further aid the classification of sequences according to ICTV approved taxonomic demarcation criteria. Besides a graphical interface version of the program for Windows computers, command-line versions of the program are available for a variety of different operating systems (including a parallel version for cluster computing platforms). PMID:25259891

  20. HIVE-Hexagon: High-Performance, Parallelized Sequence Alignment for Next-Generation Sequencing Data Analysis

    PubMed Central

    Santana-Quintero, Luis; Dingerdissen, Hayley; Thierry-Mieg, Jean; Mazumder, Raja; Simonyan, Vahan

    2014-01-01

    Due to the size of Next-Generation Sequencing data, the computational challenge of sequence alignment has been vast. Inexact alignments can take up to 90% of total CPU time in bioinformatics pipelines. High-performance Integrated Virtual Environment (HIVE), a cloud-based environment optimized for storage and analysis of extra-large data, presents an algorithmic solution: the HIVE-hexagon DNA sequence aligner. HIVE-hexagon implements novel approaches to exploit both characteristics of sequence space and CPU, RAM and Input/Output (I/O) architecture to quickly compute accurate alignments. Key components of HIVE-hexagon include non-redundification and sorting of sequences; floating diagonals of linearized dynamic programming matrices; and consideration of cross-similarity to minimize computations. Availability https://hive.biochemistry.gwu.edu/hive/ PMID:24918764

  1. Nucleotide sequence and proposed secondary structure of Columnea latent viroid: a natural mosaic of viroid sequences.

    PubMed Central

    Hammond, R; Smith, D R; Diener, T O

    1989-01-01

    The Columnea latent viroid (CLV) occurs latently in certain Columnea erythrophae plants grown commercially. In potato and tomato, CLV causes potato spindle tuber viroid (PSTV)-like symptoms. Its nucleotide sequence and proposed secondary structure reveal that CLV consists of a single-stranded circular RNA of 370 nucleotides which can assume a rod-like structure with extensive base-pairing characteristic of all known viroids. The electrophoretic mobility of circular CLV under nondenaturing conditions suggests a potential tertiary structure. CLV contains extensive sequence homologies to the PSTV group of viroids but contains a central conserved region identical to that of hop stunt viroid (HSV). CLV also shares some biological properties with each of the two types of viroids. Most probably, CLV is the result of intracellular RNA recombination between an HSV-type and one or more PSTV-type viroids replicating in the same plant. Images PMID:2602114

  2. Nucleotide sequence of a cloned woodchuck hepatitis virus genome: comparison with the hepatitis B virus sequence.

    PubMed Central

    Galibert, F; Chen, T N; Mandart, E

    1982-01-01

    The complete nucleotide sequence of a woodchuck hepatitis virus genome cloned in Escherichia coli was determined by the method of Maxam and Gilbert. This sequence was found to be 3,308 nucleotides long. Potential ATG initiator triplets and nonsense codons were identified and used to locate regions with a substantial coding capacity. A striking similarity was observed between the organization of human hepatitis B virus and woodchuck hepatitis virus. Nucleotide sequences of these open regions in the woodchuck virus were compared with corresponding regions present in hepatitis B virus. This allowed the location of four viral genes on the L strand and indicated the absence of protein coded by the S strand. Evolution rates of the various parts of the genome as well as of the four different proteins coded by hepatitis B virus and woodchuck hepatitis virus were compared. These results indicated that: (i) the core protein has evolved slightly less rapidly than the other proteins; and (ii) when a region of DNA codes for two different proteins, there is less freedom for the DNA to evolve and, moreover, one of the proteins can evolve more rapidly than the other. A hairpin structure, very well conserved in the two genomes, was located in the only region devoid of coding function, suggesting the location of the origin of replication of the viral DNA. Images PMID:7086958

  3. Distributed sequence alignment applications for the public computing architecture.

    PubMed

    Pellicer, S; Chen, G; Chan, K C C; Pan, Y

    2008-03-01

    The public computer architecture shows promise as a platform for solving fundamental problems in bioinformatics such as global gene sequence alignment and data mining with tools such as the basic local alignment search tool (BLAST). Our implementation of these two problems on the Berkeley open infrastructure for network computing (BOINC) platform demonstrates a runtime reduction factor of 1.15 for sequence alignment and 16.76 for BLAST. While the runtime reduction factor of the global gene sequence alignment application is modest, this value is based on a theoretical sequential runtime extrapolated from the calculation of a smaller problem. Because this runtime is extrapolated from running the calculation in memory, the theoretical sequential runtime would require 37.3 GB of memory on a single system. With this in mind, the BOINC implementation not only offers the reduced runtime, but also the aggregation of the available memory of all participant nodes. If an actual sequential run of the problem were compared, a more drastic reduction in the runtime would be seen due to an additional secondary storage I/O overhead for a practical system. Despite the limitations of the public computer architecture, most notably in communication overhead, it represents a practical platform for grid- and cluster-scale bioinformatics computations today and shows great potential for future implementations. PMID:18334454

  4. Complete nucleotide sequence of a maize chlorotic mottle virus isolate from Nebraska

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The complete genome of a maize chlorotic mottle virus isolate from Nebraska (MCMV-NE) was cloned and sequenced. The MCMV-NE genome consists of 4,436 nucleotides and shares 99.5% nucleotide sequence identity with an MCMV isolate from Kansas (MCMV-KS). Of 22 polymorphic sites, most resulted from t...

  5. Next Generation Semiconductor Based Sequencing of the Donkey (Equus asinus) Genome Provided Comparative Sequence Data against the Horse Genome and a Few Millions of Single Nucleotide Polymorphisms

    PubMed Central

    Bertolini, Francesca; Scimone, Concetta; Geraci, Claudia; Schiavo, Giuseppina; Utzeri, Valerio Joe; Chiofalo, Vincenzo; Fontanesi, Luca

    2015-01-01

    Few studies investigated the donkey (Equus asinus) at the whole genome level so far. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared obtained sequence information with the available donkey draft genome (and its Illumina reads from which it was originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than those aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~ 0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionally important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million of single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources. PMID:26151450

  6. Next Generation Semiconductor Based Sequencing of the Donkey (Equus asinus) Genome Provided Comparative Sequence Data against the Horse Genome and a Few Millions of Single Nucleotide Polymorphisms.

    PubMed

    Bertolini, Francesca; Scimone, Concetta; Geraci, Claudia; Schiavo, Giuseppina; Utzeri, Valerio Joe; Chiofalo, Vincenzo; Fontanesi, Luca

    2015-01-01

    Few studies investigated the donkey (Equus asinus) at the whole genome level so far. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared obtained sequence information with the available donkey draft genome (and its Illumina reads from which it was originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than those aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~ 0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionally important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million of single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources. PMID:26151450

  7. Rapid automatic detection and alignment of repeats in protein sequences.

    PubMed

    Heger, A; Holm, L

    2000-11-01

    Many large proteins have evolved by internal duplication and many internal sequence repeats correspond to functional and structural units. We have developed an automatic algorithm, RADAR, for segmenting a query sequence into repeats. The segmentation procedure has three steps: (i) repeat length is determined by the spacing between suboptimal self-alignment traces; (ii) repeat borders are optimized to yield a maximal integer number of repeats, and (iii) distant repeats are validated by iterative profile alignment. The method identifies short composition biased as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in the query sequence. No manual intervention and no prior assumptions on the number and length of repeats are required. Comparison to the Pfam-A database indicates good coverage, accurate alignments, and reasonable repeat borders. Screening the Swissprot database revealed 3,000 repeats not annotated in existing domain databases. A number of these repeats had been described in the literature but most were novel. This illustrates how in times when curated databases grapple with ever increasing backlogs, automatic (re)analysis of sequences provides an efficient way to capture this important information. PMID:10966575

  8. Sequence Alignment Tools: One Parallel Pattern to Rule Them All?

    PubMed Central

    2014-01-01

    In this paper, we advocate high-level programming methodology for next generation sequencers (NGS) alignment tools for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools to their porting onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are liberated from all complex aspects of parallel programming, such as synchronisation protocols, and task scheduling, gaining more possibility for seamless performance tuning. In this work, we show some use cases in which, by using a high-level approach for parallelising NGS tools, it is possible to obtain comparable or even better absolute performance for all used datasets. PMID:25147803

  9. Complete nucleotide sequence of the temperate bacteriophage LBR48, a new member of the family Myoviridae.

    PubMed

    Jang, Se Hwan; Yoon, Bo Hyun; Chang, Hyo Ihl

    2011-02-01

    The complete genomic sequence of LBR48, a temperate bacteriophage induced from a lysogenic strain of Lactobacillus brevis, was found to be 48,211 nucleotides long and to contain 90 putative open reading frames. Based on structural characteristics obtained from microscopic analysis and nucleic acid sequence determination, phage LBR48 can be classified as a member of the family Myoviridae. Analysis of the genome showed the conserved gene order of previously reported phages of the family Siphoviridae from lactic acid bacteria, despite low nucleotide sequence similarity. Analysis of the attachment sites revealed 15-nucleotide-long core sequences. PMID:20976608

  10. Optimizing Data Intensive GPGPU Computations for DNA Sequence Alignment

    PubMed Central

    Trapnell, Cole; Schatz, Michael C.

    2009-01-01

    MUMmerGPU uses highly-parallel commodity graphics processing units (GPU) to accelerate the data-intensive computation of aligning next generation DNA sequence data to a reference sequence for use in diverse applications such as disease genotyping and personal genomics. MUMmerGPU 2.0 features a new stackless depth-first-search print kernel and is 13× faster than the serial CPU version of the alignment code and nearly 4× faster in total computation time than MUMmerGPU 1.0. We exhaustively examined 128 GPU data layout configurations to improve register footprint and running time and conclude higher occupancy has greater impact than reduced latency. MUMmerGPU is available open-source at http://mummergpu.sourceforge.net. PMID:20161021

  11. Reconfigurable systems for sequence alignment and for general dynamic programming.

    PubMed

    Jacobi, Ricardo P; Ayala-Rincón, Mauricio; Carvalho, Luis G A; Llanos, Carlos H; Hartenstein, Reiner W

    2005-01-01

    Reconfigurable systolic arrays can be adapted to efficiently resolve a wide spectrum of computational problems; parallelism is naturally explored in systolic arrays and reconfigurability allows for redefinition of the interconnections and operations even during run time (dynamically). We present a reconfigurable systolic architecture that can be applied for the efficient treatment of several dynamic programming methods for resolving well-known problems, such as global and local sequence alignment, approximate string matching and longest common subsequence. The dynamicity of the reconfigurability was found to be useful for practical applications in the construction of sequence alignments. A VHDL (VHSIC hardware description language) version of this new architecture was implemented on an APEX FPGA (Field programmable gate array). It would be several magnitudes faster than the software algorithm alternatives. PMID:16342039

  12. Prokaryotic Phylogeny Based on Complete Genomes Without Sequence Alignment

    NASA Astrophysics Data System (ADS)

    Hao, Bailin; Qi, Ji; Wang, Bin

    We present a brief review of a series of on-going work on bacterial phylogeny. We propose a new method to infer relatedness of prokaryotes from their complete genome data without using sequence alignment, leading to results comparable with the bacteriologist's systematics as reflected in the latest 2001 edition of Bergey's Manual of Systematic Bacteriology.1 We only touch on the mathematical aspects of the method. The biological implications of our results will be published elsewhere.

  13. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... acids are not intended to be embraced by this definition. Any amino acid sequence that contains post-translationally modified amino acids may be described as the amino acid sequence that is initially translated... sequence of four or more amino acids or an unbranched sequence of ten or more nucleotides....

  14. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... acids are not intended to be embraced by this definition. Any amino acid sequence that contains post-translationally modified amino acids may be described as the amino acid sequence that is initially translated... sequence of four or more amino acids or an unbranched sequence of ten or more nucleotides....

  15. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... acids are not intended to be embraced by this definition. Any amino acid sequence that contains post-translationally modified amino acids may be described as the amino acid sequence that is initially translated... sequence of four or more amino acids or an unbranched sequence of ten or more nucleotides....

  16. Exploring Dance Movement Data Using Sequence Alignment Methods

    PubMed Central

    Chavoshi, Seyed Hossein; De Baets, Bernard; Neutens, Tijs; De Tré, Guy; Van de Weghe, Nico

    2015-01-01

    Despite the abundance of research on knowledge discovery from moving object databases, only a limited number of studies have examined the interaction between moving point objects in space over time. This paper describes a novel approach for measuring similarity in the interaction between moving objects. The proposed approach consists of three steps. First, we transform movement data into sequences of successive qualitative relations based on the Qualitative Trajectory Calculus (QTC). Second, sequence alignment methods are applied to measure the similarity between movement sequences. Finally, movement sequences are grouped based on similarity by means of an agglomerative hierarchical clustering method. The applicability of this approach is tested using movement data from samba and tango dancers. PMID:26181435

  17. MACSIMS : multiple alignment of complete sequences information management system

    PubMed Central

    Thompson, Julie D; Muller, Arnaud; Waterhouse, Andrew; Procter, Jim; Barton, Geoffrey J; Plewniak, Frédéric; Poch, Olivier

    2006-01-01

    Background In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family. Results MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences with these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIM results, while a graphical display using the JalView program allows manual analysis. Conclusion MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at . PMID:16792820

  18. Nucleotide sequence analysis with polynucleotide kinase and nucleotide `mapping' methods. 5′-Terminal sequence of deoxyribonucleic acid from bacteriophages λ and 424

    PubMed Central

    Murray, Kenneth

    1973-01-01

    The polynucleotide kinase reaction was used in analyses of complex mixtures of oligodeoxynucleotides which were fractionated by various two-dimensional nucleotide `mapping' procedures. Parallel ionophoretic analyses on DEAE-cellulose paper, pH2, and AE-cellulose paper, pH3.5, of venom phosphodiesterase partial digests of 5′-terminally labelled oligonucleotides enabled the sequence of the nucleotides to be deduced uniquely. A `diagonal ionophoresis' method has been used with mixtures of nucleotides. Application of these methods to 5′-terminally labelled DNA from bacteriophage λ gave the terminal sequences pA-G-G-T-C-G and pG-G-G-C-G. Identical 5′-terminal sequences were found with DNA from bacteriophage 424. ImagesPLATE 5PLATE 1PLATE 2PLATE 3PLATE 4 PMID:4352720

  19. The nucleotide sequence of the mouse immunoglobulin epsilon gene: comparison with the human epsilon gene sequence.

    PubMed Central

    Ishida, N; Ueda, S; Hayashida, H; Miyata, T; Honjo, T

    1982-01-01

    We have determined the nucleotide sequence of the immunoglobulin epsilon gene cloned from newborn mouse DNA. The epsilon gene sequence allows prediction of the amino acid sequence of the constant region of the epsilon chain and comparison of it with sequences of the human epsilon and other mouse immunoglobulin genes. The epsilon gene was shown to be under the weakest selection pressure at the protein level among the immunoglobulin genes although the divergence at the synonymous position is similar. Our results suggest that the epsilon gene may be dispensable, which is in accord with the fact that IgE has only obscure roles in the immune defense system but has an undesirable role as a mediator of hypersensitivity. The sequence data suggest that the human and murine epsilon genes were derived from different ancestors duplicated a long time ago. The amino acid sequence of the epsilon chain is more homologous to those of the gamma chains than the other mouse heavy chains. Two membrane exons, separated by an 80-base intron, were identified 1.7 kb 3' to the CH4 domain of the epsilon gene and shown to conserve a hydrophobic portion similar to those of other heavy chain genes. RNA blot hybridization showed that the epsilon membrane exons are transcribed into two species of mRNA in an IgE hybridoma. Images Fig. 4. PMID:6329728

  20. Completion of the nucleotide sequence of sunn-hemp mosaic virus: a tobamovirus pathogenic to legumes.

    PubMed

    Silver, S; Quan, S; Deom, C M

    1996-01-01

    Sunn-hemp mosaic virus (SHMV) is a member of the tobamovirus group of plant viruses. The nucleotide sequence of the 5'-untranslated region, the 129 kD protein gene, and a portion of the 186 kD protein gene of SHMV was determined. The 4,683 nucleotides (nts) reported here completes the sequence of the SHMV genome and complements previous work (Meshi, Ohno, and Okada, Nucleic Acids Res. 10, 6111-6117 [1982]; Mol. Gen. Genet. 184, 20-25 [1981]) to provide the first complete nucleotide sequence for a tobamovirus that is pathogenic to leguminous plants. PMID:8938983

  1. Complete nucleotide sequence of the genomic RNA of tobacco mosaic virus strain Cg.

    PubMed

    Yamanaka, T; Komatani, H; Meshi, T; Naito, S; Ishikawa, M; Ohno, T

    1998-01-01

    Tobacco mosaic virus (TMV)-Cg is a crucifer-infecting tobamovirus that was isolated from field-grown garlic. We determined the complete nucleotide sequence of the genomic RNA of TMV-Cg. The genomic RNA of TMV-Cg consists of 6303 nucleotides and encodes four large open reading frames, organized basically in the same way as that of other tobamoviruses. The nucleotide and deduced amino acid sequences are very similar to those of the other crucifer-infecting tobamoviruses that have been sequenced so far. PMID:9608662

  2. Extracting protein alignment models from the sequence database.

    PubMed Central

    Neuwald, A F; Liu, J S; Lipman, D J; Lawrence, C E

    1997-01-01

    Biologists often gain structural and functional insights into a protein sequence by constructing a multiple alignment model of the family. Here a program called Probe fully automates this process of model construction starting from a single sequence. Central to this program is a powerful new method to locate and align only those, often subtly, conserved patterns essential to the family as a whole. When applied to randomly chosen proteins, Probe found on average about four times as many relationships as a pairwise search and yielded many new discoveries. These include: an obscure subfamily of globins in the roundworm Caenorhabditis elegans ; two new superfamilies of metallohydrolases; a lipoyl/biotin swinging arm domain in bacterial membrane fusion proteins; and a DH domain in the yeast Bud3 and Fus2 proteins. By identifying distant relationships and merging families into superfamilies in this way, this analysis further confirms the notion that proteins evolved from relatively few ancient sequences. Moreover, this method automatically generates models of these ancient conserved regions for rapid and sensitive screening of sequences. PMID:9108146

  3. Nucleotide sequence of the genes encoding the canine herpesvirus gB, gC and gD homologues.

    PubMed

    Limbach, K J; Limbach, M P; Conte, D; Paoletti, E

    1994-08-01

    The nucleotide sequence of the genes encoding the canine herpesvirus (CHV) gB, gC and gD homologues was determined. These genes are predicted to encode polypeptides of 879, 459 and 345 amino acids, respectively. Comparison of the predicted amino acid sequences of CHV gB, gC and gD with the homologous sequences from other herpesviruses indicates that CHV is an alphaherpesvirus, a conclusion that is consistent with the previous classification of this virus according to biological properties. Alignment of the homologous gB, gC and gD amino acid sequences indicates that most of the cysteine residues are conserved, suggesting that these glycoproteins possess similar tertiary structures. The nucleotide sequence of the open reading frame downstream from the CHV gC gene was also determined. The predicted amino acid sequence of this putative polypeptide appears to be homologous to a family of proteins encoded downstream from the gC gene in most, although not all, alphaherpesviruses. PMID:7545942

  4. Matrix genes of measles virus and canine distemper virus: cloning, nucleotide sequences, and deduced amino acid sequences.

    PubMed Central

    Bellini, W J; Englund, G; Richardson, C D; Rozenblatt, S; Lazzarini, R A

    1986-01-01

    The nucleotide sequences encoding the matrix (M) proteins of measles virus (MV) and canine distemper virus (CDV) were determined from cDNA clones containing these genes in their entirety. In both cases, single open reading frames specifying basic proteins of 335 amino acid residues were predicted from the nucleotide sequences. Both viral messages were composed of approximately 1,450 nucleotides and contained 400 nucleotides of presumptive noncoding sequences at their respective 3' ends. MV and CDV M-protein-coding regions were 67% homologous at the nucleotide level and 76% homologous at the amino acid level. Only chance homology was observed in the 400-nucleotide trailer sequences. Comparisons of the M protein sequences of MV and CDV with the sequence reported for Sendai virus (B. M. Blumberg, K. Rose, M. G. Simona, L. Roux, C. Giorgi, and D. Kolakofsky, J. Virol. 52:656-663; Y. Hidaka, T. Kanda, K. Iwasaki, A. Nomoto, T. Shioda, and H. Shibuta, Nucleic Acids Res. 12:7965-7973) indicated the greatest homology among these M proteins in the carboxyterminal third of the molecule. Secondary-structure analyses of this shared region indicated a structurally conserved, hydrophobic sequence which possibly interacted with the lipid bilayer. Images PMID:3754588

  5. The complete nucleotide sequence and genome organization of Red clover vein mosaic virus (genus Carlavirus)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Red clover vein mosaic virus (RCVMV) is a serious pathogen of legume crops including pea, chickpea and lentil. The complete nucleotide sequence was generated from an isolate obtained from chickpea in Washington State. The complete genome of RCVMV consists of 8605 nucleotides excluding the poly(A) ...

  6. Genome-wide synteny through highly sensitive sequence alignment: Satsuma

    PubMed Central

    Grabherr, Manfred G.; Russell, Pamela; Meyer, Miriah; Mauceli, Evan; Alföldi, Jessica; Di Palma, Federica; Lindblad-Toh, Kerstin

    2010-01-01

    Motivation: Comparative genomics heavily relies on alignments of large and often complex DNA sequences. From an engineering perspective, the problem here is to provide maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accommodate the billions of base pairs of vertebrate genomes). Results: Satsuma addresses all three issues through novel strategies: (i) cross-correlation, implemented via fast Fourier transform; (ii) a match scoring scheme that eliminates almost all false hits; and (iii) an asynchronous ‘battleship’-like search that allows for aligning two entire fish genomes (470 and 217 Mb) in 120 CPU hours using 15 processors on a single machine. Availability: Satsuma is part of the Spines software package, implemented in C++ on Linux. The latest version of Spines can be freely downloaded under the LGPL license from http://www.broadinstitute.org/science/programs/genome-biology/spines/ Contact: grabherr@broadinstitute.org PMID:20208069

  7. Fractal MapReduce decomposition of sequence alignment

    PubMed Central

    2012-01-01

    Background The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required. Results In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique, that has been found to provide a "alignment-free" solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units: with no resort to dynamic programming. Conclusions The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing. Availability Public distribution of accompanying software library with open source and version control at http://usm.github.com. Also available as a webApp through Google Chrome's WebStore http://chrome.google.com/webstore: search with "usm". PMID:22551205

  8. Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment

    PubMed Central

    Kam, Alfred; Kwak, Daniel; Leung, Clarence; Wu, Chu; Zarour, Eleyine; Sarmenta, Luis; Blanchette, Mathieu; Waldispühl, Jérôme

    2012-01-01

    Background Comparative genomics, or the study of the relationships of genome structure and function across different species, offers a powerful tool for studying evolution, annotating genomes, and understanding the causes of various genetic disorders. However, aligning multiple sequences of DNA, an essential intermediate step for most types of analyses, is a difficult computational task. In parallel, citizen science, an approach that takes advantage of the fact that the human brain is exquisitely tuned to solving specific types of problems, is becoming increasingly popular. There, instances of hard computational problems are dispatched to a crowd of non-expert human game players and solutions are sent back to a central server. Methodology/Principal Findings We introduce Phylo, a human-based computing framework applying “crowd sourcing” techniques to solve the Multiple Sequence Alignment (MSA) problem. The key idea of Phylo is to convert the MSA problem into a casual game that can be played by ordinary web users with a minimal prior knowledge of the biological context. We applied this strategy to improve the alignment of the promoters of disease-related genes from up to 44 vertebrate species. Since the launch in November 2010, we received more than 350,000 solutions submitted from more than 12,000 registered users. Our results show that solutions submitted contributed to improving the accuracy of up to 70% of the alignment blocks considered. Conclusions/Significance We demonstrate that, combined with classical algorithms, crowd computing techniques can be successfully used to help improving the accuracy of MSA. More importantly, we show that an NP-hard computational problem can be embedded in casual game that can be easily played by people without significant scientific training. This suggests that citizen science approaches can be used to exploit the billions of “human-brain peta-flops” of computation that are spent every day playing games. Phylo is

  9. Choice of Reference Sequence and Assembler for Alignment of Listeria monocytogenes Short-Read Sequence Data Greatly Influences Rates of Error in SNP Analyses

    PubMed Central

    Pightling, Arthur W.; Petronella, Nicholas; Pagotto, Franco

    2014-01-01

    The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should

  10. Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.

    PubMed

    Pightling, Arthur W; Petronella, Nicholas; Pagotto, Franco

    2014-01-01

    The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should

  11. Molecular cloning and sequencing of a novel human P2 nucleotide receptor.

    PubMed

    Southey, M C; Hammet, F; Hutchins, A M; Paidhungat, M; Somers, G R; Venter, D J

    1996-11-11

    A novel human P2 nucleotide receptor has been cloned from a T-cell cDNA library. The predicted amino acid sequence shows characteristics of a G-protein-coupled receptor, and shares 88% homology with a recently characterised rat P2 nucleotide receptor sequence. Distinctive features include an extremely short cytoplasmic tail with only one putative protein kinase C phosphorylation site. Northern blot analysis revealed a 1.9 kb transcript expressed in the placenta. PMID:8950181

  12. Prokaryotic Phylogeny Based on Complete Genomes Without Sequence Alignment

    NASA Astrophysics Data System (ADS)

    Hao, Bailin; Qi, Ji; Wang, Bin

    2003-04-01

    This is a brief review of a series of on-going work on bacterial phylogeny. We have proposed a new method to infer relatedness of prokaryotes from their complete genome data without using sequence alignment. It has led to results comparable with the bacteriologists' systematics as reflected in the latest 2001 edition of the Bergey's Manual of Systematic Bacteriology1. In what follows we only touch on the mathematical aspects of the method. The biological implications of our results will be published elsewhere.

  13. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega

    PubMed Central

    Sievers, Fabian; Wilm, Andreas; Dineen, David; Gibson, Toby J; Karplus, Kevin; Li, Weizhong; Lopez, Rodrigo; McWilliam, Hamish; Remmert, Michael; Söding, Johannes; Thompson, Julie D; Higgins, Desmond G

    2011-01-01

    Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam. PMID:21988835

  14. Finding the right coverage: the impact of coverage and sequence quality on single nucleotide polymorphism genotyping error rates.

    PubMed

    Fountain, Emily D; Pauli, Jonathan N; Reid, Brendan N; Palsbøll, Per J; Peery, M Zachariah

    2016-07-01

    Restriction-enzyme-based sequencing methods enable the genotyping of thousands of single nucleotide polymorphism (SNP) loci in nonmodel organisms. However, in contrast to traditional genetic markers, genotyping error rates in SNPs derived from restriction-enzyme-based methods remain largely unknown. Here, we estimated genotyping error rates in SNPs genotyped with double digest RAD sequencing from Mendelian incompatibilities in known mother-offspring dyads of Hoffman's two-toed sloth (Choloepus hoffmanni) across a range of coverage and sequence quality criteria, for both reference-aligned and de novo-assembled data sets. Genotyping error rates were more sensitive to coverage than sequence quality and low coverage yielded high error rates, particularly in de novo-assembled data sets. For example, coverage ≥5 yielded median genotyping error rates of ≥0.03 and ≥0.11 in reference-aligned and de novo-assembled data sets, respectively. Genotyping error rates declined to ≤0.01 in reference-aligned data sets with a coverage ≥30, but remained ≥0.04 in the de novo-assembled data sets. We observed approximately 10- and 13-fold declines in the number of loci sampled in the reference-aligned and de novo-assembled data sets when coverage was increased from ≥5 to ≥30 at quality score ≥30, respectively. Finally, we assessed the effects of genotyping coverage on a common population genetic application, parentage assignments, and showed that the proportion of incorrectly assigned maternities was relatively high at low coverage. Overall, our results suggest that the trade-off between sample size and genotyping error rates be considered prior to building sequencing libraries, reporting genotyping error rates become standard practice, and that effects of genotyping errors on inference be evaluated in restriction-enzyme-based SNP studies. PMID:26946083

  15. Complete nucleotide sequence of a new isolate of passion fruit woodiness virus from Western Australia.

    PubMed

    Fukumoto, Tomohiro; Nakamura, Masayuki; Wylie, Stephen J; Chiaki, Yuya; Iwai, Hisashi

    2013-08-01

    We determined the complete genome sequence of the passion fruit woodiness virus Gld-1 isolate (PWV-Gld-1) from Australia and compared it with that of PWV-MU-2, another Australian isolate of PWV. The genomes shared high sequence identity in both the complete nucleotide sequence and the ORF amino acid sequence. All of the cleavage sites of each protein were identical to those of MU-2, and the sequence identity for the individual proteins ranged from 97.2 % to 100.0 %. However, the 5' untranslated region (5'UTR) of the Gld-1 isolate shared only 46.8 % sequence identity with that of PWV-MU-2 and was 177 nucleotides shorter. Re-sequencing of the 5'UTR of MU-2 revealed that the 5' end of the original sequence includes an artifact generated by deep sequencing. PMID:23508550

  16. Implied alignment: a synapomorphy-based multiple-sequence alignment method and its use in cladogram search

    NASA Technical Reports Server (NTRS)

    Wheeler, Ward C.

    2003-01-01

    A method to align sequence data based on parsimonious synapomorphy schemes generated by direct optimization (DO; earlier termed optimization alignment) is proposed. DO directly diagnoses sequence data on cladograms without an intervening multiple-alignment step, thereby creating topology-specific, dynamic homology statements. Hence, no multiple-alignment is required to generate cladograms. Unlike general and globally optimal multiple-alignment procedures, the method described here, implied alignment (IA), takes these dynamic homologies and traces them back through a single cladogram, linking the unaligned sequence positions in the terminal taxa via DO transformation series. These "lines of correspondence" link ancestor-descendent states and, when displayed as linearly arrayed columns without hypothetical ancestors, are largely indistinguishable from standard multiple alignment. Since this method is based on synapomorphy, the treatment of certain classes of insertion-deletion (indel) events may be different from that of other alignment procedures. As with all alignment methods, results are dependent on parameter assumptions such as indel cost and transversion:transition ratios. Such an IA could be used as a basis for phylogenetic search, but this would be questionable since the homologies derived from the implied alignment depend on its natal cladogram and any variance, between DO and IA + Search, due to heuristic approach. The utility of this procedure in heuristic cladogram searches using DO and the improvement of heuristic cladogram cost calculations are discussed. c2003 The Willi Hennig Society. Published by Elsevier Science (USA). All rights reserved.

  17. Implied alignment: a synapomorphy-based multiple-sequence alignment method and its use in cladogram search.

    PubMed

    Wheeler, Ward C

    2003-06-01

    A method to align sequence data based on parsimonious synapomorphy schemes generated by direct optimization (DO; earlier termed optimization alignment) is proposed. DO directly diagnoses sequence data on cladograms without an intervening multiple-alignment step, thereby creating topology-specific, dynamic homology statements. Hence, no multiple-alignment is required to generate cladograms. Unlike general and globally optimal multiple-alignment procedures, the method described here, implied alignment (IA), takes these dynamic homologies and traces them back through a single cladogram, linking the unaligned sequence positions in the terminal taxa via DO transformation series. These "lines of correspondence" link ancestor-descendent states and, when displayed as linearly arrayed columns without hypothetical ancestors, are largely indistinguishable from standard multiple alignment. Since this method is based on synapomorphy, the treatment of certain classes of insertion-deletion (indel) events may be different from that of other alignment procedures. As with all alignment methods, results are dependent on parameter assumptions such as indel cost and transversion:transition ratios. Such an IA could be used as a basis for phylogenetic search, but this would be questionable since the homologies derived from the implied alignment depend on its natal cladogram and any variance, between DO and IA + Search, due to heuristic approach. The utility of this procedure in heuristic cladogram searches using DO and the improvement of heuristic cladogram cost calculations are discussed. PMID:12901383

  18. Evaluation of intra- and interspecific divergence of satellite DNA sequences by nucleotide frequency calculation and pairwise sequence comparison

    PubMed Central

    2003-01-01

    Satellite DNA sequences are known to be highly variable and to have been subjected to concerted evolution that homogenizes member sequences within species. We have analyzed the mode of evolution of satellite DNA sequences in four fishes from the genus Diplodus by calculating the nucleotide frequency of the sequence array and the phylogenetic distances between member sequences. Calculation of nucleotide frequency and pairwise sequence comparison enabled us to characterize the divergence among member sequences in this satellite DNA family. The results suggest that the evolutionary rate of satellite DNA in D. bellottii is about two-fold greater than the average of the other three fishes, and that the sequence homogenization event occurred in D. puntazzo more recently than in the others. The procedures described here are effective to characterize mode of evolution of satellite DNA. PMID:12734555

  19. Evaluation of intra- and interspecific divergence of satellite DNA sequences by nucleotide frequency calculation and pairwise sequence comparison.

    PubMed

    Kato, Mikio

    2003-01-01

    Satellite DNA sequences are known to be highly variable and to have been subjected to concerted evolution that homogenizes member sequences within species. We have analyzed the mode of evolution of satellite DNA sequences in four fishes from the genus Diplodus by calculating the nucleotide frequency of the sequence array and the phylogenetic distances between member sequences. Calculation of nucleotide frequency and pairwise sequence comparison enabled us to characterize the divergence among member sequences in this satellite DNA family. The results suggest that the evolutionary rate of satellite DNA in D. bellottii is about two-fold greater than the average of the other three fishes, and that the sequence homogenization event occurred in D. puntazzo more recently than in the others. The procedures described here are effective to characterize mode of evolution of satellite DNA. PMID:12734555

  20. Bio-samtools: Ruby bindings for SAMtools, a library for accessing BAM files containing high-throughput sequence alignments

    PubMed Central

    2012-01-01

    Background The SAMtools utilities comprise a very useful and widely used suite of software for manipulating files and alignments in the SAM and BAM format, used in a wide range of genetic analyses. The SAMtools utilities are implemented in C and provide an API for programmatic access, to help make this functionality available to programmers wishing to develop in the high level Ruby language we have developed bio-samtools, a Ruby binding to the SAMtools library. Results The utility of SAMtools is encapsulated in 3 main classes, Bio::DB::Sam, representing the alignment files and providing access to the data in them, Bio::DB::Alignment, representing the individual read alignments inside the files and Bio::DB::Pileup, representing the summarised nucleotides of reads over a single point in the nucleotide sequence to which the reads are aligned. Conclusions Bio-samtools is a flexible and easy to use interface that programmers of many levels of experience can use to access information in the popular and common SAM/BAM format. PMID:22640879

  1. Nucleotide sequence of Neurospora crassa cytoplasmic initiator tRNA.

    PubMed Central

    Gillum, A M; Hecker, L I; Silberklang, M; Schwartzbach, S D; RajBhandary, U L; Barnett, W E

    1977-01-01

    Initiator methionine tRNA from the cytoplasm of Neurospora crassa has been purified and sequenced. The sequence is: pAGCUGCAUm1GGCGCAGCGGAAGCGCM22GCY*GGGCUCAUt6AACCCGGAGm7GU (or D) - CACUCGAUCGm1AAACGAG*UUGCAGCUACCAOH. Similar to initiator tRNAs from the cytoplasm of other eukaryotes, this tRNA also contains the sequence -AUCG- instead of the usual -TphiCG (or A)- found in loop IV of other tRNAs. The sequence of the N. crassa cytoplasmic initiator tRNA is quite different from that of the corresponding mitochondrial initiator tRNA. Comparison of the sequence of N. crassa cytoplasmic initiator tRNA to those of yeast, wheat germ and vertebrate cytoplasmic initiator tRNA indicates that the sequences of the two fungal tRNAs are no more similar to each other than they are to those of other initiator tRNAs. Images PMID:146192

  2. Nucleotide sequence of the gene for human prothrombin

    SciTech Connect

    Degen, S.J.F.; Davie, E.W.

    1987-09-22

    A human genomic DNA library was screened for the gene coding for human prothrombin with a cDNA coding for the human protein. Eighty-one positive lambda phage were identified, and three were chosen for further characterization. These three phage hybridized with 5' and/or 3' probes prepared from the prothrombin cDNA. The complete DNA sequence of 21 kilobases of the human prothrombin gene was determined and included a 4.9-kilobase region that was previously sequenced. The gene for human prothrombin contains 14 exons separated by 13 intervening sequences. The exons range in size from 25 to 315 base pairs, while the introns range from 84 to 9447 base pairs. Ninety percent of the gene is composed of intervening sequence. All the intron splice junctions are consistent with sequences found in other eukaryotic genes, except for the presence of GC rather than GT on the 5' end of intervening sequence L. Thirty copies of Alu repetitive DNA and two copies of partial KpnI repeats were identified in clusters within several of the intervening sequences, and these repeats represent 40% of the DNA sequence of the gene. The size, distribution, and sequence homology of the introns within the gene were the compared to those of the genes for the other vitamin K dependent proteins and several other serine proteases.

  3. Nucleotide Sequencing and SNP Detection of Toll-Like Receptor-4 Gene in Murrah Buffalo (Bubalus bubalis)

    PubMed Central

    Mitra, M.; Taraphder, S.; Sonawane, G. S.; Verma, A.

    2012-01-01

    Toll-like receptor-4 (TLR-4) has an important pattern recognition receptor that recognizes endotoxins associated with gram negative bacterial infections. The present investigation was carried out to study nucleotide sequencing and SNP detection by PCR-RFLP analysis of the TLR-4 gene in Murrah buffalo. Genomic DNA was isolated from 102 lactating Murrah buffalo from NDRI herd. The amplified PCR fragments of TLR-4 comprised of exon 1, exon 2, exon 3.1, and exon 3.2 were examined to RFLP. PCR products were obtained with sizes of 165, 300, 478, and 409 bp. TLR-4 gene of investigated Murrah buffaloes was highly polymorphic with AA, AB, and BB genotypes as revealed by PCR-RFLP analysis using Dra I, Hae III, and Hinf I REs. Nucleotide sequencing of the amplified fragment of TLR-4 gene of Murrah buffalo was done. Twelve SNPs were identified. Six SNPs were nonsynonymous resulting in change in amino acids. Murrah is an indigenous Buffalo breed and the presence of the nonsynonymous SNP is indicative of its unique genomic architecture. Sequence alignment and homology across species using BLAST analysis revealed 97%, 97%, 99%, 98%, and 80% sequence homology with Bos taurus, Bos indicus, Ovis aries, Capra hircus, and Homo sapiens, respectively.

  4. MSA-PAD: DNA multiple sequence alignment framework based on PFAM accessed domain information.

    PubMed

    Balech, Bachir; Vicario, Saverio; Donvito, Giacinto; Monaco, Alfonso; Notarangelo, Pasquale; Pesole, Graziano

    2015-08-01

    Here we present the MSA-PAD application, a DNA multiple sequence alignment framework that uses PFAM protein domain information to align DNA sequences encoding either single or multiple protein domains. MSA-PAD has two alignment options: gene and genome mode. PMID:25819080

  5. Nucleotide sequence of 3' untranslated portion of human alpha globin mRNA.

    PubMed Central

    Wilson, J T; deRiel, J K; Forget, B G; Marotta, C A; Weissman, S M

    1977-01-01

    We have determined the nucleotide sequence of 75 nucleotides of the 3'-untranslated portion of normal human alpha globin mRNA which corresponds to the elongated amino acid sequence of the chain termination mutant Hb Constant Spring. This was accomplished by sequence analysis of cDNA fragments obtained by restriction endonuclease or T4 endonuclease IV cleavage of human globin cDNA synthesized from globin mRNA by use of viral reverse transcriptase. Analysis of cRNA synthesized from cDNA by use of RNA polymerase provided additional confirmatory sequence information. Possible polymorphism has been identified at one site of the sequence. Our sequence overlaps with, and extends the sequence of 43 nucleotides determined by Proudfood and coworkers for the very 3'-terminal portion of human alpha globin mRNA. The complete 3'-untranslated sequence of human alpha globin mRNA (112 nucleotides including termination codon) shows little homology to that of the human or rabbit beta globin mRNAs except for the presence of the hexanucleotide sequence AAUAAA which is found in most eukaryotic mRNAs near the 3'-terminal poly (A). Images PMID:909779

  6. Nucleotide sequences of 5S ribosomal RNA from four oomycete and chytrid water molds.

    PubMed

    Walker, W F; Doolittle, W F

    1982-09-25

    The nucleotide sequences of the 5S rRNAs of the oomycete water molds Saprolegnia ferax and Pythium hydnosporum and of the chytrid water molds Blastocladiella simplex and Phlyctochytrium irregulare were determined by chemical and enzymatic partial degradation of 3' and 5' end-labelled molecules, followed by gel sequence analysis. The two oomycete sequences differed in 24 positions and the two chytrid sequences differed in 27 positions. These pairs differed in a mean of 44 positions. The chytrid sequences clearly most resemble the sequence from the zygomycete Phycomyces, while the oomycete sequences appear to be allied with those from protozoa and slime molds. PMID:6890670

  7. CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases.

    PubMed

    Grillo, G; Attimonelli, M; Liuni, S; Pesole, G

    1996-02-01

    A key concept in comparing sequence collections is the issue of redundancy. The production of sequence collections free from redundancy is undoubtedly very useful, both in performing statistical analyses and accelerating extensive database searching on nucleotide sequences. Indeed, publicly available databases contain multiple entries of identical or almost identical sequences. Performing statistical analysis on such biased data makes the risk of assigning high significance to non-significant patterns very high. In order to carry out unbiased statistical analysis as well as more efficient database searching it is thus necessary to analyse sequence data that have been purged of redundancy. Given that a unambiguous definition of redundancy is impracticable for biological sequence data, in the present program a quantitative description of redundancy will be used, based on the measure of sequence similarity. A sequence is considered redundant if it shows a degree of similarity and overlapping with a longer sequence in the database greater than a threshold fixed by the user. In this paper we present a new algorithm based on an "approximate string matching' procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database and to generate automatically nucleotide sequence collections free from redundancies. PMID:8670613

  8. Nucleotide sequences of 5S rRNAs from four jellyfishes.

    PubMed

    Hori, H; Ohama, T; Kumazaki, T; Osawa, S

    1982-11-25

    The nucleotide sequences of 5S rRNAs from four jellyfishes, Spirocodon saltatrix, Nemopsis dofleini, Aurelia aurita and Chrysaora quinquecirrha have been determined. The sequences are highly similar to each other. A fairly high similarity was also found between these jellyfishes and a sea anemone, Anthopleura japonica. PMID:6130512

  9. PROMALS3D web server for accurate multiple protein sequence and structure alignments.

    PubMed

    Pei, Jimin; Tang, Ming; Grishin, Nick V

    2008-07-01

    Multiple sequence alignments are essential in computational sequence and structural analysis, with applications in homology detection, structure modeling, function prediction and phylogenetic analysis. We report PROMALS3D web server for constructing alignments for multiple protein sequences and/or structures using information from available 3D structures, database homologs and predicted secondary structures. PROMALS3D shows higher alignment accuracy than a number of other advanced methods. Input of PROMALS3D web server can be FASTA format protein sequences, PDB format protein structures and/or user-defined alignment constraints. The output page provides alignments with several formats, including a colored alignment augmented with useful information about sequence grouping, predicted secondary structures and consensus sequences. Intermediate results of sequence and structural database searches are also available. The PROMALS3D web server is available at: http://prodata.swmed.edu/promals3d/. PMID:18503087

  10. Multiple sequence alignment based on combining genetic algorithm with chaotic sequences.

    PubMed

    Gao, C; Wang, B; Zhou, C J; Zhang, Q

    2016-01-01

    In bioinformatics, sequence alignment is one of the most common problems. Multiple sequence alignment is an NP (nondeterministic polynomial time) problem, which requires further study and exploration. The chaos optimization algorithm is a type of chaos theory, and a procedure for combining the genetic algorithm (GA), which uses ergodicity, and inherent randomness of chaotic iteration. It is an efficient method to solve the basic premature phenomenon of the GA. Applying the Logistic map to the GA and using chaotic sequences to carry out the chaotic perturbation can improve the convergence of the basic GA. In addition, the random tournament selection and optimal preservation strategy are used in the GA. Experimental evidence indicates good results for this process. PMID:27420977

  11. The nucleotide sequence of the tnpA gene completes the sequence of the Pseudomonas transposon Tn501.

    PubMed Central

    Brown, N L; Winnie, J N; Fritzinger, D; Pridmore, R D

    1985-01-01

    The nucleotide sequence of the gene (tnpA) which codes for the transposase of transposon Tn501 has been determined. It contains an open reading frame for a polypeptide of Mr = 111,500, which terminates within the inverted repeat sequence of the transposon. The reading frame would be transcribed in the same direction as the mercury-resistance genes and the tnpR gene. The amino acid sequence predicted from this reading frame shows 32% identity with that of the transposase of the related transposon Tn3. The C-terminal regions of these two polypeptides show slightly greater homology than the N-terminal regions when conservative amino acid substitutions are considered. With this sequence determination, the nucleotide sequence of Tn501 is fully defined. The main features of the sequence are briefly presented. PMID:2994007

  12. Cardiac motion estimation by joint alignment of tagged MRI sequences.

    PubMed

    Oubel, E; De Craene, M; Hero, A O; Pourmorteza, A; Huguet, M; Avegliano, G; Bijnens, B H; Frangi, A F

    2012-01-01

    Image registration has been proposed as an automatic method for recovering cardiac displacement fields from tagged Magnetic Resonance Imaging (tMRI) sequences. Initially performed as a set of pairwise registrations, these techniques have evolved to the use of 3D+t deformation models, requiring metrics of joint image alignment (JA). However, only linear combinations of cost functions defined with respect to the first frame have been used. In this paper, we have applied k-Nearest Neighbors Graphs (kNNG) estimators of the α-entropy (H(α)) to measure the joint similarity between frames, and to combine the information provided by different cardiac views in an unified metric. Experiments performed on six subjects showed a significantly higher accuracy (p<0.05) with respect to a standard pairwise alignment (PA) approach in terms of mean positional error and variance with respect to manually placed landmarks. The developed method was used to study strains in patients with myocardial infarction, showing a consistency between strain, infarction location, and coronary occlusion. This paper also presents an interesting clinical application of graph-based metric estimators, showing their value for solving practical problems found in medical imaging. PMID:22000567

  13. Diverse nucleotide compositions and sequence fluctuation in Rubisco protein genes

    NASA Astrophysics Data System (ADS)

    Holden, Todd; Dehipawala, S.; Cheung, E.; Bienaime, R.; Ye, J.; Tremberger, G., Jr.; Schneider, P.; Lieberman, D.; Cheung, T.

    2011-10-01

    The Rubisco protein-enzyme is arguably the most abundance protein on Earth. The biology dogma of transcription and translation necessitates the study of the Rubisco genes and Rubisco-like genes in various species. Stronger correlation of fractal dimension of the atomic number fluctuation along a DNA sequence with Shannon entropy has been observed in the studied Rubisco-like gene sequences, suggesting a more diverse evolutionary pressure and constraints in the Rubisco sequences. The strategy of using metal for structural stabilization appears to be an ancient mechanism, with data from the porphobilinogen deaminase gene in Capsaspora owczarzaki and Monosiga brevicollis. Using the chi-square distance probability, our analysis supports the conjecture that the more ancient Rubisco-like sequence in Microcystis aeruginosa would have experienced very different evolutionary pressure and bio-chemical constraint as compared to Bordetella bronchiseptica, the two microbes occupying either end of the correlation graph. Our exploratory study would indicate that high fractal dimension Rubisco sequence would support high carbon dioxide rate via the Michaelis- Menten coefficient; with implication for the control of the whooping cough pathogen Bordetella bronchiseptica, a microbe containing a high fractal dimension Rubisco-like sequence (2.07). Using the internal comparison of chi-square distance probability for 16S rRNA (~ E-22) versus radiation repair Rec-A gene (~ E-05) in high GC content Deinococcus radiodurans, our analysis supports the conjecture that high GC content microbes containing Rubisco-like sequence are likely to include an extra-terrestrial origin, relative to Deinococcus radiodurans. Similar photosynthesis process that could utilize host star radiation would not compete with radiation resistant process from the biology dogma perspective in environments such as Mars and exoplanets.

  14. Complete nucleotide sequence of Alfalfa mosaic virus isolated from alfalfa (Medicago sativa L.) in Argentina.

    PubMed

    Trucco, Verónica; de Breuil, Soledad; Bejerman, Nicolás; Lenardon, Sergio; Giolitti, Fabián

    2014-06-01

    The complete nucleotide sequence of an Alfalfa mosaic virus (AMV) isolate infecting alfalfa (Medicago sativa L.) in Argentina, AMV-Arg, was determined. The virus genome has the typical organization described for AMV, and comprises 3,643, 2,593, and 2,038 nucleotides for RNA1, 2 and 3, respectively. The whole genome sequence and each encoding region were compared with those of other four isolates that have been completely sequenced from China, Italy, Spain and USA. The nucleotide identity percentages ranged from 95.9 to 99.1 % for the three RNAs and from 93.7 to 99 % for the protein 1 (P1), protein 2 (P2), movement protein and coat protein (CP) encoding regions, whereas the amino acid identity percentages of these proteins ranged from 93.4 to 99.5 %, the lowest value corresponding to P2. CP sequences of AMV-Arg were compared with those of other 25 available isolates, and the phylogenetic analysis based on the CP gene was carried out. The highest percentage of nucleotide sequence identity of the CP gene was 98.3 % with a Chinese isolate and 98.6 % at the amino acid level with four isolates, two from Italy, one from Brazil and the remaining one from China. The phylogenetic analysis showed that AMV-Arg is closely related to subgroup I of AMV isolates. To our knowledge, this is the first report of a complete nucleotide sequence of AMV from South America and the first worldwide report of complete nucleotide sequence of AMV isolated from alfalfa as natural host. PMID:24510307

  15. Antibody-specific model of amino acid substitution for immunological inferences from alignments of antibody sequences.

    PubMed

    Mirsky, Alexander; Kazandjian, Linda; Anisimova, Maria

    2015-03-01

    Antibodies are glycoproteins produced by the immune system as a dynamically adaptive line of defense against invading pathogens. Very elegant and specific mutational mechanisms allow B lymphocytes to produce a large and diversified repertoire of antibodies, which is modified and enhanced throughout all adulthood. One of these mechanisms is somatic hypermutation, which stochastically mutates nucleotides in the antibody genes, forming new sequences with different properties and, eventually, higher affinity and selectivity to the pathogenic target. As somatic hypermutation involves fast mutation of antibody sequences, this process can be described using a Markov substitution model of molecular evolution. Here, using large sets of antibody sequences from mice and humans, we infer an empirical amino acid substitution model AB, which is specific to antibody sequences. Compared with existing general amino acid models, we show that the AB model provides significantly better description for the somatic evolution of mice and human antibody sequences, as demonstrated on large next generation sequencing (NGS) antibody data. General amino acid models are reflective of conservation at the protein level due to functional constraints, with most frequent amino acids exchanges taking place between residues with the same or similar physicochemical properties. In contrast, within the variable part of antibody sequences we observed an elevated frequency of exchanges between amino acids with distinct physicochemical properties. This is indicative of a sui generis mutational mechanism, specific to antibody somatic hypermutation. We illustrate this property of antibody sequences by a comparative analysis of the network modularity implied by the AB model and general amino acid substitution models. We recommend using the new model for computational studies of antibody sequence maturation, including inference of alignments and phylogenetic trees describing antibody somatic hypermutation in

  16. Nucleotide sequence of a human tRNA gene heterocluster

    SciTech Connect

    Chang, Y.N.; Pirtle, I.L.; Pirtle, R.M.

    1986-05-01

    Leucine tRNA from bovine liver was used as a hybridization probe to screen a human gene library harbored in Charon-4A of bacteriophage lambda. The human DNA inserts from plaque-pure clones were characterized by restriction endonuclease mapping and Southern hybridization techniques, using both (3'-/sup 32/P)-labeled bovine liver leucine tRNA and total tRNA as hybridization probes. An 8-kb Hind III fragment of one of these ..gamma..-clones was subcloned into the Hind III site of pBR322. Subsequent fine restriction mapping and DNA sequence analysis of this plasmid DNA indicated the presence of four tRNA genes within the 8-kb DNA fragment. A leucine tRNA gene with an anticodon of AAG and a proline tRNA gene with an anticodon of AGG are in a 1.6-kb subfragment. A threonine tRNA gene with an anticodon of UGU and an as yet unidentified tRNA gene are located in a 1.1-kb subfragment. These two different subfragments are separated by 2.8 kb. The coding regions of the three sequenced genes contain characteristic internal split promoter sequences and do not have intervening sequences. The 3'-flanking region of these three genes have typical RNA polymerase III termination sites of at least four consecutive T residues.

  17. Methods for making nucleotide probes for sequencing and synthesis

    DOEpatents

    Church, George M; Zhang, Kun; Chou, Joseph

    2014-07-08

    Compositions and methods for making a plurality of probes for analyzing a plurality of nucleic acid samples are provided. Compositions and methods for analyzing a plurality of nucleic acid samples to obtain sequence information in each nucleic acid sample are also provided.

  18. Fast single-pass alignment and variant calling using sequencing data

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Sequencing research requires efficient computation. Few programs use already known information about DNA variants when aligning sequence data to the reference map. New program findmap.f90 reads the previous variant list before aligning sequence, calling variant alleles, and summing the allele counts...

  19. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and 3-dimensional structural information

    PubMed Central

    Pei, Jimin; Grishin, Nick V.

    2015-01-01

    SUMMARY Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of 3-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D web server and package are available at http://prodata.swmed.edu/PROMALS3D. PMID:24170408

  20. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and three-dimensional structural information.

    PubMed

    Pei, Jimin; Grishin, Nick V

    2014-01-01

    Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of three-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D Web server and package are available at http://prodata.swmed.edu/PROMALS3D. PMID:24170408

  1. Nucleotide sequence of the tobacco (Nicotiana tabacum) anionic peroxidase gene

    SciTech Connect

    Diaz-De-Leon, F.; Klotz, K.L.; Lagrimini, L.M. )

    1993-03-01

    Peroxidases have been implicated in numerous physiological processes including lignification (Grisebach, 1981), wound-healing (Espelie et al., 1986), phenol oxidation (Lagrimini, 1991), pathogen defense (Ye et al., 1990), and the regulation of cell elongation through the formation of interchain covalent bonds between various cell wall polymers (Fry, 1986; Goldberg et al., 1986; Bradley et al., 1992). However, a complete description of peroxidase action in vivo is not available because of the vast number of potential substrates and the existence of multiple isoenzymes. The tobacco anionic peroxidase is one of the better-characterized isoenzymes. This enzyme has been shown to oxidize a number of significant plant secondary compounds in vitro including cinnamyl alcohols, phenolic acids, and indole-3-acetic acid (Maeder, 1980; Lagrimini, 1991). A cDNA encoding the enzyme has been obtained, and this enzyme was shown to be expressed at the highest levels in lignifying tissues (xylem and tracheary elements) and also in epidermal tissue (Lagrimini et al., 1987). It was shown at this time that there were four distinct copies of the anionic peroxidase gene in tobacco (Nicotiana tabacum). A tobacco genomic DNA library was constructed in the [lambda]-phase EMBL3, from which two unique peroxidase genes were sequenced. One of these clones, [lambda]POD1, was designated as a pseudogene when the exonic sequences were found to differ from the cDNA sequences by 1%, and several frame shifts in the coding sequences indicated a dysfunctional gene (the authors' unpublished results). The other clone, [lambda]POD3, described in this manuscript, was designated as the functional tobacco anionic peroxidase gene because of 100% homology with the cDNA. Significant structural elements include an AS-2 box indicated in shoot-specific expression (Lam and Chua, 1989), a TATA box, and two intervening sequences. 10 refs., 1 tab.

  2. Cloning and nucleotide sequence of the Lactobacillus casei lactate dehydrogenase gene.

    PubMed Central

    Kim, S F; Baek, S J; Pack, M Y

    1991-01-01

    An allosteric L-(+)-lactate dehydrogenase gene of Lactobacillus casei ATCC 393 was cloned in Escherichia coli, and the nucleotide sequence of the gene was determined. The gene was composed of an open reading frame of 981 bp, starting with a GTG codon and ending with a TAA codon. The sequences for the promoter and ribosome binding site were identified, and a sequence for a structure resembling a rho-independent transcription terminator was also found. Images PMID:1768113

  3. Nucleotide Sequence Diversity and Linkage Disequilibrium of Four Nuclear Loci in Foxtail Millet (Setaria italica)

    PubMed Central

    He, Shui-lian; Yang, Yang; Morrell, Peter L.; Yi, Ting-shuang

    2015-01-01

    Foxtail millet (Setaria italica (L.) Beauv) is one of the earliest domesticated grains, which has been cultivated in northern China by 8,700 years before present (YBP) and across Eurasia by 4,000 YBP. Owing to a small genome and diploid nature, foxtail millet is a tractable model crop for studying functional genomics of millets and bioenergy grasses. In this study, we examined nucleotide sequence diversity, geographic structure, and levels of linkage disequilibrium at four nuclear loci (ADH1, G3PDH, IGS1 and TPI1) in representative samples of 311 landrace accessions across its cultivated range. Higher levels of nucleotide sequence and haplotype diversity were observed in samples from China relative to other sampled regions. Genetic assignment analysis classified the accessions into seven clusters based on nucleotide sequence polymorphisms. Intralocus LD decayed rapidly to half the initial value within ~1.2 kb or less. PMID:26325578

  4. Nucleotide sequence heterogeneity of alpha satellite repetitive DNA: a survey of alphoid sequences from different human chromosomes.

    PubMed Central

    Waye, J S; Willard, H F

    1987-01-01

    The human alpha satellite DNA family is composed of diverse, tandemly reiterated monomer units of approximately 171 basepairs localized to the centromeric region of each chromosome. These sequences are organized in a highly chromosome-specific manner with many, if not all human chromosomes being characterized by individually distinct alphoid subsets. Here, we compare the nucleotide sequences of 153 monomer units, representing alphoid components of at least 12 different human chromosomes. Based on the analysis of sequence variation at each position within the 171 basepair monomer, we have derived a consensus sequence for the monomer unit of human alpha satellite DNA which we suggest may reflect the monomer sequence from which different chromosomal subsets have evolved. Sequence heterogeneity is evident at each position within the consensus monomer unit and there are no positions of strict nucleotide sequence conservation, although some regions are more variable than others. A substantial proportion of the overall sequence variation may be accounted for by nucleotide changes which are characteristic of monomer components of individual chromosomal subsets or groups of subsets which have a common evolutionary history. PMID:3658703

  5. A comparative genomics strategy for targeted discovery of single-nucleotide polymorphisms and conserved-noncoding sequences in orphan crops.

    PubMed

    Feltus, F A; Singh, H P; Lohithaswa, H C; Schulze, S R; Silva, T D; Paterson, A H

    2006-04-01

    Completed genome sequences provide templates for the design of genome analysis tools in orphan species lacking sequence information. To demonstrate this principle, we designed 384 PCR primer pairs to conserved exonic regions flanking introns, using Sorghum/Pennisetum expressed sequence tag alignments to the Oryza genome. Conserved-intron scanning primers (CISPs) amplified single-copy loci at 37% to 80% success rates in taxa that sample much of the approximately 50-million years of Poaceae divergence. While the conserved nature of exons fostered cross-taxon amplification, the lesser evolutionary constraints on introns enhanced single-nucleotide polymorphism detection. For example, in eight rice (Oryza sativa) genotypes, polymorphism averaged 12.1 per kb in introns but only 3.6 per kb in exons. Curiously, among 124 CISPs evaluated across Oryza, Sorghum, Pennisetum, Cynodon, Eragrostis, Zea, Triticum, and Hordeum, 23 (18.5%) seemed to be subject to rigid intron size constraints that were independent of per-nucleotide DNA sequence variation. Furthermore, we identified 487 conserved-noncoding sequence motifs in 129 CISP loci. A large CISP set (6,062 primer pairs, amplifying introns from 1,676 genes) designed using an automated pipeline showed generally higher abundance in recombinogenic than in nonrecombinogenic regions of the rice genome, thus providing relatively even distribution along genetic maps. CISPs are an effective means to explore poorly characterized genomes for both DNA polymorphism and noncoding sequence conservation on a genome-wide or candidate gene basis, and also provide anchor points for comparative genomics across a diverse range of species. PMID:16607031

  6. An Integrated System for DNA Sequencing by Synthesis Using Novel Nucleotide Analogues

    PubMed Central

    Guo, Jia; Yu, Lin; Turro, Nicholas J.; Ju, Jingyue

    2010-01-01

    Conspectus The Human Genome Project has concluded, but its successful completion has increased, rather than decreased, the need for high-throughput DNA sequencing technologies. The possibility of clinically screening a full genome for an individual's mutations offers tremendous benefits, both for pursuing personalized medicine as well as uncovering the genomic contributions to diseases. The Sanger sequencing method—although enormously productive for more than 30 years—requires an electrophoretic separation step that, unfortunately, remains a key technical obstacle for achieving economically acceptable full-genome results. Alternative sequencing approaches thus focus on innovations that can reduce costs. The DNA sequencing by synthesis (SBS) approach has shown great promise as a new sequencing platform, with particular progress reported recently. The general fluorescent SBS approach involves (i) incorporation of nucleotide analogs bearing fluorescent reporters, (ii) identification of the incorporated nucleotide by its fluorescent emissions, and (iii) cleavage of the fluorophore, along with the reinitiation of the polymerase reaction for continuing sequence determination. In this Account, we review the construction of a DNA-immobilized chip and the development of novel nucleotide reporters for the SBS sequencing platform. Click chemistry, with its high selectivity and coupling efficiency, was explored for surface immobilization of DNA. The first generation (G-1) modified nucleotides for SBS feature a small chemical moiety capping the 3′-OH and a fluorophore tethered to the base through a chemically cleavable linker; the design ensures that the nucleotide reporters are good substrates for the polymerase. The 3′-capping moiety and the fluorophore on the DNA extension products, generated by the incorporation of the G-1 modified nucleotides, are cleaved simultaneously to reinitiate the polymerase reaction. The sequence of a DNA template immobilized on a surface

  7. B-MIC: An Ultrafast Three-Level Parallel Sequence Aligner Using MIC.

    PubMed

    Cui, Yingbo; Liao, Xiangke; Zhu, Xiaoqian; Wang, Bingqiang; Peng, Shaoliang

    2016-03-01

    Sequence alignment is the central process for sequence analysis, where mapping raw sequencing data to reference genome. The large amount of data generated by NGS is far beyond the process capabilities of existing alignment tools. Consequently, sequence alignment becomes the bottleneck of sequence analysis. Intensive computing power is required to address this challenge. Intel recently announced the MIC coprocessor, which can provide massive computing power. The Tianhe-2 is the world's fastest supercomputer now equipped with three MIC coprocessors each compute node. A key feature of sequence alignment is that different reads are independent. Considering this property, we proposed a MIC-oriented three-level parallelization strategy to speed up BWA, a widely used sequence alignment tool, and developed our ultrafast parallel sequence aligner: B-MIC. B-MIC contains three levels of parallelization: firstly, parallelization of data IO and reads alignment by a three-stage parallel pipeline; secondly, parallelization enabled by MIC coprocessor technology; thirdly, inter-node parallelization implemented by MPI. In this paper, we demonstrate that B-MIC outperforms BWA by a combination of those techniques using Inspur NF5280M server and the Tianhe-2 supercomputer. To the best of our knowledge, B-MIC is the first sequence alignment tool to run on Intel MIC and it can achieve more than fivefold speedup over the original BWA while maintaining the alignment precision. PMID:26358141

  8. Score distributions of gapped multiple sequence alignments down to the low-probability tail.

    PubMed

    Fieth, Pascal; Hartmann, Alexander K

    2016-08-01

    Assessing the significance of alignment scores of optimally aligned DNA or amino acid sequences can be achieved via the knowledge of the score distribution of random sequences. But this requires obtaining the distribution in the biologically relevant high-scoring region, where the probabilities are exponentially small. For gapless local alignments of infinitely long sequences this distribution is known analytically to follow a Gumbel distribution. Distributions for gapped local alignments and global alignments of finite lengths can only be obtained numerically. To obtain result for the small-probability region, specific statistical mechanics-based rare-event algorithms can be applied. In previous studies, this was achieved for pairwise alignments. They showed that, contrary to results from previous simple sampling studies, strong deviations from the Gumbel distribution occur in case of finite sequence lengths. Here we extend the studies to multiple sequence alignments with gaps, which are much more relevant for practical applications in molecular biology. We study the distributions of scores over a large range of the support, reaching probabilities as small as 10^{-160}, for global and local (sum-of-pair scores) multiple alignments. We find that even after suitable rescaling, eliminating the sequence-length dependence, the distributions for multiple alignment differ from the pairwise alignment case. Furthermore, we also show that the previously discussed Gaussian correction to the Gumbel distribution needs to be refined, also for the case of pairwise alignments. PMID:27627266

  9. Complete nucleotide sequence of the human corticotropin-beta-lipotropin precursor gene.

    PubMed Central

    Takahashi, H; Hakamata, Y; Watanabe, Y; Kikuno, R; Miyata, T; Numa, S

    1983-01-01

    The nucleotide sequence of an 8658-base-pair human genomic DNA segment containing the entire corticotropin-beta-lipotropin precursor gene has been determined, and some sequence features of the gene and its flanking regions have been analysed. The gene is composed of 7665 base pairs including two introns of 3708 and 2886 base pairs. Comparison of the 5'-flanking sequences of the human, bovine and mouse corticotropin-beta-lipotropin precursor genes reveals the presence of a highly conserved region, which contains sequences of 14-15 base pairs homologous with sequences located upstream of the mRNA start site of other glucocorticoid-regulated genes. PMID:6314261

  10. GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality.

    PubMed

    Wu, Thomas D; Reeder, Jens; Lawrence, Michael; Becker, Gabe; Brauer, Matthew J

    2016-01-01

    The programs GMAP and GSNAP, for aligning RNA-Seq and DNA-Seq datasets to genomes, have evolved along with advances in biological methodology to handle longer reads, larger volumes of data, and new types of biological assays. The genomic representation has been improved to include linear genomes that can compare sequences using single-instruction multiple-data (SIMD) instructions, compressed genomic hash tables with fast access using SIMD instructions, handling of large genomes with more than four billion bp, and enhanced suffix arrays (ESAs) with novel data structures for fast access. Improvements to the algorithms have included a greedy match-and-extend algorithm using suffix arrays, segment chaining using genomic hash tables, diagonalization using segmental hash tables, and nucleotide-level dynamic programming procedures that use SIMD instructions and eliminate the need for F-loop calculations. Enhancements to the functionality of the programs include standardization of indel positions, handling of ambiguous splicing, clipping and merging of overlapping paired-end reads, and alignments to circular chromosomes and alternate scaffolds. The programs have been adapted for use in pipelines by integrating their usage into R/Bioconductor packages such as gmapR and HTSeqGenie, and these pipelines have facilitated the discovery of numerous biological phenomena. PMID:27008021

  11. Nucleotide sequence of a small cryptic plasmid from Acidithiobacillus ferrooxidans strain A-6

    SciTech Connect

    F. Roberto

    2003-10-01

    A 2.1 kb cryptic plasmid from Acidithiobacillus ferrooxidans strain A-6 was isolated and cloned into the E. coli vector plasmid, pUC128. The cloned plasmid was mapped by restriction enzyme fragment analysis and subsequently sequenced. At this time over half the plasmid sequence has been determined and compared to sequences in the GenBank nucleotide and protein sequence databases. Much of the plasmid remains cryptic, but substantial nucleotide and protein sequence similarities have been observed to the putative replication protein, RepA, of the small cryptic plasmids pAYS and pAYL found in the ammonia-oxidizing Nitrosomonas sp. Strain ENI-11. These results suggest an entirely new class of plasmid is maintained in at least one strain of Acidithiobacillus ferrooxidans and other acidophilic bacteria, and raises interesting questions about the origin of this plasmid in acidic environments.

  12. The complete nucleotide sequence and genomic characterization of tropical soda apple mosaic virus.

    PubMed

    Fillmer, Kornelia; Adkins, Scott; Pongam, Patchara; D'Elia, Tom

    2016-08-01

    We report the first complete genome sequence of tropical soda apple mosaic virus (TSAMV), a tobamovirus originally isolated from tropical soda apple (Solanum viarum) collected in Okeechobee, Florida. The complete genome of TSAMV is 6,350 nucleotides long and contains four open reading frames encoding the following proteins: i) 126-kDa methyltransferase/helicase (3354 nt), ii) 183-kDa polymerase (4839 nt), iii) movement protein (771 nt) and iv) coat protein (483 nt). The complete genome sequence of TSAMV shares 80.4 % nucleotide sequence identity with pepper mild mottle virus (PMMoV) and 71.2-74.2 % identity with other tobamoviruses naturally infecting members of the Solanaceae plant family. Phylogenetic analysis of the deduced amino acid sequences of the 126-kDa and 183-kDa proteins and the complete genome sequence place TSAMV in a subcluster with PMMoV within the Solanaceae-infecting subgroup of tobamoviruses. PMID:27169599

  13. Complete nucleotide sequence and transcriptional analysis of snakehead fish retrovirus.

    PubMed Central

    Hart, D; Frerichs, G N; Rambaut, A; Onions, D E

    1996-01-01

    The complete genome of the snakehead fish retrovirus has been cloned and sequenced, and its transcriptional profile in cell culture has been determined. The 11.2-kb provirus displays a complex expression pattern capable of encoding accessory proteins and is unique in the predicted location of the env initiation codon and signal peptide upstream of gag and the common splice donor site. The virus is distinguishable from all known retrovirus groups by the presence of an arginine tRNA primer binding site. The coding regions are highly divergent and show a number of unusual characteristics, including a large Gag coiled-coil region, a Pol domain of unknown function, and a long, lentiviral-like, Env cytoplasmic domain. Phylogenetic analysis of the Pol sequence emphasizes the divergent nature of the virus from the avian and mammalian retroviruses. The snakehead virus is also distinct from a previously characterized complex fish retrovirus, suggesting that discrete groups of these viruses have yet to be identified in the lower vertebrates. PMID:8648695

  14. Determining Word Sequence Variation Patterns in Clinical Documents using Multiple Sequence Alignment

    PubMed Central

    Meng, Frank; Morioka, Craig A.; El-Saden, Suzie

    2011-01-01

    Sentences and phrases that represent a certain meaning often exhibit patterns of variation where they differ from a basic structural form by one or two words. We present an algorithm that utilizes multiple sequence alignments (MSAs) to generate a representation of groups of phrases that possess the same semantic meaning but also share in common the same basic word sequence structure. The MSA enables the determination not only of the words that compose the basic word sequence, but also of the locations within the structure that exhibit variation. The algorithm can be utilized to generate patterns of text sequences that can be used as the basis for a pattern-based classifier, as a starting point to bootstrap the pattern building process for a regular expression-based classifiers, or serve to reveal the variation characteristics of sentences and phrases within a particular domain. PMID:22195152

  15. Nucleotide composition of CO1 sequences in Chelicerata (Arthropoda): detecting new mitogenomic rearrangements.

    PubMed

    Arabi, Juliette; Judson, Mark L I; Deharveng, Louis; Lourenço, Wilson R; Cruaud, Corinne; Hassanin, Alexandre

    2012-02-01

    Here we study the evolution of nucleotide composition in third codon-positions of CO1 sequences of Chelicerata, using a phylogenetic framework, based on 180 taxa and three markers (CO1, 18S, and 28S rRNA; 5,218 nt). The analyses of nucleotide composition were also extended to all CO1 sequences of Chelicerata found in GenBank (1,701 taxa). The results show that most species of Chelicerata have a positive strand bias in CO1, i.e., in favor of C nucleotides, including all Amblypygi, Palpigradi, Ricinulei, Solifugae, Uropygi, and Xiphosura. However, several taxa show a negative strand bias, i.e., in favor of G nucleotides: all Scorpiones, Opisthothelae spiders and several taxa within Acari, Opiliones, Pseudoscorpiones, and Pycnogonida. Several reversals of strand-specific bias can be attributed to either a rearrangement of the control region or an inversion of a fragment containing the CO1 gene. Key taxa for which sequencing of complete mitochondrial genomes will be necessary to determine the origin and nature of mtDNA rearrangements involved in the reversals are identified. Acari, Opiliones, Pseudoscorpiones, and Pycnogonida were found to show a strong variability in nucleotide composition. In addition, both mitochondrial and nuclear genomes have been affected by higher substitution rates in Acari and Pseudoscorpiones. The results therefore indicate that these two orders are more liable to fix mutations of all types, including base substitutions, indels, and genomic rearrangements. PMID:22362465

  16. [Nucleotide sequence determination of yeast mitochondrial phenylalanine-tRNA].

    PubMed

    Martin, R; Sibler, A P; Schneller, J M; Keith, G; Stahl, A J; Dirheimer, G

    1978-10-01

    The primary structure of mitochondrial tRNAPhe from Saccharomyces cerevisiae, purified by two-dimensional polyacrylamide gel electrophoresis, was determined using, standard procedures on in vivo 32P-labeled tRNA, as well as the new 5'-end postlabeling techniques. We propose a cloverleaf model which allows for tertiary interaction between cytosine in position 46 and guanine in position 15 and maximizes base pairing in the psi C stem, thus excluding the uracile in position 50 from base pairing in the psi C stem. Comparison of the primary structure of this tRNA with all other known procaryotic, chloroplastic or cytoplasmic tRNAsPhe sequences does not lead to any conclusion about the endosymbiotic theory of mitochondria evolution. PMID:103657

  17. Complete nucleotide sequence of the new potexvirus "Alstroemeria virus X". Brief report.

    PubMed

    Fuji, S; Shinoda, K; Ikeda, M; Furuya, H; Naito, H; Fukumoto, F

    2005-11-01

    A flexuous virus was isolated in Japan from an alstroemeria plant showing mosaic symptoms. The virus had a broad host range but had systemically latent infectivity in alstroemeria. The virus was assigned to the genus Potexvirus based on morphology and physical properties and on an analysis of the complete nucleotide sequence. The genomic RNA of the virus was 7,009 nucleotides in length, excluding the 3'-terminal poly (A) tail. It contained five open reading frames (ORFs), which was consistent with other members of the genus Potexvirus. Although nucleotide sequences of the ORFs differ from previously reported potexviruses, a phylogenetic analysis placed it phylogenetically close to Narcissus mosaic virus and Scallion virus X. Therefore, we propose that this virus should be designated as Alstroemeria virus X (AlsVX). PMID:15986173

  18. Quantitative analysis of the relationship between nucleotide sequence and functional activity.

    PubMed Central

    Stormo, G D; Schneider, T D; Gold, L

    1986-01-01

    Matrices can be used to evaluate sequences for functional activity. Multiple regression can solve for the matrix that gives the best fit between sequence evaluations and quantitative activities. This analysis shows that the best model for context effects on suppression by su2 involves primarily the two nucleotides 3' to the amber codon, and that their contributions are independent and additive. Context effects on 2AP mutagenesis also involve the two nucleotides 3' to the 2AP insertion, but their effects are not independent. In a construct for producing beta-galactosidase, the effects on translational yields of the tri-nucleotide 5' to the initiation codon are dependent on the entire triplet. Models based on these quantitative results are presented for each of the examples. PMID:3092188

  19. Structure-Based Sequence Alignment of the Transmembrane Domains of All Human GPCRs: Phylogenetic, Structural and Functional Implications

    PubMed Central

    Cvicek, Vaclav; Goddard, William A.; Abrol, Ravinder

    2016-01-01

    The understanding of G-protein coupled receptors (GPCRs) is undergoing a revolution due to increased information about their signaling and the experimental determination of structures for more than 25 receptors. The availability of at least one receptor structure for each of the GPCR classes, well separated in sequence space, enables an integrated superfamily-wide analysis to identify signatures involving the role of conserved residues, conserved contacts, and downstream signaling in the context of receptor structures. In this study, we align the transmembrane (TM) domains of all experimental GPCR structures to maximize the conserved inter-helical contacts. The resulting superfamily-wide GpcR Sequence-Structure (GRoSS) alignment of the TM domains for all human GPCR sequences is sufficient to generate a phylogenetic tree that correctly distinguishes all different GPCR classes, suggesting that the class-level differences in the GPCR superfamily are encoded at least partly in the TM domains. The inter-helical contacts conserved across all GPCR classes describe the evolutionarily conserved GPCR structural fold. The corresponding structural alignment of the inactive and active conformations, available for a few GPCRs, identifies activation hot-spot residues in the TM domains that get rewired upon activation. Many GPCR mutations, known to alter receptor signaling and cause disease, are located at these conserved contact and activation hot-spot residue positions. The GRoSS alignment places the chemosensory receptor subfamilies for bitter taste (TAS2R) and pheromones (Vomeronasal, VN1R) in the rhodopsin family, known to contain the chemosensory olfactory receptor subfamily. The GRoSS alignment also enables the quantification of the structural variability in the TM regions of experimental structures, useful for homology modeling and structure prediction of receptors. Furthermore, this alignment identifies structurally and functionally important residues in all human GPCRs

  20. Structure-Based Sequence Alignment of the Transmembrane Domains of All Human GPCRs: Phylogenetic, Structural and Functional Implications.

    PubMed

    Cvicek, Vaclav; Goddard, William A; Abrol, Ravinder

    2016-03-01

    The understanding of G-protein coupled receptors (GPCRs) is undergoing a revolution due to increased information about their signaling and the experimental determination of structures for more than 25 receptors. The availability of at least one receptor structure for each of the GPCR classes, well separated in sequence space, enables an integrated superfamily-wide analysis to identify signatures involving the role of conserved residues, conserved contacts, and downstream signaling in the context of receptor structures. In this study, we align the transmembrane (TM) domains of all experimental GPCR structures to maximize the conserved inter-helical contacts. The resulting superfamily-wide GpcR Sequence-Structure (GRoSS) alignment of the TM domains for all human GPCR sequences is sufficient to generate a phylogenetic tree that correctly distinguishes all different GPCR classes, suggesting that the class-level differences in the GPCR superfamily are encoded at least partly in the TM domains. The inter-helical contacts conserved across all GPCR classes describe the evolutionarily conserved GPCR structural fold. The corresponding structural alignment of the inactive and active conformations, available for a few GPCRs, identifies activation hot-spot residues in the TM domains that get rewired upon activation. Many GPCR mutations, known to alter receptor signaling and cause disease, are located at these conserved contact and activation hot-spot residue positions. The GRoSS alignment places the chemosensory receptor subfamilies for bitter taste (TAS2R) and pheromones (Vomeronasal, VN1R) in the rhodopsin family, known to contain the chemosensory olfactory receptor subfamily. The GRoSS alignment also enables the quantification of the structural variability in the TM regions of experimental structures, useful for homology modeling and structure prediction of receptors. Furthermore, this alignment identifies structurally and functionally important residues in all human GPCRs

  1. Single nucleotide polymorphism mining and nucleotide sequence analysis of Mx1 gene in exonic regions of Japanese quail

    PubMed Central

    Niraj, Diwesh Kumar; Kumar, Pushpendra; Mishra, Chinmoy; Narayan, Raj; Bhattacharya, Tarun Kumar; Shrivastava, Kush; Bhushan, Bharat; Tiwari, Ashok Kumar; Saxena, Vishesh; Sahoo, Nihar Ranjan; Sharma, Deepak

    2015-01-01

    Aim: An attempt has been made to study the Myxovirus resistant (Mx1) gene polymorphism in Japanese quail. Materials and Methods: In the present, investigation four fragments viz. Fragment I of 185 bp (Exon 3 region), Fragment II of 148 bp (Exon 5 region), Fragment III of 161 bp (Exon 7 region), and Fragment IV of 176 bp (Exon 13 region) of Mx1 gene were amplified and screened for polymorphism by polymerase chain reaction-single-strand conformation polymorphism technique in 170 Japanese quail birds. Results: Out of the four fragments, one fragment (Fragment II) was found to be polymorphic. Remaining three fragments (Fragment I, III, and IV) were found to be monomorphic which was confirmed by custom sequencing. Overall nucleotide sequence analysis of Mx1 gene of Japanese quail showed 100% homology with common quail and more than 80% homology with reported sequence of chicken breeds. Conclusion: The Mx1 gene is mostly conserved in Japanese quail. There is an urgent need of comprehensive analysis of other regions of Mx1 gene along with its possible association with the traits of economic importance in Japanese quail. PMID:27047057

  2. The organization of repeated nucleotide sequences in the replicons of mammalian DNA.

    PubMed Central

    Mattern, M R; Painter, R B

    1977-01-01

    Chinese hamster ovary cells were irradiated with 100-5,000 rads of X-rays and inhibition of the initiation of replicons after irradiation was demonstrated by analyzing nascent DNA sedimented in alkaline sucrose gradients. The renaturation kinetics of DNA synthesized during 60 min of incubation after irradiation was compared with that of DNA synthesized during the 60 min after sham irradiation and with that of parental DNA. Nascent DNA from cells whose replicon initiation was inhibited renatured faster than nascent DNA from control cells in the COt range of repeated nucleotide sequences, suggesting that regions of the replicon not close to origins are enriched in repeated sequences and that regions close to origins are enriched in unique sequences. A class of repeated nucleotide sequences may be involved in the regulation of replicon initiation. PMID:880330

  3. On the feasibility of using the intrinsic fluorescence of nucleotides for DNA sequencing.

    SciTech Connect

    Chowdhury, M. H.; Ray, K.; Johnson, R. L.; Gray, S. K.; Pond, J.; Lakowicz, J. R.; Univ. of Maryland; Univ. of Virginia; Lumerical Solutions, Inc.

    2010-04-29

    There is presently a worldwide effort to increase the speed and decrease the cost of DNA sequencing as exemplified by the goal of the National Human Genome Research Institute (NHGRI) to sequence a human genome for under $1000. Several high throughput technologies are under development. Among these, single strand sequencing using exonuclease appear very promising. However, this approach requires complete labeling of at least two bases at a time, with extrinsic high quantum yield probes. This is necessary because nucleotides absorb in the deep ultraviolet (UV) and emit with extremely low quantum yields. Hence intrinsic emission from DNA and nucleotides is not being exploited for DNA sequencing. In the present paper we consider the possibility of identifying single nucleotides using their intrinsic emission. We used the finite-difference time-domain (FDTD) method to calculate the effects of aluminum nanoparticles on nearby fluorophores that emit in the UV. We find that the radiated power of UV fluorophores is significantly increased when they are in close proximity to aluminum nanostructures. We show that there will be increased localized excitation near aluminum particles at wavelengths used to excite intrinsic nucleotide emission. Using FDTD simulation we show that a typical DNA base when coupled to appropriate aluminum nanostructures leads to highly directional emission. Additionally we present experimental results showing that a thin film of nucleotides show enhanced emission when in close proximity to aluminum nanostructures. Finally we provide Monte Carlo simulations that predict high levels of base calling accuracy for an assumed number of photons that is derived from the emission spectra of the intrinsic fluorescence of the bases. Our results suggest that single nucleotides can be detected and identified using aluminum nanostructures that enhance their intrinsic emission. This capability would be valuable for the ongoing efforts toward the $1000 genome.

  4. Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library

    Technology Transfer Automated Retrieval System (TEKTRAN)

    BACKGROUND: To enhance capabilities for genomic analyses in rainbow trout, such as genomic selection, a large suite of polymorphic markers that are amenable to high-throughput genotyping protocols must be identified. Expressed Sequence Tags (ESTs) have been used for single nucleotide polymorphism (...

  5. Complete Nucleotide Sequence of a Citrobacter freundii Plasmid Carrying KPC-2 in a Unique Genetic Environment

    PubMed Central

    Yao, Yancheng; Imirzalioglu, Can; Hain, Torsten; Kaase, Martin; Gatermann, Soeren; Exner, Martin; Mielke, Martin; Hauri, Anja; Dragneva, Yolanta; Bill, Rita; Wendt, Constanze; Wirtz, Angela; Chakraborty, Trinad

    2014-01-01

    The complete and annotated nucleotide sequence of a 54,036-bp plasmid harboring a blaKPC-2 gene that is clonally present in Citrobacter isolates from different species is presented. The plasmid belongs to incompatibility group N (IncN) and harbors the class A carbapenemase KPC-2 in a unique genetic environment. PMID:25395635

  6. The complete nucleotide sequence of Pepper mottle virus-Florida RNA.

    PubMed

    Warren, C E; Murphy, J F

    2003-01-01

    The Pepper mottle virus-Florida (PepMoV-FL) RNA genome was cloned and sequenced, and shown to consist of 9,717 nucleotides (nt) excluding the poly (A) tail. A single open reading frame was identified beginning at nucleotide position 169 encoding a polyprotein of 3068 amino acids. Phylogenetic sequence analysis revealed that of 44 full-length viral RNA genomes analyzed within the family Potyviridae, PepMoV-FL was most closely related to PepMoV-California (PepMoV-CA), Potato virus Y-H (PVY-H), PVY-N, PVY(o) and Potato virus V-DV42 (PVV-DV42). Using the PepMoV-FL sequence as a basis for comparison, the overall nucleotide sequence identity was highest between PepMoV-FL and PepMoV-CA at 93%, while the relationship was more distant with PVV-DV42 at 64% and for the PVY strains at 61%. A unique direct repeat sequence of 76 nucleotides was identified in the PepMoV-FL 3'-untranslated region (UTR), and this repeat sequence was confirmed not to occur in the PepMoV-CA sequence. Since the Florida isolate was among the first of the PepMoV isolates described, extensive biological and serological data on this isolate are available, and it has now been cloned and sequenced, we recommend that PepMoV-FL be recognized as the PepMoV type strain. PMID:12536304

  7. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB

    PubMed Central

    Pruesse, Elmar; Quast, Christian; Knittel, Katrin; Fuchs, Bernhard M.; Ludwig, Wolfgang; Peplies, Jörg; Glöckner, Frank Oliver

    2007-01-01

    Sequencing ribosomal RNA (rRNA) genes is currently the method of choice for phylogenetic reconstruction, nucleic acid based detection and quantification of microbial diversity. The ARB software suite with its corresponding rRNA datasets has been accepted by researchers worldwide as a standard tool for large scale rRNA analysis. However, the rapid increase of publicly available rRNA sequence data has recently hampered the maintenance of comprehensive and curated rRNA knowledge databases. A new system, SILVA (from Latin silva, forest), was implemented to provide a central comprehensive web resource for up to date, quality controlled databases of aligned rRNA sequences from the Bacteria, Archaea and Eukarya domains. All sequences are checked for anomalies, carry a rich set of sequence associated contextual information, have multiple taxonomic classifications, and the latest validly described nomenclature. Furthermore, two precompiled sequence datasets compatible with ARB are offered for download on the SILVA website: (i) the reference (Ref) datasets, comprising only high quality, nearly full length sequences suitable for in-depth phylogenetic analysis and probe design and (ii) the comprehensive Parc datasets with all publicly available rRNA sequences longer than 300 nucleotides suitable for biodiversity analyses. The latest publicly available database release 91 (August 2007) hosts 547 521 sequences split into 461 823 small subunit and 85 689 large subunit rRNAs. PMID:17947321

  8. The nucleotide sequence and genome organization of strawberry mild yellow edge-associated potexvirus.

    PubMed

    Jelkmann, W; Maiss, E; Martin, R R

    1992-02-01

    The nucleotide sequence (5966 nucleotides) of cDNA clones of strawberry mild yellow edge-associated potexvirus was determined. The genome contains six open reading frames (ORFs) encoding putative proteins with Mrs of 149,423, 25,344, 11,576, 8079, 25,714 and 11,216. In the first three putative proteins and the coat protein considerable similarity was found to comparable polypeptides of the potexviruses potato virus X, clover yellow mosaic virus, narcissus mosaic virus, papaya mosaic virus, white clover mosaic virus and lily virus X. PMID:1339469

  9. Nucleotide sequence and genome organization of Dweet mottle virus and its relationship to members of the family Betaflexiviridae

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The nucleotide sequence of Dweet mottle virus (DMV) was determined and compared to sequences of members of the family Alpha- and Beta-flexiviridae. The DMV genome has 8747 nucleotides (nt) excluding the poly-(A) tail at the 3’ end of the genome. The overall G+C content of DMV genomic RNA is 40%. D...

  10. Nucleotide sequence of a cloned woodchuck hepatitis virus genome: evolutional relationship between hepadnaviruses.

    PubMed Central

    Kodama, K; Ogasawara, N; Yoshikawa, H; Murakami, S

    1985-01-01

    We have determined the complete nucleotide sequence of a cloned DNA of woodchuck hepatitis virus (WHV), the most oncogenic virus among hepadnaviruses. The genome, designated WHV2, is 3,320 base pairs long and contains four major open reading frames (ORFs) coded on the same strand of nucleotide sequence as in the human hepatitis B virus (HBV) genome. Comparison of the nucleotide sequence and amino acid sequences deduced from it among the genomes of various hepadnaviruses demonstrates that each protein shows an intrinsic property in conserving its amino acid sequence. A parameter, the ratio of the number of triplets with one-letter change but no amino acid substitution to the total number of triplets in which one-letter change occurred, was introduced to measure the intrinsic properties quantitatively. For each ORF, the parameter gave characteristic values in all combinations. Therefore, the relative evolutional distance between these hepadnaviruses can be measured by the amino acid substitution rate of any ORF. These comparisons suggest that (i) the difference between two WHV clones, WHV1 and WHV2, corresponds to that among clones of a HBV subtype, HBVadr, and (ii) WHV and ground squirrel hepatitis virus can be categorized in a way similar to the subgroups of HBV. PMID:3855246

  11. Nucleotide sequence and genome organization of atractylodes mottle virus, a new member of the genus Carlavirus.

    PubMed

    Zhao, Fumei; Igori, Davaajargal; Lim, Seungmo; Yoo, Ran Hee; Lee, Su-Heon; Moon, Jae Sun

    2015-11-01

    The complete genome sequence of a member of a distinct species of the genus Carlavirus in the family Betaflexiviridae, tentatively named atractylodes mottle virus (AtrMoV), has been determined. Analysis of its genomic organization indicates that it has a single-stranded, positive-sense genomic RNA of 8866 nucleotides, excluding the poly(A) tail, and consists of six open reading frames typical of members of the genus Carlavirus. The individual open reading frames of AtrMoV show moderately low sequence similarity to those of other carlaviruses at the nucleotide and amino acid sequence levels. Pairwise comparison and phylogenetic analysis suggest that AtrMoV is most closely related to chrysanthemum virus B. PMID:26264403

  12. A likelihood method for the detection of selection and recombination using nucleotide sequences.

    PubMed

    Grassly, N C; Holmes, E C

    1997-03-01

    Different regions along nucleotide sequences are often subject to different evolutionary forces. Recombination will result in regions having different evolutionary histories, while selection can cause regions to evolve at different rates. This paper presents a statistical method based on likelihood for detecting such processes by identifying the regions which do not fit with a single phylogenetic topology and nucleotide substitution process along the entire sequence. Subsequent reanalysis of these anomalous regions may then be possible. The method is tested using simulations, and its application is demonstrated using the primate psi eta-globin pseudogene, the V3 region of the envelope gene of HIV-1, and argF sequences from Neisseria bacteria. Reanalysis of anomalous regions is shown to reveal possible immune selection in HIV-1 and recombination in Neisseria. A computer program which implements the method is available. PMID:9066792

  13. Slider—maximum use of probability information for alignment of short sequence reads and SNP detection

    PubMed Central

    Malhis, Nawar; Butterfield, Yaron S. N.; Ester, Martin; Jones, Steven J. M.

    2009-01-01

    Motivation: A plethora of alignment tools have been created that are designed to best fit different types of alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none of these are fully utilizing its probability (prb) output. In this article, we will introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base's probabilities given in the prb files. Results: Compared with other aligners, Slider has higher alignment accuracy and efficiency. In addition, given that Slider matches bases with probabilities other than the most probable, it significantly reduces the percentage of base mismatches. The result is that its SNP predictions are more accurate than other SNP prediction approaches used today that start from the most probable sequence, including those using base quality. Contact: nmalhis@bcgsc.ca Supplementary information and availability: http://www.bcgsc.ca/platform/bioinfo/software/slider PMID:18974170

  14. Complete nucleotide sequence of the structural gene for alkaline proteinase from Pseudomonas aeruginosa IFO 3455.

    PubMed Central

    Okuda, K; Morihara, K; Atsumi, Y; Takeuchi, H; Kawamoto, S; Kawasaki, H; Suzuki, K; Fukushima, J

    1990-01-01

    The DNA-encoding alkaline proteinase (AP) of Pseudomonas aeruginosa IFO 3455 was cloned, and its complete nucleotide sequence was determined. When the cloned gene was ligated to pUC18, the Escherichia coli expression vector, the gene-incorporated bacteria expressed high levels of both AP activity and AP antigens. The amino acid sequence deduced from the nucleotide sequence revealed that the mature AP consists of 467 amino acids with a relative molecular weight of 49,507. The amino acid composition predicted from the DNA sequence was similar to the chemically determined composition of purified AP reported previously. The amino acid sequence analysis revealed that both the N-terminal side sequence of the purified AP and several internal lysyl peptide fragments were identical to the deduced amino acid sequences. The percent homology of amino acid sequences between AP and Serratia protease was about 55%. The zinc ligands and an active site of the AP were predicted by comparing the structure of the enzyme with of Serratia protease, thermolysin, Bacillus subtilis neutral protease, and Pseudomonas elastase. PMID:2123832

  15. QGRS Mapper: a web-based server for predicting G-quadruplexes in nucleotide sequences

    PubMed Central

    Kikin, Oleg; D'Antonio, Lawrence; Bagga, Paramjeet S

    2006-01-01

    The quadruplex structures formed by guanine-rich nucleic acid sequences have received significant attention recently because of growing evidence for their role in important biological processes and as therapeutic targets. G-quadruplex DNA has been suggested to regulate DNA replication and may control cellular proliferation. Sequences capable of forming G-quadruplexes in the RNA have been shown to play significant roles in regulation of polyadenylation and splicing events in mammalian transcripts. Whether quadruplex structure directly plays a role in regulating RNA processing requires investigation. Computational approaches to study G-quadruplexes allow detailed analysis of mammalian genomes. There are no known easily accessible user-friendly tools that can compute G-quadruplexes in the nucleotide sequences. We have developed a web-based server, QGRS Mapper, that predicts quadruplex forming G-rich sequences (QGRS) in nucleotide sequences. It is a user-friendly application that provides many options for defining and studying G-quadruplexes. It performs analysis of the user provided genomic sequences, e.g. promoter and telomeric regions, as well as RNA sequences. It is also useful for predicting G-quadruplex structures in oligonucleotides. The program provides options to search and retrieve desired gene/nucleotide sequence entries from NCBI databases for mapping G-quadruplexes in the context of RNA processing sites. This feature is very useful for investigating the functional relevance of G-quadruplex structure, in particular its role in regulating the gene expression by alternative processing. In addition to providing data on composition and locations of QGRS relative to the processing sites in the pre-mRNA sequence, QGRS Mapper features interactive graphic representation of the data. The user can also use the graphics module to visualize QGRS distribution patterns among all the alternative RNA products of a gene simultaneously on a single screen. QGRS Mapper can be

  16. Progressive structure-based alignment of homologous proteins: Adopting sequence comparison strategies.

    PubMed

    Joseph, Agnel Praveen; Srinivasan, Narayanaswamy; de Brevern, Alexandre G

    2012-09-01

    Comparison of multiple protein structures has a broad range of applications in the analysis of protein structure, function and evolution. Multiple structure alignment tools (MSTAs) are necessary to obtain a simultaneous comparison of a family of related folds. In this study, we have developed a method for multiple structure comparison largely based on sequence alignment techniques. A widely used Structural Alphabet named Protein Blocks (PBs) was used to transform the information on 3D protein backbone conformation as a 1D sequence string. A progressive alignment strategy similar to CLUSTALW was adopted for multiple PB sequence alignment (mulPBA). Highly similar stretches identified by the pairwise alignments are given higher weights during the alignment. The residue equivalences from PB based alignments are used to obtain a three dimensional fit of the structures followed by an iterative refinement of the structural superposition. Systematic comparisons using benchmark datasets of MSTAs underlines that the alignment quality is better than MULTIPROT, MUSTANG and the alignments in HOMSTRAD, in more than 85% of the cases. Comparison with other rigid-body and flexible MSTAs also indicate that mulPBA alignments are superior to most of the rigid-body MSTAs and highly comparable to the flexible alignment methods. PMID:22676903

  17. Complete nucleotide sequence of the Streptomyces lividans plasmid pIJ101 and correlation of the sequence with genetic properties.

    PubMed Central

    Kendall, K J; Cohen, S N

    1988-01-01

    The complete nucleotide sequence of the multicopy Streptomyces plasmid pIJ101 has been determined and correlated with previously published genetic data. The circular DNA molecule is 8,830 nucleotides in length and has a G+C composition of 72.98%. The use of a computer program, FRAME, enabled identification in the sequence of seven open reading frames, four of which, tra (621 amino acids [aa]), spdA (146 aa), spdB (274 aa), and kilB (177 aa), appear to be genes involved in plasmid transfer. At least two of the above genes are predicted to be transcribed by known promoters that are regulated in trans by the products of the korA (241 aa) and korB (80 aa) loci on the plasmid. The segment of the plasmid capable of autonomous replication contains one large open reading frame (rep; 450 aa) and a noncoding region presumed to be the origin of replication. Four other small (less than 90 aa) open reading frames are also present on the plasmid, although no function can be attributed to them. The sequence of the pIJ101 replication segment present in several widely used cloning vectors (e.g., pIJ350 and pIJ702) has also been determined, so that the complete nucleotide sequences of these vectors are now known. PMID:3170481

  18. A comparative study of 2',3'-cyclic-nucleotide 3'-phosphodiesterase in vertebrates: cDNA cloning and amino acid sequences for chicken and bullfrog enzymes.

    PubMed

    Kasama-Yoshida, H; Tohyama, Y; Kurihara, T; Sakuma, M; Kojima, H; Tamai, Y

    1997-10-01

    In mammalian brain, two 2',3'-cyclic-nucleotide 3'-phosphodiesterase (EC 3.1.4.37) isoforms, CNP1 and CNP2, are translated, respectively, from the two mRNAs, which have been transcribed and processed by alternative use of the two transcription start points and by differential splicing. In the present study, the cDNAs encoding chicken CNP2 and bullfrog CNP1, respectively, were isolated, and the amino acid sequences of chicken CNP2 and bullfrog CNP1 were deduced. Western blot analysis showed that chicken brain contains a major CNP2-type protein together with a minor unidentified isoform, and bullfrog brain contains only a CNP1-type protein. All available amino acid sequences of vertebrate 2',3'-cyclic-nucleotide 3'-phosphodiesterases were aligned and compared. Three conserved motif sequences were noted: (a) an ATP-binding site near the amino terminus, (b) an isoprenylation site at the carboxyl terminus, and (c) a probable catalytic site resembling the active site of beta-ketoacyl synthase (EC 2.3.1.41). The second and the third motifs are conserved also in goldfish RICH (regeneration-induced 2',3'-cyclic-nucleotide 3'-phosphodiesterase homologue), which has been shown recently to have 2',3'-cyclic-nucleotide 3'-phosphodiesterase activity. The third motif (probably catalytic site) was assigned for the first time in the present report. PMID:9326261

  19. Sequence selective naked-eye detection of DNA harnessing extension of oligonucleotide-modified nucleotides.

    PubMed

    Verga, Daniela; Welter, Moritz; Marx, Andreas

    2016-02-01

    DNA polymerases can efficiently and sequence selectively incorporate oligonucleotide (ODN)-modified nucleotides and the incorporated oligonucleotide strand can be employed as primer in rolling circle amplification (RCA). The effective amplification of the DNA primer by Φ29 DNA polymerase allows the sequence-selective hybridisation of the amplified strand with a G-quadruplex DNA sequence that has horse radish peroxidase-like activity. Based on these findings we develop a system that allows DNA detection with single-base resolution by naked eye. PMID:26774580

  20. The nucleotide sequence at the termini of adenovirus type 5 DNA.

    PubMed Central

    Steenbergh, P H; Maat, J; van Ormondt, H; Sussenbach, J S

    1977-01-01

    The sequences of the first 194 base pairs at both termini of adenovirus type 5 (Ad5) DNA have been determined, using the chemical degradation technique developed by Maxam and Gilbert (Proc. Nat. Acad. Sci. USA 74 (1977), pp. 560-564). The nucleotide sequences 1-75 were confirmed by analysis of labeled RNA transcribed from the terminal HhaI fragments in vitro. The sequence data show that Ad5 DNA has a perfect inverted terminal repetition of 103 base pairs long. Images PMID:600799

  1. Complete nucleotide sequence of a subviral DNA molecule of porcine circovirus type 2.

    PubMed

    Wen, Han

    2016-07-01

    Porcine circovirus type 2 (PCV2) is a member of the genus Circovirus in the family Circoviridae. Most subgenomic molecules of PCV2 have been mapped. Here, the first full-length sequence of a subviral molecule of PCV2 (CH-IVT12) containing a reverse complement sequence of the PCV2 genome was determined by sequencing DNA extracted from PK15 cells infected with PCV2. The circular CH-IVT12 DNA consists of 1136 nucleotides and contains one major open reading frame. PMID:27084550

  2. Nucleotide sequence of 5S ribosomal RNA from Aspergillus nidulans and Neurospora crassa.

    PubMed Central

    Piechulla, B; Hahn, U; McLaughlin, L W; Küntzel, H

    1981-01-01

    The nucleotide sequences of 5S rRNA molecules isolated from the cytosol and the mitochondria of the ascomycetes A. nidulans and N. crassa were determined by partial chemical cleavage of 3'-terminally labelled RNA. The sequence identity of the cytosolic and mitochondrial RNA preparations confirms the absence of mitochondrion-specific 5S rRNA in these fungi. The sequences of the two organisms differ in 35 positions, and each sequence differs from yeast 5S rRNA in 44 positions. Both molecules contain the sequence GCUC in place of GAAC or GAUY found in all other 5S rRNAs, indicating that this region is not universally involved in base-pairing to the invariant GTpsiC sequence of tRNAs. Images PMID:6453331

  3. Nucleotide sequence specifying the glycoprotein gene, gB, of herpes simplex virus type 1.

    PubMed

    Bzik, D J; Fox, B A; DeLuca, N A; Person, S

    1984-03-01

    The nucleotide sequence thought to specify the glycoprotein gene, gB, of the KOS strain of herpes simplex virus type 1 (HSV-1) has been determined. A 3.1-kilobase (kb), viral-specified RNA was mapped to the left half of the BamHI-G fragment (0.345 to 0.399 map units). TATA, CAT-box, and possible mRNA start sequences characteristic of HSV-1 genes are found near 0.368 map units. The first available ATG codon is at 0.366 and the first in-phase chain terminator at 0.348 map units. A polyA-addition signal (AATAAA) occurs 17 nucleotides past the chain terminator. Translation of these sequences would yield a 100.3-kilodalton (kDa) polypeptide characterized by a 5' signal sequence, nine N-linked saccharide addition sites, a strongly hydrophobic membrane-spanning sequence, and a highly charged 3' cytoplasmic anchor sequence. Two mutants of KOS, tsJ12 and tsJ20, that are temperature-sensitive for viral growth and for the production of gB, have been physically mapped to 0.357 to 0.360 and 0.360 to 0.364 map units, respectively (DeLuca et al., in preparation). The nucleotide sequence of the mutants was determined in these regions. In both cases a single amino acid replacement within the 100.3-kDa polypeptide is predicted from the sequence analysis. PMID:6324454

  4. Linking the human cytogenetic map with nucleotide sequence: the CCAP clone set.

    PubMed

    Jang, Wonhee; Yonescu, Raluca; Knutsen, Turid; Brown, Theresa; Reppert, Tricia; Sirotkin, Karl; Schuler, Gregory D; Ried, Thomas; Kirsch, Ilan R

    2006-07-15

    We present the completed dataset and clone repository of the Cancer Chromosome Aberration Project (CCAP), an initiative developed and funded through the intramural program of the U.S. National Cancer Institute, to provide seamless linkage of human cytogenetic markers with the primary nucleotide sequence of the human genome. Spaced at 1-2 Mb intervals across the human genome, 1,339 bacterial artificial chromosome (BAC) clones have been localized to chromosomal bands through high-resolution fluorescence in situ hybridization (FISH) mapping. Of these clones, 99.8% can be positioned on the primary human genome sequence and 95% are placed at or close to their precise nucleotide starts and stops. This dataset can be studied and manipulated within generally available public Web sites. The clones are available from a commercial repository. The CCAP BAC clone set provides anchors for the interrogation of gene and sequence involvement in oncogenic and developmental disorders when the starting point is the recognition of a structural, numerical, or interstitial chromosomal aberration. This dataset also provides a current view of the quality and coherence of the available genome sequence and insight into the nucleotide and three-dimensional structures that manifest as Giemsa light and dark chromosomal banding patterns. PMID:16843097

  5. Complete nucleotide sequence of cherry virus A (CVA) infecting sweet cherry in India.

    PubMed

    Noorani, M S; Awasthi, P; Singh, Rahul Mohan; Ram, Raja; Sharma, M P; Singh, S R; Ahmed, N; Hallan, V; Zaidi, A A

    2010-12-01

    Cherry virus A (CVA) is a graft-transmissible member of the genus Capillovirus that infects different stone fruits. Sweet cherry (Prunus avium L; family Rosaceae) is an important deciduous temperate fruit crop in the Western Himalayan region of India. In order to determine the health status of cherry plantations and the incidence of the virus in India, cherry orchards in the states of Jammu and Kashmir (J&K) and Himachal Pradesh (H.P.) were surveyed during the months of May and September 2009. The incidence of CVA was found to be 28 and 13% from J&K and H.P., respectively, by RT-PCR. In order to characterize the virus at the molecular level, the complete genome was amplified by RT-PCR using specific primers. The amplicon of about 7.4 kb was sequenced and was found to be 7,379 bp long, with sequence specificity to CVA. The genome organization was similar to that of isolates characterized earlier, coding for two ORFs, in which ORF 2 is nested in ORF1. The complete sequence was 81 and 84% similar to that of the type isolate at the nucleotide and amino acid level, respectively, with 5' and 3' UTRs of 54 and 299 nucleotides, respectively. This is the first report of the complete nucleotide sequence of cherry virus A infecting sweet cherry in India. PMID:20938696

  6. Nucleotide sequence discrepancies within the GA strain of Marek's disease virus

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Comparative genomics between 9 gallid herpesvirus type 2 strains have singled out the virulent (v) prototype strain GA as phylogenetically distant from other v pathotypes. Multiple amino acid alignments of otherwise highly conserved unique long (UL) genes have indicated sequence discrepancies within...

  7. Complete nucleotide sequence of the hypervirulent CFH strain of beet curly top virus.

    PubMed

    Stenger, D C

    1994-01-01

    The complete nucleotide sequence of the hypervirulent CFH strain of beet curly top geminivirus (BCTV) has been determined. The circular DNA genome of BCTV-CFH consists of 2,927 nucleotides and shares extensive sequence homology with the biologically distinct California strain of BCTV. Analysis of the CFH nucleotide sequence indicated that the rightward open reading frames (ORFs) R1, R2, and R3 are highly conserved (> 95% amino acid chemical similarity) in the CFH and California strains, although CFH ORF R2 was extended by 24 carboxy-terminal amino acid residues not present in the California strain. The CFH leftward ORFs L1, L2, and L3 shared varying levels of amino acid chemical similarity with the corresponding ORFs of the California strain (78.8, 66.5, and 86.7%, respectively). CFH ORF L4 was the least conserved ORF present in both strains, encoding a 9.9-kDa protein of 87 amino acid residues, which shares 57.6% chemical similarity with 85 carboxy-terminal amino acid residues of the 19.4-kDa ORF L4 of the California strain. The CFH DNA sequence also contained a unique 12.5-kDa ORF (R4); however, there is no evidence to suggest that R4 is expressed. Comparison of the CFH and California strain nucleotide sequences indicates that certain regions of the BCTV genome have diverged, and this divergence may account for differences in the pathogenic properties of the two strains. PMID:8167369

  8. BarraCUDA - a fast short read sequence aligner using graphics processing units

    PubMed Central

    2012-01-01

    Background With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General purpose computing on graphics processing units (GPGPU), extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software that is based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computational-intensive alignment component of BWA to GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers a magnitude of performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput. Conclusions BarraCUDA is designed to take advantage of the parallelism of GPU to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part streamline the current bioinformatics pipeline such that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net PMID:22244497

  9. Linking GPS and travel diary data using sequence alignment in a study of children's independent mobility

    PubMed Central

    2011-01-01

    Background Global positioning systems (GPS) are increasingly being used in health research to determine the location of study participants. Combining GPS data with data collected via travel/activity diaries allows researchers to assess where people travel in conjunction with data about trip purpose and accompaniment. However, linking GPS and diary data is problematic and to date the only method has been to match the two datasets manually, which is time consuming and unlikely to be practical for larger data sets. This paper assesses the feasibility of a new sequence alignment method of linking GPS and travel diary data in comparison with the manual matching method. Methods GPS and travel diary data obtained from a study of children's independent mobility were linked using sequence alignment algorithms to test the proof of concept. Travel diaries were assessed for quality by counting the number of errors and inconsistencies in each participant's set of diaries. The success of the sequence alignment method was compared for higher versus lower quality travel diaries, and for accompanied versus unaccompanied trips. Time taken and percentage of trips matched were compared for the sequence alignment method and the manual method. Results The sequence alignment method matched 61.9% of all trips. Higher quality travel diaries were associated with higher match rates in both the sequence alignment and manual matching methods. The sequence alignment method performed almost as well as the manual method and was an order of magnitude faster. However, the sequence alignment method was less successful at fully matching trips and at matching unaccompanied trips. Conclusions Sequence alignment is a promising method of linking GPS and travel diary data in large population datasets, especially if limitations in the trip detection algorithm are addressed. PMID:22142322

  10. Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties

    PubMed Central

    Neuwald, Andrew F.; Altschul, Stephen F.

    2016-01-01

    We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a “top-down” strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins’ structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO’s superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/. PMID

  11. Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties.

    PubMed

    Neuwald, Andrew F; Altschul, Stephen F

    2016-05-01

    We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/. PMID:27192614

  12. The nucleotide sequence at the 5' end of foot and mouth disease virus RNA.

    PubMed Central

    Harris, T J

    1979-01-01

    Foot and mouth disease virus RNA has been treated with RNase H in the presence of oligo (dG) specifically to digest the poly(C) tract which lies near the 5' end of the molecule (10). The short (S) fragment containing the 5' end of the RNA was separated from the remainder of the RNA (L fragment) by gel electrophoresis. RNA ligase mediated labelling of the 3' end of S fragment showed that the RNase H digestion gave rise to molecules that differed only in the number of cytidylic acid residues remaining at their 3' ends and did not leave the unique 3' end necessary for fast sequence analysis. As the 5' end of S fragment prepared form virus RNA is blocked by VPg, S fragment was prepared from virus specific messenger RNA which does not contain this protein. This RNA was labelled at the 5' end using polynucleotide kinase and the sequence of 70 nucleotides at the 5' end determined by partial enzyme digestion sequencing on polyacrylamide gels. Some of this sequence was confirmed from an analysis of the oligonucleotides derived by RNase T1 digestion of S fragment. The sequence obtained indicates that there is a stable hairpin loop at the 5' terminus of the RNA before an initiation codon 33 nucleotides from the 5' end. In addition, the RNase T1 analysis suggests that there are short repeated sequences in S fragment and that an eleven nucleotide inverted complementary repeat of a sequence near the 3' end of the RNA is present at the junction of S fragment and the poly(C) tract. Images PMID:231762

  13. Upcoming challenges for multiple sequence alignment methods in the high-throughput era

    PubMed Central

    Kemena, Carsten; Notredame, Cedric

    2009-01-01

    This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches. Contact: cedric.notredame@crg.es PMID:19648142

  14. Quadfinder: server for identification and analysis of quadruplex-forming motifs in nucleotide sequences

    PubMed Central

    Scaria, Vinod; Hariharan, Manoj; Arora, Amit; Maiti, Souvik

    2006-01-01

    G-quadruplex secondary structures, which play a structural role in repetitive DNA such as telomeres, may also play a functional role at other genomic locations as targetable regulatory elements which control gene expression. The recent interest in application of quadruplexes in biological systems prompted us to develop a tool for the identification and analysis of quadruplex-forming nucleotide sequences especially in the RNA. Here we present Quadfinder, an online server for prediction and bioinformatics of uni-molecular quadruplex-forming nucleotide sequences. The server is designed to be user-friendly and needs minimal intervention by the user, while providing flexibility of defining the variants of the motif. The server is freely available at URL . PMID:16845097

  15. Nucleotide sequence of cDNA clones of the murine myb proto-oncogene.

    PubMed Central

    Gonda, T J; Gough, N M; Dunn, A R; de Blaquiere, J

    1985-01-01

    We have isolated cDNA clones of murine c-myb mRNA which contain approximately 2.8 kb of the 3.9-kb mRNA sequence. Nucleotide sequencing has shown that these clones extend both 5' and 3' to sequences homologous to the v-myb oncogenes of avian myeloblastosis virus and avian leukemia virus E26. The sequence contains an open reading frame of 1944 nucleotides, and could encode a protein which is both highly homologous, and of similar size (71 kd), to the chicken c-myb protein. Examination of the deduced amino acid sequence of the murine c-myb protein revealed the presence of a 3-fold tandem repeat of 52 residues near the N terminus of the protein, and has enabled prediction of some of the likely structural features of the protein. These include a high alpha-helix content, a basic region toward the N terminus of the protein and an overall globular configuration. The arrangement of genomic c-myb sequences, detected using the cDNA clones as probes, was compared with the reported structure of rearranged c-myb in certain tumour cells. This comparison suggested that the rearranged c-myb gene may encode a protein which, like the v-myb protein, lacks the N-terminal region of c-myb. Images Fig. 5. PMID:2998780

  16. Multimodal phylogeny for taxonomy: integrating information from nucleotide and amino acid sequences.

    PubMed

    Bicego, Manuele; Dellaglio, Franco; Felis, Giovanna E

    2007-10-01

    The crucial role played by the analysis of microbial diversity in biotechnology-based innovations has increased the interest in the microbial taxonomy research area. Phylogenetic sequence analyses have contributed significantly to the advances in this field, also in the view of the large amount of sequence data collected in recent years. Phylogenetic analyses could be realized on the basis of protein-encoding nucleotide sequences or encoded amino acid molecules: these two mechanisms present different peculiarities, still starting from two alternative representations of the same information. This complementarity could be exploited to achieve a multimodal phylogenetic scheme that is able to integrate gene and protein information in order to realize a single final tree. This aspect has been poorly addressed in the literature. In this paper, we propose to integrate the two phylogenetic analyses using basic schemes derived from the multimodality fusion theory (or multiclassifier systems theory), a well-founded and rigorous branch for which its powerfulness has already been demonstrated in other pattern recognition contexts. The proposed approach could be applied to distance matrix-based phylogenetic techniques (like neighbor joining), resulting in a smart and fast method. The proposed methodology has been tested in a real case involving sequences of some species of lactic acid bacteria. With this dataset, both nucleotide sequence- and amino acid sequence-based phylogenetic analyses present some drawbacks, which are overcome with the multimodal analysis. PMID:17933011

  17. Common nucleotide sequence of structural gene encoding fibroblast growth factor 4 in eight cattle derived from three breeds.

    PubMed

    Sato, Sho; Takahashi, Toshikiyo; Nishinomiya, Hiroshi; Katoh, Makiko; Itoh, Ryu; Yokoo, Masaki; Yokoo, Mari; Iha, Momoe; Mori, Yuki; Kasuga, Kano; Kojima, Ikuo; Kobayashi, Masayuki

    2012-03-01

    Fibroblast growth factor 4 (FGF4) is considered as a crucial gene for the proper development of bovine embryos. However, the complete nucleotide sequences of the structural genes encoding FGF4 in identified breeds are still unknown. In the present study, direct sequencing of PCR products derived from genomic DNA samples obtained from three Japanese Black, two Japanese Shorthorn and three Holstein cattle, revealed that the nucleotide sequences of the structural gene encoding FGF4 matched completely among these eight cattle. On the other hand, differences in the nucleotide sequences, leading to substitutions, insertions or deletions of amino acid residues were detected when compared with the already reported sequence from unidentified breeds. We cannot rule out a possibility that the structural gene elucidated in the present study is widely distributed in cattle. To the best of our knowledge, this is the first determination of the complete nucleotide sequence of the structural gene encoding bovine FGF4 in identified breeds. PMID:22435631

  18. Nucleotide sequence, heterologous expression and novel purification of DNA ligase from Bacillus stearothermophilus(1).

    PubMed

    Brannigan, J A; Ashford, S R; Doherty, A J; Timson, D J; Wigley, D B

    1999-07-13

    The gene for DNA ligase (EC 6.5.1.2) from thermophilic bacterium Bacillus stearothermophilus NCA1503 has been cloned and the complete nucleotide sequence determined. The ligase gene encodes a protein 670 amino acids in length. The gene was overexpressed in Escherichia coli and the enzyme has been purified to homogeneity. Preliminary characterisation confirms that it is a thermostable, NAD(+)-dependent DNA ligase. PMID:10407164

  19. Complete Nucleotide Sequence of a French Isolate of Maize rough dwarf virus, a Fijivirus Member in the Family Reoviridae

    PubMed Central

    Svanella-Dumas, L.; Marais, A.; Faure, C.; Theil, S.; Thibord, J. B.

    2016-01-01

    The complete nucleotide sequence of a French isolate of Maize rough dwarf virus (MRDV) was determined by next-generation sequencing and compared with the single available complete sequence and with the partial sequences of two additional isolates available in online databases. PMID:27445367

  20. Nucleotide sequence of the hypervariable region of the human C2 gene

    SciTech Connect

    Zhu, Z.B.; Volanakis, J.V. )

    1991-03-15

    It has been previously suggested that the multiallelic Bam H1/Sst I RFLPs of the human C2 gene arose through deletion/insertion of a tandemly-repeated minisatellite region. In this study the authors subcloned and sequenced the Sst I polymorphic fragment of the b haplotype of the C2 gene. This restriction fragment is 2,450 bp long and maps 1,550 bp 3{prime} of exon 3. Its nucleotide sequence is characterized by the presence of at least 4 different repeated regions varying in size from 18 to 58 bp. One of these regions starting at position 1,413 is 48 bp long and is repeated five times. The first 3 repeats are in tandem and are separated by 72 bp from two additional tandem repeats. Sequence homology among the 5 repeats ranges between 93 and 98%. Eighty three percent of the nucleotides of the repeated-region are G or C. It seems likely that this nucleotide repeat resulted in the multiallelic RFLPs through a mechanism of unequal recombination or replication slippage.

  1. Efficient mapping of genomic sequences to optimize multiple pairwise alignment in hybrid cluster platforms.

    PubMed

    Montañola, Alberto; Roig, Concepció; Hernández, Porfidio

    2014-01-01

    Multiple sequence alignment (MSA), used in biocomputing to study similarities between different genomic sequences, is known to require important memory and computation resources. Nowadays, researchers are aligning thousands of these sequences, creating new challenges in order to solve the problem using the available resources efficiently. Determining the efficient amount of resources to allocate is important to avoid waste of them, thus reducing the economical costs required in running for example a specific cloud instance. The pairwise alignment is the initial key step of the MSA problem, which will compute all pair alignments needed. We present a method to determine the optimal amount of memory and computation resources to allocate by the pairwise alignment, and we will validate it through a set of experimental results for different possible inputs. These allow us to determine the best parameters to configure the applications in order to use effectively the available resources of a given system. PMID:25339085

  2. Gapped alignment of protein sequence motifs through Monte Carlo optimization of a hidden Markov model

    PubMed Central

    Neuwald, Andrew F; Liu, Jun S

    2004-01-01

    Background Certain protein families are highly conserved across distantly related organisms and belong to large and functionally diverse superfamilies. The patterns of conservation present in these protein sequences presumably are due to selective constraints maintaining important but unknown structural mechanisms with some constraints specific to each family and others shared by a larger subset or by the entire superfamily. To exploit these patterns as a source of functional information, we recently devised a statistically based approach called contrast hierarchical alignment and interaction network (CHAIN) analysis, which infers the strengths of various categories of selective constraints from co-conserved patterns in a multiple alignment. The power of this approach strongly depends on the quality of the multiple alignments, which thus motivated development of theoretical concepts and strategies to improve alignment of conserved motifs within large sets of distantly related sequences. Results Here we describe a hidden Markov model (HMM), an algebraic system, and Markov chain Monte Carlo (MCMC) sampling strategies for alignment of multiple sequence motifs. The MCMC sampling strategies are useful both for alignment optimization and for adjusting position specific background amino acid frequencies for alignment uncertainties. Associated statistical formulations provide an objective measure of alignment quality as well as automatic gap penalty optimization. Improved alignments obtained in this way are compared with PSI-BLAST based alignments within the context of CHAIN analysis of three protein families: Giα subunits, prolyl oligopeptidases, and transitional endoplasmic reticulum (p97) AAA+ ATPases. Conclusion While not entirely replacing PSI-BLAST based alignments, which likewise may be optimized for CHAIN analysis using this approach, these motif-based methods often more accurately align very distantly related sequences and thus can provide a better measure of

  3. Nucleotide sequence variation of the envelope protein gene identifies two distinct genotypes of yellow fever virus.

    PubMed Central

    Chang, G J; Cropp, B C; Kinney, R M; Trent, D W; Gubler, D J

    1995-01-01

    The evolution of yellow fever virus over 67 years was investigated by comparing the nucleotide sequences of the envelope (E) protein genes of 20 viruses isolated in Africa, the Caribbean, and South America. Uniformly weighted parsimony algorithm analysis defined two major evolutionary yellow fever virus lineages designated E genotypes I and II. E genotype I contained viruses isolated from East and Central Africa. E genotype II viruses were divided into two sublineages: IIA viruses from West Africa and IIB viruses from America, except for a 1979 virus isolated from Trinidad (TRINID79A). Unique signature patterns were identified at 111 nucleotide and 12 amino acid positions within the yellow fever virus E gene by signature pattern analysis. Yellow fever viruses from East and Central Africa contained unique signatures at 60 nucleotide and five amino acid positions, those from West Africa contained unique signatures at 25 nucleotide and two amino acid positions, and viruses from America contained such signatures at 30 nucleotide and five amino acid positions in the E gene. The dissemination of yellow fever viruses from Africa to the Americas is supported by the close genetic relatedness of genotype IIA and IIB viruses and genetic evidence of a possible second introduction of yellow fever virus from West Africa, as illustrated by the TRINID79A virus isolate. The E protein genes of American IIB yellow fever viruses had higher frequencies of amino acid substitutions than did genes of yellow fever viruses of genotypes I and IIA on the basis of comparisons with a consensus amino acid sequence for the yellow fever E gene. The great variation in the E proteins of American yellow fever virus probably results from positive selection imposed by virus interaction with different species of mosquitoes or nonhuman primates in the Americas. PMID:7637022

  4. Cloning and nucleotide sequence of the gene encoding the Ecal DNA methyltransferase.

    PubMed Central

    Brenner, V; Venetianer, P; Kiss, A

    1990-01-01

    The gene coding for the GGTNACC specific Ecal DNA methyltransferase (M.Ecal) has been cloned in E. coli from Enterobacter cloacae and its nucleotide sequence has been determined. The ecalM gene codes for a protein of 452 amino acids (Mr: 51,111). It was determined that M.Ecal is an adenine methyltransferase. M.Ecal shows limited amino acid sequence similarity to other adenine methyltransferases. A clone that expresses Ecal methyltransferase at high level was constructed. Images PMID:2183182

  5. Nucleotide sequencing and characterization of the genes encoding benzene oxidation enzymes of Pseudomonas putida

    SciTech Connect

    Irie, S.; Doi, S.; Yorifuji, T.; Takagi, M.; Yano, K.

    1987-11-01

    The nucleotide sequence of the genes from Pseudomonas putida encoding oxidation of benzene to catechol was determined. Five open reading frames were found in the sequence. Four corresponding protein molecules were detected by a DNA-directed in vitro translation system. Escherichia coli cells containing the fragment with the four open reading frames transformed benzene to cis-benzene glycol, which is an intermediate of the oxidation of benzene to catechol. The relation between the product of each cistron and the components of the benzene oxidation enzyme system is discussed.

  6. A Rank-Based Sequence Aligner with Applications in Phylogenetic Analysis

    PubMed Central

    2014-01-01

    Recent tools for aligning short DNA reads have been designed to optimize the trade-off between correctness and speed. This paper introduces a method for assigning a set of short DNA reads to a reference genome, under Local Rank Distance (LRD). The rank-based aligner proposed in this work aims to improve correctness over speed. However, some indexing strategies to speed up the aligner are also investigated. The LRD aligner is improved in terms of speed by storing -mer positions in a hash table for each read. Another improvement, that produces an approximate LRD aligner, is to consider only the positions in the reference that are likely to represent a good positional match of the read. The proposed aligner is evaluated and compared to other state of the art alignment tools in several experiments. A set of experiments are conducted to determine the precision and the recall of the proposed aligner, in the presence of contaminated reads. In another set of experiments, the proposed aligner is used to find the order, the family, or the species of a new (or unknown) organism, given only a set of short Next-Generation Sequencing DNA reads. The empirical results show that the aligner proposed in this work is highly accurate from a biological point of view. Compared to the other evaluated tools, the LRD aligner has the important advantage of being very accurate even for a very low base coverage. Thus, the LRD aligner can be considered as a good alternative to standard alignment tools, especially when the accuracy of the aligner is of high importance. Source code and UNIX binaries of the aligner are freely available for future development and use at http://lrd.herokuapp.com/aligners. The software is implemented in C++ and Java, being supported on UNIX and MS Windows. PMID:25133391

  7. Nucleotide sequence of an exceptionally long 5.8S ribosomal RNA from Crithidia fasciculata.

    PubMed

    Schnare, M N; Gray, M W

    1982-03-25

    In Crithidia fasciculata, a trypanosomatid protozoan, the large ribosomal subunit contains five small RNA species (e, f, g, i, j) in addition to 5S rRNA [Gray, M.W. (1981) Mol. Cell. Biol. 1, 347-357]. The complete primary sequence of species i is shown here to be pAACGUGUmCGCGAUGGAUGACUUGGCUUCCUAUCUCGUUGA ... AGAmACGCAGUAAAGUGCGAUAAGUGGUApsiCAAUUGmCAGAAUCAUUCAAUUACCGAAUCUUUGAACGAAACGG ... CGCAUGGGAGAAGCUCUUUUGAGUCAUCCCCGUGCAUGCCAUAUUCUCCAmGUGUCGAA(C)OH. This sequence establishes that species i is a 5.8S rRNA, despite its exceptional length (171-172 nucleotides). The extra nucleotides in C. fasciculata 5.8S rRNA are located in a region whose primary sequence and length are highly variable among 5.8S rRNAs, but which is capable of forming a stable hairpin loop structure (the "G+C-rich hairpin"). The sequence of C. fasciculata 5.8S rRNA is no more closely related to that of another protozoan, Acanthamoeba castellanii, than it is to representative 5.8S rRNA sequences from the other eukaryotic kingdoms, emphasizing the deep phylogenetic divisions that seem to exist within the Kingdom Protista. PMID:7079176

  8. Overlapping open reading frames revealed by complete nucleotide sequencing of turnip yellow mosaic virus genomic RNA.

    PubMed Central

    Morch, M D; Boyer, J C; Haenni, A L

    1988-01-01

    The complete nucleotide sequence of turnip yellow mosaic virus (TYMV) genomic RNA has been determined on a set of overlapping cDNA clones using a sequential sequencing strategy. The RNA is 6318 nucleotides long, excluding the cap structure. The genome organization deduced from the sequence confirms previous results of in vitro translation. A novel open reading frame (ORF) putatively encoding a Pro-rich and very basic 69K (K = kilodalton) protein is detected at the 5' end of the genome. It is initiated at the first AUG codon on the RNA and overlaps the major ORF that encodes the non structural 206K (previously referred to as 195K) protein of TYMV; its function is unknown. Several amino acid consensus sequences already described among plant and animal viruses are also found in the TYMV-encoded polypeptides. A comparison with other viruses whose RNA sequence is known leads to the conclusion that TYMV belongs to the "Sindbis-like" supergroup of viruses and could be related to Semliki forest virus. PMID:3399388

  9. Nucleotide sequences of immunoglobulin eta genes of chimpanzee and orangutan: DNA molecular clock and hominoid evolution

    SciTech Connect

    Sakoyama, Y.; Hong, K.J.; Byun, S.M.; Hisajima, H.; Ueda, S.; Yaoita, Y.; Hayashida, H.; Miyata, T.; Honjo, T.

    1987-02-01

    To determine the phylogenetic relationships among hominoids and the dates of their divergence, the complete nucleotide sequences of the constant region of the immunoglobulin eta-chain (C/sub eta1/) genes from chimpanzee and orangutan have been determined. These sequences were compared with the human eta-chain constant-region sequence. A molecular clock (silent molecular clock), measured by the degree of sequence divergence at the synonymous (silent) positions of protein-encoding regions, was introduced for the present study. From the comparison of nucleotide sequences of ..cap alpha../sub 1/-antitrypsin and ..beta..- and delta-globulin genes between humans and Old World monkeys, the silent molecular clock was calibrated: the mean evolutionary rate of silent substitution was determined to be 1.56 x 10/sup -9/ substitutions per site per year. Using the silent molecular clock, the mean divergence dates of chimpanzee and orangutan from the human lineage were estimated as 6.4 +/- 2.6 million years and 17.3 +/- 4.5 million years, respectively. It was also shown that the evolutionary rate of primate genes is considerably slower than those of other mammalian genes.

  10. Cloning and nucleotide sequence of the Salmonella typhimurium dcp gene encoding dipeptidyl carboxypeptidase.

    PubMed Central

    Hamilton, S; Miller, C G

    1992-01-01

    Plasmids carrying the Salmonella typhimurium dcp gene were isolated from a pBR328 library of Salmonella chromosomal DNA by screening for complementation of a peptide utilization defect conferred by a dcp mutation. Strains carrying these plasmids overproduced dipeptidyl carboxypeptidase approximately 50-fold. The nucleotide sequence of a 2.8-kb region of one of these plasmids contained an open reading frame coding for a protein of 77,269 Da, in agreement with the 80-kDa size for dipeptidyl carboxypeptidase (determined by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and gel filtration). The N-terminal amino acid sequence of dipeptidyl carboxypeptidase purified from an overproducer strain agreed with that predicted by the nucleotide sequence. Northern (RNA) blot data indicated that dcp is not cotranscribed with other genes, and primer extension analysis showed the start of transcription to be 22 bases upstream of the translational start. The amino acid sequence of dcp was not similar to that of a mammalian dipeptidyl carboxypeptidase, angiotensin I-converting enzyme, but showed striking similarities to the amino acid sequence of another S. typhimurium peptidase encoded by the opdA (formerly optA) gene. Images PMID:1537804

  11. Remote access to ACNUC nucleotide and protein sequence databases at PBIL.

    PubMed

    Gouy, Manolo; Delmotte, Stéphane

    2008-04-01

    The ACNUC biological sequence database system provides powerful and fast query and extraction capabilities to a variety of nucleotide and protein sequence databases. The collection of ACNUC databases served by the Pôle Bio-Informatique Lyonnais includes the EMBL, GenBank, RefSeq and UniProt nucleotide and protein sequence databases and a series of other sequence databases that support comparative genomics analyses: HOVERGEN and HOGENOM containing families of homologous protein-coding genes from vertebrate and prokaryotic genomes, respectively; Ensembl and Genome Reviews for analyses of prokaryotic and of selected eukaryotic genomes. This report describes the main features of the ACNUC system and the access to ACNUC databases from any internet-connected computer. Such access was made possible by the definition of a remote ACNUC access protocol and the implementation of Application Programming Interfaces between the C, Python and R languages and this communication protocol. Two retrieval programs for ACNUC databases, Query_win, with a graphical user interface and raa_query, with a command line interface, are also described. Altogether, these bioinformatics tools provide users with either ready-to-use means of querying remote sequence databases through a variety of selection criteria, or a simple way to endow application programs with an extensive access to these databases. Remote access to ACNUC databases is open to all and fully documented (http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html). PMID:17825976

  12. Pattern recognition and probabilistic measures in alignment-free sequence analysis.

    PubMed

    Schwende, Isabel; Pham, Tuan D

    2014-05-01

    With the massive production of genomic and proteomic data, the number of available biological sequences in databases has reached a level that is not feasible anymore for exact alignments even when just a fraction of all sequences is used. To overcome this inevitable time complexity, ultrafast alignment-free methods are studied. Within the past two decades, a broad variety of nonalignment methods have been proposed including dissimilarity measures on classical representations of sequences like k-words or Markov models. Furthermore, articles were published that describe distance measures on alternative representations such as compression complexity, spectral time series or chaos game representation. However, alignments are still the standard method for real world applications in biological sequence analysis, and the time efficient alignment-free approaches are usually applied in cases when the accustomed algorithms turn out to fail or be too inconvenient. PMID:24096012

  13. Nucleotide sequence analysis of a cloned DNA fragment from human cells reveals homology to retrotransposons.

    PubMed Central

    Flügel, R M; Maurer, B; Bannert, H; Rethwilm, A; Schnitzler, P; Darai, G

    1987-01-01

    During molecular cloning of proviral DNA of human spumaretrovirus, various recombinant clones were established and analyzed. Blot hybridization revealed that one of the recombinant plasmids had the characteristic features of a member of the long interspersed repetitive sequences family. The DNA element was analyzed by restriction mapping and nucleotide sequencing. It showed a high degree of amino acid sequence homology of 54.3% when compared with the 5'-terminal part of the pol gene product of the murine retrotransposon LIMd. The 3' region of the cloned DNA element encodes proteins with an even higher degree of homology of 67.4% in comparison to the corresponding parts of a member of the primate KpnI sequence family. Images PMID:3031462

  14. Nucleotide sequence of the L1 ribosomal protein gene of Xenopus laevis: remarkable sequence homology among introns.

    PubMed Central

    Loreni, F; Ruberti, I; Bozzoni, I; Pierandrei-Amaldi, P; Amaldi, F

    1985-01-01

    Ribosomal protein L1 is encoded by two genes in Xenopus laevis. The comparison of two cDNA sequences shows that the two L1 gene copies (L1a and L1b) have diverged in many silent sites and very few substitution sites; moreover a small duplication occurred at the very end of the coding region of the L1b gene which thus codes for a product five amino acids longer than that coded by L1a. Quantitatively the divergence between the two L1 genes confirms that a whole genome duplication took place in Xenopus laevis approximately 30 million years ago. A genomic fragment containing one of the two L1 gene copies (L1a), with its nine introns and flanking regions, has been completely sequenced. The 5' end of this gene has been mapped within a 20-pyridimine stretch as already found for other vertebrate ribosomal protein genes. Four of the nine introns have a 60-nucleotide sequence with 80% homology; within this region some boxes, one of which is 16 nucleotides long, are 100% homologous among the four introns. This feature of L1a gene introns is interesting since we have previously shown that the activity of this gene is regulated at a post-transcriptional level and it involves the block of the normal splicing of some intron sequences. Images Fig. 3. Fig. 5. PMID:3841512

  15. Escherichia coli thymidylate kinase: molecular cloning, nucleotide sequence, and genetic organization of the corresponding tmk locus.

    PubMed Central

    Reynes, J P; Tiraby, M; Baron, M; Drocourt, D; Tiraby, G

    1996-01-01

    Thymidylate kinase (dTMP kinase; EC 2.7.4.9) catalyzes the phosphorylation of dTMP to form dTDP in both de novo and salvage pathways of dTTP synthesis. The nucleotide sequence of the tmk gene encoding this essential Escherichia coli enzyme is the last one among all the E. coli nucleoside and nucleotide kinase genes which has not yet been reported. By subcloning the 24.0-min region where the tmk gene has been previously mapped from the lambda phage 236 (E9G1) of the Kohara E. coli genomic library (Y. Kohara, K. Akiyama, and K. Isono, Cell 50:495-508, 1987), we precisely located tmk between acpP and holB genes. Here we report the nucleotide sequence of tmk, including the end portion of an upstream open reading frame (ORF 340) of unknown function that may be cotranscribed with the pabC gene. The tmk gene was located clockwise of and just upstream of the holB gene. Our sequencing data allowed the filling in of the unsequenced gap between the acpP and holB genes within the 24-min region of the E. coli chromosome. Identification of this region as the E. coli tmk gene was confirmed by functional complementation of a yeast dTMP kinase temperature-sensitive mutant and by in vitro enzyme assay of the thymidylate kinase activity in cell extracts of E. coli by use of tmk-overproducing plasmids. The deduced amino acid sequence of the E. coli tmk gene showed significant similarity to the sequences of the thymidylate kinases of vertebrates, yeasts, and viruses as well as two uncharacterized proteins of bacteria belonging to Bacillus and Haemophilus species. PMID:8631667

  16. Nucleotide sequence and characterization of the pyrF operon of Escherichia coli K12.

    PubMed

    Turnbough, C L; Kerr, K H; Funderburg, W R; Donahue, J P; Powell, F E

    1987-07-25

    The pyrF gene of Escherichia coli K12, which encodes the pyrimidine biosynthetic enzyme orotidine-5'-monophosphate (OMP) decarboxylase, is part of an operon that includes a downstream gene designated orfF. The orfF gene product is a small polypeptide of unknown function. The nucleotide sequence of a 1549-base pair chromosomal fragment containing this operon was determined. An open reading frame capable of encoding the 27-kDa OMP decarboxylase subunit was identified and shown to be the pyrF structural gene by purifying and characterizing OMP decarboxylase. The subunit molecular weight (Mr = 26,350), amino-terminal amino acid sequence, and amino acid composition of the polypeptide predicted from the nucleotide sequence are in excellent agreement with those properties determined for the purified enzyme. The orfF structural gene was tentatively identified and apparently encodes an 11,396-dalton polypeptide. The orfF translational initiation codon overlaps the pyrF termination codon, which may indicate translational coupling in the expression of these genes. The pyrF promoter was mapped by primer extension of in vivo transcripts. The primary transcriptional initiation site is 51 base pairs upstream of the pyrF structural gene. The level of pyrF transcription and OMP decarboxylase synthesis was found to be coordinately derepressed by pyrimidine limitation, indicating that regulation of pyrF gene expression occurs at the transcriptional level. Inspection of the nucleotide sequence indicates that pyrF gene expression is not regulated by an attenuation control mechanism similar to that described for the pyrBI operon or pyrE gene. Finally, we compared the amino acid sequences of the OMP decarboxylases from E. coli, Saccharomyces cerevisiae, Neurospora crassa, and Ehrlich ascites cells to identify conserved regions. PMID:2956254

  17. Nucleotide sequence analysis of the hypervariable region III of mitochondrial DNA in Thais.

    PubMed

    Thongngam, Punlop; Leewattanapasuk, Worraanong; Bhoopat, Tanin; Sangthong, Padchanee

    2016-07-01

    This study analyzed the nucleotide sequences of the hypervariable region III (HVRIII) of mitochondrial DNA in Thai individuals. Buccal swab samples were randomly obtained from 100 healthy, unrelated, adult (18-60 years old), volunteer donors living in Thailand. Eighteen different haplotypes were found, of which 11 haplotypes were unique. The most frequent haplotypes observed were 522D-523D. Nucleotide transition from Thymine (T) to Cytosine (C) at position 489 (43%) was the most frequent substitution. Nucleotide transversions were also observed at position 433 (Adenine (A) to C, 1%) and position 499 (Guanine (G) to C, 1%). Fifty-three samples presented nucleotide insertion and deletion of C and A (CA) at position 514-523. Insertion of 1AC (3%) and 2AC (2%) were observed. Deletion of 1CA (53%) and 2CA (2%) at position 514-523 were revealed. The deletion of T at position 459 was observed. The haplotype diversity, random match probability, and discrimination power were calculated to be 0.7770, 0.2308, and 0.7692, respectively. PMID:27107562

  18. Nucleotide sequence variation of chitin synthase genes among ectomycorrhizal fungi and its potential use in taxonomy.

    PubMed Central

    Mehmann, B; Brunner, I; Braus, G H

    1994-01-01

    DNA sequences of single-copy genes coding for chitin synthases (UDP-N-acetyl-D-glucosamine:chitin 4-beta-N-acetylglucosaminyltransferase; EC 2.4.1.16) were used to characterize ectomycorrhizal fungi. Degenerate primers deduced from short, completely conserved amino acid stretches flanking a region of about 200 amino acids of zymogenic chitin synthases allowed the amplification of DNA fragments of several members of this gene family. Different DNA band patterns were obtained from basidiomycetes because of variation in the number and length of amplified fragments. Cloning and sequencing of the most prominent DNA fragments revealed that these differences were due to various introns at conserved positions. The presence of introns in basidiomycetous fungi therefore has a potential use in identification of genera by analyzing PCR-generated DNA fragment patterns. Analyses of the nucleotide sequences of cloned fragments revealed variations in nucleotide sequences from 4 to 45%. By comparison of the deduced amino acid sequences, the majority of the DNA fragments were identified as members of genes for chitin synthase class II. The deduced amino acid sequences from species of the same genus differed only in one amino acid residue, whereas identity between the amino acid sequences of ascomycetous and basidiomycetous fungi within the same taxonomic class was found to be approximately 43 to 66%. Phylogenetic analysis of the amino acid sequence of class II chitin synthase-encoding gene fragments by using parsimony confirmed the current taxonomic groupings. In addition, our data revealed a fourth class of putative zymogenic chitin synthesis. Images PMID:7944356

  19. CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences

    PubMed Central

    2012-01-01

    Background It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding for protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to solve this problem based on codon-shuffling have been limited to exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations. Results We present a novel O(N(log N)2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under- represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various size, including ChIP-seq data for the transcription factors NRSF and GABP. Conclusions CodingMotif provides a theoretically and empirically-demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar PMID

  20. Predicting and improving the protein sequence alignment quality by support vector regression

    PubMed Central

    Lee, Minho; Jeong, Chan-seok; Kim, Dongsup

    2007-01-01

    Background For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It has been known that the alignment accuracy can vary significantly depending on our choice of various alignment parameters such as gap opening penalty and gap extension penalty. Because the accuracy of sequence alignment is typically measured by comparing it with its corresponding structure alignment, there is no good way of evaluating alignment accuracy without knowing the structure of a query protein, which is obviously not available at the time of structure prediction. Moreover, there is no universal alignment parameter option that would always yield the optimal alignment. Results In this work, we develop a method to predict the quality of the alignment between a query and a template. We train the support vector regression (SVR) models to predict the MaxSub scores as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into a (n + 1)-dimensional feature vector, then it is used as an input to predict the alignment quality by the trained SVR model. Performance of our work is evaluated by various measures including Pearson correlation coefficient between the observed and predicted MaxSub scores. Result shows high correlation coefficient of 0.945. For a pair of query and template, 48 alignments are generated by changing alignment options. Trained SVR models are then applied to predict the MaxSub scores of those and to select the best alignment option which is chosen specifically to the query-template pair. This adaptive selection procedure results in 7.4% improvement of MaxSub scores, compared to those when the single best parameter option is used for all query-template pairs. Conclusion The present work demonstrates that the alignment quality can be

  1. Studying long 16S rDNA sequences with ultrafast-metagenomic sequence classification using exact alignments (Kraken).

    PubMed

    Valenzuela-González, Fabiola; Martínez-Porchas, Marcel; Villalpando-Canchola, Enrique; Vargas-Albores, Francisco

    2016-03-01

    Ultrafast-metagenomic sequence classification using exact alignments (Kraken) is a novel approach to classify 16S rDNA sequences. The classifier is based on mapping short sequences to the lowest ancestor and performing alignments to form subtrees with specific weights in each taxon node. This study aimed to evaluate the classification performance of Kraken with long 16S rDNA random environmental sequences produced by cloning and then Sanger sequenced. A total of 480 clones were isolated and expanded, and 264 of these clones formed contigs (1352 ± 153 bp). The same sequences were analyzed using the Ribosomal Database Project (RDP) classifier. Deeper classification performance was achieved by Kraken than by the RDP: 73% of the contigs were classified up to the species or variety levels, whereas 67% of these contigs were classified no further than the genus level by the RDP. The results also demonstrated that unassembled sequences analyzed by Kraken provide similar or inclusively deeper information. Moreover, sequences that did not form contigs, which are usually discarded by other programs, provided meaningful information when analyzed by Kraken. Finally, it appears that the assembly step for Sanger sequences can be eliminated when using Kraken. Kraken cumulates the information of both sequence senses, providing additional elements for the classification. In conclusion, the results demonstrate that Kraken is an excellent choice for use in the taxonomic assignment of sequences obtained by Sanger sequencing or based on third generation sequencing, of which the main goal is to generate larger sequences. PMID:26812576

  2. A statistical physics perspective on alignment-independent protein sequence comparison

    PubMed Central

    Chattopadhyay, Amit K.; Nasiev, Diar; Flower, Darren R.

    2015-01-01

    Motivation: Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-base similarity has performed poorly. Results: Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from ‘first passage probability distribution’ to summarize statistics of ensemble averaged amino acid propensity values. In this article, we introduce and elaborate this approach. Contact: d.r.flower@aston.ac.uk PMID:25810434

  3. The nucleotide sequence of an infectious clone of the geminivirus beet curly top virus.

    PubMed

    Stanley, J; Markham, P G; Callis, R J; Pinner, M S

    1986-08-01

    A number of infectious clones of a Californian isolate of the leafhopper-transmitted geminivirus beet curly top virus (BCTV) have been constructed from virus-specific double-stranded DNA isolated from infected Beta vulgaris and used to demonstrate a single component genome. The nucleotide sequence of one infectious clone has been determined (2993 nucleotides). Comparison with other geminiviruses has shown that the organisation of the genome closely resembles DNA 1 of the whitefly-transmitted members. The four conserved coding regions of DNA 1 have highly homologous counterparts in BCTV with the exception of the putative coat protein which is more closely related to those of the leafhopper-transmitted geminiviruses suggesting a strong interrelationship between coat protein and insect vector. A BCTV component equivalent to DNA 2 is not required for virus infection or transmission and has not been isolated from infected plants. PMID:16453696

  4. Nucleotide sequence analysis of the L gene of Newcastle disease virus: homologies with Sendai and vesicular stomatitis viruses.

    PubMed Central

    Yusoff, K; Millar, N S; Chambers, P; Emmerson, P T

    1987-01-01

    The nucleotide sequence of the L gene of the Beaudette C strain of Newcastle disease virus (NDV) has been determined. The L gene is 6704 nucleotides long and encodes a protein of 2204 amino acids with a calculated molecular weight of 248822. Mung bean nuclease mapping of the 5' terminus of the L gene mRNA indicates that the transcription of the L gene is initiated 11 nucleotides upstream of the translational start site. Comparison with the amino acid sequences of the L genes of Sendai virus and vesicular stomatitis virus (VSV) suggests that there are several regions of homology between the sequences. These data provide further evidence for an evolutionary relationship between the Paramyxoviridae and the Rhabdoviridae. A non-coding sequence of 46 nucleotides downstream of the presumed polyadenylation site of the L gene may be part of a negative strand leader RNA. Images PMID:3035486

  5. Developing Single Nucleotide Polymorphism (SNP) markers from transcriptome sequences for the identification of longan (Dimocarpus longan) germplasm

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Longan (Dimocarpus longan Lour.) is an important tropical fruit tree crop. Accurate varietal identification is essential for germplasm management and breeding. Using longan transcriptome sequences from public databases, we developed single nucleotide polymorphism (SNP) markers; validated 60 SNPs in...

  6. Conservation of nucleotide sequences for molecular diagnosis of Middle East respiratory syndrome coronavirus, 2015.

    PubMed

    Furuse, Yuki; Okamoto, Michiko; Oshitani, Hitoshi

    2015-11-01

    Infection due to the Middle East respiratory syndrome coronavirus (MERS-CoV) is widespread. The present study was performed to assess the protocols used for the molecular diagnosis of MERS-CoV by analyzing the nucleotide sequences of viruses detected between 2012 and 2015, including sequences from the large outbreak in eastern Asia in 2015. Although the diagnostic protocols were established only 2 years ago, mismatches between the sequences of primers/probes and viruses were found for several of the assays. Such mismatches could lead to a lower sensitivity of the assay, thereby leading to false-negative diagnosis. A slight modification in the primer design is suggested. Protocols for the molecular diagnosis of viral infections should be reviewed regularly after they are established, particularly for viruses that pose a great threat to public health such as MERS-CoV. PMID:26432410

  7. Nucleotide sequence of the tcml gene (ribosomal protein L3) of Saccharomyces cerevisiae.

    PubMed Central

    Schultz, L D; Friesen, J D

    1983-01-01

    The yeast tcml gene, which codes for ribosomal protein L3, has been isolated by using recombinant DNA and genetic complementation. The DNA fragment carrying this gene has been subcloned and we have determined its DNA sequence. The 20 amino acid residues at the amino terminus as inferred from the nucleotide sequence agreed exactly with the amino acid sequence data. The amino acid composition of the encoded protein agreed with that determined for purified ribosomal protein L3. Codon usage in the tcml gene was strongly biased in the direction found for several other abundant Saccharomyces cerevisiae proteins. The tcml gene has no introns, which appears to be atypical of ribosomal protein structural genes. PMID:6305925

  8. Identification of shark species in seafood products by forensically informative nucleotide sequencing (FINS).

    PubMed

    Blanco, M; Pérez-Martín, R I; Sotelo, C G

    2008-11-12

    The identification of commercial shark species is a relevant issue to ensure the correct labeling of seafood products, to maintain consumer confidence in seafood, and to enhance the knowledge of the species and volumes that are at present being captured, thus improving the management of shark fisheries. The polymerase chain reaction was employed to obtain a 423 bp amplicon from the mitochondrial cytochrome b gene. The sequences from this fragment, belonging to 63 authentic individuals of 23 species, were analyzed using a genetic distance method. Nine different samples of commercial fresh, frozen, and convenience food were obtained in local and international markets to validate the methodology. These samples were analyzed, and sequences were employed for species identification, showing that forensically informative nucleotide sequencing (FINS) is a suitable technique for identification of processed seafood containing shark as an ingredient. The results also showed that incorrect labeling practices may occur regarding shark products, probably because of incorrect labeling at the production point. PMID:18831561

  9. Fast multiple alignment of ungapped DNA sequences using information theory and a relaxation method.

    PubMed

    Schneider, Thomas D; Mastronarde, David N

    1996-12-01

    An information theory based multiple alignment ("Malign") method was used to align the DNA binding sequences of the OxyR and Fis proteins, whose sequence conservation is so spread out that it is difficult to identify the sites. In the algorithm described here, the information content of the sequences is used as a unique global criterion for the quality of the alignment. The algorithm uses look-up tables to avoid recalculating computationally expensive functions such as the logarithm. Because there are no arbitrary constants and because the results are reported in absolute units (bits), the best alignment can be chosen without ambiguity. Starting from randomly selected alignments, a hill-climbing algorithm can track through the immense space of s(n) combinations where s is the number of sequences and n is the number of positions possible for each sequence. Instead of producing a single alignment, the algorithm is fast enough that one can afford to use many start points and to classify the solutions. Good convergence is indicated by the presence of a single well-populated solution class having higher information content than other classes. The existence of several distinct classes for the Fis protein indicates that those binding sites have self-similar features. PMID:19953199

  10. SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data

    PubMed Central

    2016-01-01

    Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state of the art aligners take advantage of parallelization strategies. However, the existent solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology as Spark to boost the performance of one of the most widely adopted aligner, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license. PMID:27182962

  11. SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data.

    PubMed

    Abuín, José M; Pichel, Juan C; Pena, Tomás F; Amigo, Jorge

    2016-01-01

    Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state of the art aligners take advantage of parallelization strategies. However, the existent solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology as Spark to boost the performance of one of the most widely adopted aligner, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license. PMID:27182962

  12. Nucleotide and predicted amino acid sequences of cloned human and mouse preprocathepsin B cDNAs.

    PubMed Central

    Chan, S J; San Segundo, B; McCormick, M B; Steiner, D F

    1986-01-01

    Cathepsin B is a lysosomal thiol proteinase that may have additional extralysosomal functions. To further our investigations on the structure, mode of biosynthesis, and intracellular sorting of this enzyme, we have determined the complete coding sequences for human and mouse preprocathepsin B by using cDNA clones isolated from human hepatoma and kidney phage libraries. The nucleotide sequences predict that the primary structure of preprocathepsin B contains 339 amino acids organized as follows: a 17-residue NH2-terminal prepeptide sequence followed by a 62-residue propeptide region, 254 residues in mature (single chain) cathepsin B, and a 6-residue extension at the COOH terminus. A comparison of procathepsin B sequences from three species (human, mouse, and rat) reveals that the homology between the propeptides is relatively conserved with a minimum of 68% sequence identity. In particular, two conserved sequences in the propeptide that may be functionally significant include a potential glycosylation site and the presence of a single cysteine at position 59. Comparative analysis of the three sequences also suggests that processing of procathepsin B is a multistep process, during which enzymatically active intermediate forms may be generated. The availability of the cDNA clones will facilitate the identification of possible active or inactive intermediate processive forms as well as studies on the transcriptional regulation of the cathepsin B gene. PMID:3463996

  13. Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference

    PubMed Central

    Tan, Ge; Muffato, Matthieu; Ledergerber, Christian; Herrero, Javier; Goldman, Nick; Gil, Manuel; Dessimoz, Christophe

    2015-01-01

    Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms. PMID:26031838

  14. Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference.

    PubMed

    Tan, Ge; Muffato, Matthieu; Ledergerber, Christian; Herrero, Javier; Goldman, Nick; Gil, Manuel; Dessimoz, Christophe

    2015-09-01

    Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms. PMID:26031838

  15. Alignment-free comparison of genome sequences by a new numerical characterization.

    PubMed

    Huang, Guohua; Zhou, Houqing; Li, Yongfan; Xu, Lixin

    2011-07-21

    In order to compare different genome sequences, an alignment-free method has proposed. First, we presented a new graphical representation of DNA sequences without degeneracy, which is conducive to intuitive comparison of sequences. Then, a new numerical characterization based on the representation was introduced to quantitatively depict the intrinsic nature of genome sequences, and considered as a 10-dimensional vector in the mathematical space. Alignment-free comparison of sequences was performed by computing the distances between vectors of the corresponding numerical characterizations, which define the evolutionary relationship. Two data sets of DNA sequences were constructed to assess the performance on sequence comparison. The results illustrate well validity of the method. The new numerical characterization provides a powerful tool for genome comparison. PMID:21536050

  16. Cloning and nucleotide sequence of the simian rotavirus gene 6 that codes for the major inner capsid protein.

    PubMed Central

    Estes, M K; Mason, B B; Crawford, S; Cohen, J

    1984-01-01

    The nucleotide sequence of the gene that codes for the major inner capsid protein of the simian rotavirus SA11 has been determined. A DNA copy of mRNA from gene 6 was cloned in the E. coli plasmid pBR322. The full-length gene is 1357 nucleotides long with a 5'-noncoding region of 23 nucleotides and a 3'-noncoding region of 140 nucleotides. The gene contains a single, long, open reading-frame of 1194 nucleotides capable of coding for a protein of 397 amino acids with a molecular weight of 44,816. The predicted protein product is relatively proline-rich with a net charge at neutral pH of -3.5. One stretch of 53 amino acids (encoded by nucleotides 327-485) is basic. Images PMID:6322125

  17. Nucleotide sequence of the gene encoding the repressor for the histidine utilization genes of Pseudomonas putida.

    PubMed Central

    Allison, S L; Phillips, A T

    1990-01-01

    The hutC gene of Pseudomonas putida encodes a repressor which, in combination with the inducer urocanate, regulates expression of the five structural genes necessary for conversion of histidine to glutamate, ammonia, and formate. The nucleotide sequence of the hutC region was determined and found to contain two open reading frames which overlapped by one nucleotide. The first open reading frame (ORF1) appeared to encode a 27,648-dalton protein of 248 amino acids whose sequence strongly resembled that of the hut repressor of Klebsiella aerogenes (A. Schwacha and R. A. Bender, J. Bacteriol. 172:5477-5481, 1990) and contained a helix-turn-helix motif that could be involved in operator binding. The gene was preceded by a sequence which was nearly identical to that of the operator site located upstream of hutU which controls transcription of the hutUHIG genes. The operator near hutC would presumably allow the hut repressor to regulate its own synthesis as well as the expression of the divergent hutF gene. A second open reading frame (ORF2) would encode a 21,155-dalton protein, but because this region could be deleted with only a slight effect on repressor activity, it is not likely to be involved in repressor function or structure. PMID:2203753

  18. Structure and nucleotide sequence of the rat intestinal vitamin D-dependent calcium binding protein gene.

    PubMed Central

    Krisinger, J; Darwish, H; Maeda, N; DeLuca, H F

    1988-01-01

    The vitamin D-dependent intestinal calcium binding protein (ICaBP, 9 kDa) is under transcriptional regulation by 1,25-dihydroxyvitamin D3 [1,25-(OH)2D3], the hormonal active form of the vitamin. To study the mechanism of gene regulation by 1,25-(OH)2D3, we isolated the rat ICaBP gene by using a cDNA probe. Its nucleotide sequence revealed 3 exons separated by 2 introns within approximately 3 kilobases. The first exon represents only noncoding sequences, while the second and third encode the two calcium binding domains of the protein. The gene contains a 15-base-pair imperfect palindrome in the first intron that shows high homology to the estrogen-responsive element. This sequence may represent the vitamin D-responsive element involved in the regulation of the ICaBP gene. The second intron shows an 84-base-pair-long simple nucleotide repeat that implicates Z-DNA formation. Genomic Southern analysis shows that the rat gene is represented as a single copy. Images PMID:3194402

  19. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments

    DOE PAGESBeta

    Daily, Jeffrey A.

    2016-02-10

    Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates permore » second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar’s ’striped’ approach. When using only a single thread, parasail was 1.7 times faster than Rognes’s SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.« less

  20. SINA: Accurate high-throughput multiple sequence alignment of ribosomal RNA genes

    PubMed Central

    Pruesse, Elmar; Peplies, Jörg; Glöckner, Frank Oliver

    2012-01-01

    Motivation: In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements. Results: In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands. SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1 accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks. Availability: Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license. Contact: epruesse@mpi-bremen.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22556368

  1. Nucleotide sequence of the mRNA encoding the pre-alpha-subunit of mouse thyrotropin.

    PubMed Central

    Chin, W W; Kronenberg, H M; Dee, P C; Maloof, F; Habener, J F

    1981-01-01

    We have constructed and cloned in bacteria recombinant DNA molecules containing DNA sequences coding for the precursor of the alpha subunit of thyrotropin (pre-TSH-alpha). Double-stranded DNA complementary to total poly(A)+RNA derived from a mouse pituitary thyrotropic tumor was prepared enzymatically, inserted into the Pst I site of the plasmid pBR322 by using poly(dC).poly(dG) homopolymeric extensions, and cloned in Escherichia coli chi 1776. Cloned cDNAs encoding pre-TSH-alpha were identified by their hybridization to pre-TSH-alpha mRNA as determined by cell-free translations of hybrid-selected and hybrid-arrested RNA. The nucleotide sequences of two cDNAs (510 and 480 base pairs) were determined with chemical methods and corresponded to much of the region coding for the alpha subunit and the 3' untranslated region of pre-TSH-alpha mRNA. The sequence of the 5' end of the mRNA was determined from cDNA synthesized by using total mRNA as template and a restriction enzyme DNA fragment as primer. Together these sequences represented greater than 90% of the coding and noncoding regions of full-length pre-TSH-alpha mRNA, which was determined to be 800 bases long. The amino acid sequence of the pre-TSH-alpha deduced from the nucleotide sequence showed a NH2-terminal leader sequence of 24 amino acids followed by the 96-amino-acid sequence of the apoprotein of TSH-alpha. There is greater than 90% homology in the amino acid sequences among the murine, ruminant, and porcine alpha subunits and 75-80% homology among the murine, equine, and human alpha subunits. Several regions of the sequence remain absolutely conserved among all species, suggesting that these particular regions are essential for the biological function of the subunit. The successful cloning of the alpha subunit of TSH will permit further studies of the organization of the genes coding for the glycoprotein hormone subunits and the regulation of their expression. Images PMID:6272299

  2. A distributed system for fast alignment of next-generation sequencing data

    PubMed Central

    Srimani, Jaydeep K.; Wu, Po-Yen; Phan, John H.; Wang, May D.

    2016-01-01

    We developed a scalable distributed computing system using the Berkeley Open Interface for Network Computing (BOINC) to align next-generation sequencing (NGS) data quickly and accurately. NGS technology is emerging as a promising platform for gene expression analysis due to its high sensitivity compared to traditional genomic microarray technology. However, despite the benefits, NGS datasets can be prohibitively large, requiring significant computing resources to obtain sequence alignment results. Moreover, as the data and alignment algorithms become more prevalent, it will become necessary to examine the effect of the multitude of alignment parameters on various NGS systems. We validate the distributed software system by (1) computing simple timing results to show the speed-up gained by using multiple computers, (2) optimizing alignment parameters using simulated NGS data, and (3) computing NGS expression levels for a single biological sample using optimal parameters and comparing these expression levels to that of a microarray sample. Results indicate that the distributed alignment system achieves approximately a linear speed-up and correctly distributes sequence data to and gathers alignment results from multiple compute clients.

  3. Complete Nucleotide Sequence of a Conjugative Plasmid Carrying blaPER-1

    PubMed Central

    Li, Ruichao; Zhou, Yuanjie; Chan, Edward Wai-chi

    2015-01-01

    The nucleotide sequence of a self-transmissible plasmid pVPH1 harboring blaPER-1 from Vibrio parahaemolyticus was determined. pVPH1 was 183,730 bp in size and shared a backbone similar to pAQU1 and pAQU2, differing mainly in an ∼40-kb multidrug resistance (MDR) region. A complex class 1 integron was identified together with ISCR1 and blaPER-1 (ISCR1-blaPER-1-gst-abct-qacEΔ1-sul1), which was shown to form a circular intermediate playing an important role in the dissemination of blaPER-1. PMID:25779581

  4. Nucleotide sequence and organization of copper resistance genes from Pseudomonas syringae pv. tomato

    SciTech Connect

    Mellano, M.A.; Cooksey, D.A.

    1988-06-01

    The nucleotide sequence of a 4.5-kilobase copper resistance determinant from Pseudomonas syringae pv. tomato revealed four open reading frames (ORFs) in the same orientation. Deletion and site-specific mutational analyses indicated that the first two ORFs were essential for copper resistance; the last two ORFs were required for full resistance, but low-level resistance could be conferred in their absence. Five highly conserved, direct 24-base repeats were found near the beginning of the second ORF, and a similar, but less conserved, repeated region was found in the middle of the first ORF.

  5. Within-Host Nucleotide Diversity of Virus Populations: Insights from Next-Generation Sequencing

    PubMed Central

    Nelson, Chase W.; Hughes, Austin L.

    2014-01-01

    Next-generation sequencing (NGS) technology offers new opportunities for understanding the evolution and dynamics of viral populations within individual hosts over the course of infection. We review simple methods for estimating synonymous and nonsynonymous nucleotide diversity in viral genes from NGS data without the need for inferring linkage. We discuss the potential usefulness of these data for addressing questions of both practical and theoretical interest, including fundamental questions regarding the effective population sizes of within-host viral populations and the modes of natural selection acting on them. PMID:25481279

  6. The Complete Nucleotide Sequence of the Mitochondrial Genome of Bactrocera minax (Diptera: Tephritidae)

    PubMed Central

    Zhang, Bin; Nardi, Francesco; Hull-Sanders, Helen; Wan, Xuanwu; Liu, Yinghong

    2014-01-01

    The complete 16,043 bp mitochondrial genome (mitogenome) of Bactrocera minax (Diptera: Tephritidae) has been sequenced. The genome encodes 37 genes usually found in insect mitogenomes. The mitogenome information for B. minax was compared to the homologous sequences of Bactrocera oleae, Bactrocera tryoni, Bactrocera philippinensis, Bactrocera carambolae, Bactrocera papayae, Bactrocera dorsalis, Bactrocera correcta, Bactrocera cucurbitae and Ceratitis capitata. The analysis indicated the structure and organization are typical of, and similar to, the nine closely related species mentioned above, although it contains the lowest genome-wide A+T content (67.3%). Four short intergenic spacers with a high degree of conservation among the nine tephritid species mentioned above and B. minax were observed, which also have clear counterparts in the control regions (CRs). Correlation analysis among these ten tephritid species revealed close positive correlation between the A+T content of zero-fold degenerate sites (P0FD), the ratio of nucleotide substitution frequency at P0FD sites to all degenerate sites (zero-fold degenerate sites, two-fold degenerate sites and four-fold degenerate sites) and amino acid sequence distance (ASD) were found. Further, significant positive correlation was observed between the A+T content of four-fold degenerate sites (P4FD) and the ratio of nucleotide substitution frequency at P4FD sites to all degenerate sites; however, we found significant negative correlation between ASD and the A+T content of P4FD, and the ratio of nucleotide substitution frequency at P4FD sites to all degenerate sites. A higher nucleotide substitution frequency at non-synonymous sites compared to synonymous sites was observed in nad4, the first time that has been observed in an insect mitogenome. A poly(T) stretch at the 5′ end of the CR followed by a [TA(A)]n-like stretch was also found. In addition, a highly conserved G+A-rich sequence block was observed in front of the

  7. Nucleotide sequence of the gene encoding the two-subunit pilin of Bacteroides nodosus 265.

    PubMed Central

    Elleman, T C; Hoyne, P A; McKern, N M; Stewart, D J

    1986-01-01

    The nucleotide sequence of the gene encoding pilin from Bacteroides nodosus 265 has been determined. The pilin is encoded by a single-copy gene, from which can be predicted a prepilin comprising a single protein chain of Mr 16,637. The prepilin sequence differs in several respects from the mature protein sequence. Seven additional N-terminal amino acid residues are present in prepilin, whereas residue 8, phenylalanine, undergoes posttranslational modification to become the N-methylated amino-terminal residue of mature pilin. In addition, further processing occurs through internal cleavage to produce two noncovalently linked subunits characteristic of pilins from serogroup H of B. nodosus, of which strain 265 is a member. The position of cleavage has been identified between alanine residues at positions 72 and 73 of the mature 149-residue pilin protein. The predicted pilin sequence of B. nodosus 265 shows extensive N-terminal amino acid sequence homology with other pilins of the N-methylphenylalanine type. In addition this sequence also shows homology with these N-methylphenylalanine-type pilins in the C-terminal region of the molecule, especially with pilin from Pseudomonas aeruginosa PAK. Images PMID:2873127

  8. The complete nucleotide sequence and structure of the gene encoding bovine phenylethanolamine N-methyltransferase.

    PubMed

    Batter, D K; D'Mello, S R; Turzai, L M; Hughes, H B; Gioio, A E; Kaplan, B B

    1988-03-01

    A cDNA clone for bovine adrenal phenylethanolamine N-methyltransferase (PNMT) was used to screen a Charon 28 genomic library. One phage was identified, designated lambda P1, which included the entire PNMT gene. Construction of a restriction map, with subsequent Southern blot analysis, allowed the identification of exon-containing fragments. Dideoxy sequence analysis of these fragments, and several more further upstream, indicates that the bovine PNMT gene is 1,594 base pairs in length, consisting of three exons and two introns. The transcription initiation site was identified by two independent methods and is located approximately 12 base pairs upstream from the ATG translation start site. The 3' untranslated region is 88 base pairs in length and contains the expected polyadenylation signal (AATAAA). A putative promoter sequence (TATA box) is located about 25 base pairs upstream from the transcription initiation site. Computer comparison of the nucleotide sequence data with the consensus sequences of known regulatory elements revealed potential binding sites for glucocorticoid receptors and the Sp1 regulatory protein in the 5' flanking region of the gene. Additionally, comparison of the sequence of the exons of the PNMT gene with cDNA sequences for other enzymes involved in biogenic amine synthesis revealed no significant homology, indicating that PNMT is not a member of a multigene family of catecholamine biosynthetic enzymes. PMID:3379652

  9. Nucleotide sequence of the gene for the b subunit of human factor XIII

    SciTech Connect

    Bottenus, R.E.; Ichinose, A.; Davie, E.W. )

    1990-12-01

    Factor XIII (M{sub r} 320 000) is a blood coagulation factor that stabilizes and strengthens the fibrin clot. It circulates in blood as a tetramer composed of two a subunits (M{sub r} 75 000 each) and two b subunits (M{sub r} 80 000 each). The b subunit consists of 641 amino acids and includes 10 tandem repeats of 60 amino acids known as GP-I structures, short consensus repeats (SCR), or sushi domains. In the present study, the human gene for the b subunit has been isolated from three different genomic libraries prepared in {lambda} phage. Fifteen independent phage with inserts coding for the entire gene were isolated and characterized by restriction mapping, Southern blotting, and DNA sequencing. The gene was found to be 28 kilobases in length and consisted of 12 exons (I-XII) separated by 11 intervening sequences. The leader sequence was encoded by exon I, while the carbonyl-terminal region of the protein was encoded by exon XII. Exons II-XI each coded for a single sushi domain, suggesting that the gene evolved through exon shuffling and duplication. The 12 exons in the gene ranged in size from 64 to 222 base pairs, while the introns ranged in size from 87 to 9970 nucleotides and made up 92{percent} of the gene. One nucleotide change was found in the coding region of the gene when its sequence was compared to that of the cDNA. This difference, however, did not result in a change in the amino acid sequence of the protein.

  10. PyMod: sequence similarity searches, multiple sequence-structure alignments, and homology modeling within PyMOL

    PubMed Central

    2012-01-01

    Background In recent years, an exponential growing number of tools for protein sequence analysis, editing and modeling tasks have been put at the disposal of the scientific community. Despite the vast majority of these tools have been released as open source software, their deep learning curves often discourages even the most experienced users. Results A simple and intuitive interface, PyMod, between the popular molecular graphics system PyMOL and several other tools (i.e., [PSI-]BLAST, ClustalW, MUSCLE, CEalign and MODELLER) has been developed, to show how the integration of the individual steps required for homology modeling and sequence/structure analysis within the PyMOL framework can hugely simplify these tasks. Sequence similarity searches, multiple sequence and structural alignments generation and editing, and even the possibility to merge sequence and structure alignments have been implemented in PyMod, with the aim of creating a simple, yet powerful tool for sequence and structure analysis and building of homology models. Conclusions PyMod represents a new tool for the analysis and the manipulation of protein sequences and structures. The ease of use, integration with many sequence retrieving and alignment tools and PyMOL, one of the most used molecular visualization system, are the key features of this tool. Source code, installation instructions, video tutorials and a user's guide are freely available at the URL http://schubert.bio.uniroma1.it/pymod/index.html PMID:22536966

  11. Skeleton-based human action recognition using multiple sequence alignment

    NASA Astrophysics Data System (ADS)

    Ding, Wenwen; Liu, Kai; Cheng, Fei; Zhang, Jin; Li, YunSong

    2015-05-01

    Human action recognition and analysis is an active research topic in computer vision for many years. This paper presents a method to represent human actions based on trajectories consisting of 3D joint positions. This method first decompose action into a sequence of meaningful atomic actions (actionlets), and then label actionlets with English alphabets according to the Davies-Bouldin index value. Therefore, an action can be represented using a sequence of actionlet symbols, which will preserve the temporal order of occurrence of each of the actionlets. Finally, we employ sequence comparison to classify multiple actions through using string matching algorithms (Needleman-Wunsch). The effectiveness of the proposed method is evaluated on datasets captured by commodity depth cameras. Experiments of the proposed method on three challenging 3D action datasets show promising results.

  12. Support for linguistic macrofamilies from weighted sequence alignment

    PubMed Central

    Jäger, Gerhard

    2015-01-01

    Computational phylogenetics is in the process of revolutionizing historical linguistics. Recent applications have shed new light on controversial issues, such as the location and time depth of language families and the dynamics of their spread. So far, these approaches have been limited to single-language families because they rely on a large body of expert cognacy judgments or grammatical classifications, which is currently unavailable for most language families. The present study pursues a different approach. Starting from raw phonetic transcription of core vocabulary items from very diverse languages, it applies weighted string alignment to track both phonetic and lexical change. Applied to a collection of ∼1,000 Eurasian languages and dialects, this method, combined with phylogenetic inference, leads to a classification in excellent agreement with established findings of historical linguistics. Furthermore, it provides strong statistical support for several putative macrofamilies contested in current historical linguistics. In particular, there is a solid signal for the Nostratic/Eurasiatic macrofamily. PMID:26403857

  13. Support for linguistic macrofamilies from weighted sequence alignment.

    PubMed

    Jäger, Gerhard

    2015-10-13

    Computational phylogenetics is in the process of revolutionizing historical linguistics. Recent applications have shed new light on controversial issues, such as the location and time depth of language families and the dynamics of their spread. So far, these approaches have been limited to single-language families because they rely on a large body of expert cognacy judgments or grammatical classifications, which is currently unavailable for most language families. The present study pursues a different approach. Starting from raw phonetic transcription of core vocabulary items from very diverse languages, it applies weighted string alignment to track both phonetic and lexical change. Applied to a collection of ∼1,000 Eurasian languages and dialects, this method, combined with phylogenetic inference, leads to a classification in excellent agreement with established findings of historical linguistics. Furthermore, it provides strong statistical support for several putative macrofamilies contested in current historical linguistics. In particular, there is a solid signal for the Nostratic/Eurasiatic macrofamily. PMID:26403857

  14. The eSNV-detect: a computational system to identify expressed single nucleotide variants from transcriptome sequencing data.

    PubMed

    Tang, Xiaojia; Baheti, Saurabh; Shameer, Khader; Thompson, Kevin J; Wills, Quin; Niu, Nifang; Holcomb, Ilona N; Boutet, Stephane C; Ramakrishnan, Ramesh; Kachergus, Jennifer M; Kocher, Jean-Pierre A; Weinshilboum, Richard M; Wang, Liewei; Thompson, E Aubrey; Kalari, Krishna R

    2014-12-16

    Rapid development of next generation sequencing technology has enabled the identification of genomic alterations from short sequencing reads. There are a number of software pipelines available for calling single nucleotide variants from genomic DNA but, no comprehensive pipelines to identify, annotate and prioritize expressed SNVs (eSNVs) from non-directional paired-end RNA-Seq data. We have developed the eSNV-Detect, a novel computational system, which utilizes data from multiple aligners to call, even at low read depths, and rank variants from RNA-Seq. Multi-platform comparisons with the eSNV-Detect variant candidates were performed. The method was first applied to RNA-Seq from a lymphoblastoid cell-line, achieving 99.7% precision and 91.0% sensitivity in the expressed SNPs for the matching HumanOmni2.5 BeadChip data. Comparison of RNA-Seq eSNV candidates from 25 ER+ breast tumors from The Cancer Genome Atlas (TCGA) project with whole exome coding data showed 90.6-96.8% precision and 91.6-95.7% sensitivity. Contrasting single-cell mRNA-Seq variants with matching traditional multicellular RNA-Seq data for the MD-MB231 breast cancer cell-line delineated variant heterogeneity among the single-cells. Further, Sanger sequencing validation was performed for an ER+ breast tumor with paired normal adjacent tissue validating 29 out of 31 candidate eSNVs. The source code and user manuals of the eSNV-Detect pipeline for Sun Grid Engine and virtual machine are available at http://bioinformaticstools.mayo.edu/research/esnv-detect/. PMID:25352556

  15. Nucleotide sequence and structural features of a novel US-a junction present in a defective herpes simplex virus genome.

    PubMed Central

    Mocarski, E S; Deiss, L P; Frenkel, N

    1985-01-01

    Defective genomes generated during serial propagation of herpes simplex virus type 1 (Justin) consist of tandem reiterations of sequences that are colinear with a portion of the S component of the standard viral genome. We determined the structure of the novel US-a junction, at which the US sequences of one repeat unit join the a sequences of the adjacent repeat unit. Comparison of the nucleotide sequence at this junction with the nucleotide sequence of the corresponding US region of the standard virus genome indicated that the defective genome repeat unit arose by a single recombinational event between an L-S junction a sequence and the US region. The recombinational process might have been mediated by limited sequence homology. The sequences retained within the US-a junction further define the signal for cleavage and packaging of viral DNA. PMID:2989551

  16. Flexible, Fast and Accurate Sequence Alignment Profiling on GPGPU with PaSWAS

    PubMed Central

    Warris, Sven; Yalcin, Feyruz; Jackson, Katherine J. L.; Nap, Jan Peter

    2015-01-01

    Motivation To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. Results With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation. PMID:25830241

  17. CATO: The Clone Alignment Tool.

    PubMed

    Henstock, Peter V; LaPan, Peter

    2016-01-01

    High-throughput cloning efforts produce large numbers of sequences that need to be aligned, edited, compared with reference sequences, and organized as files and selected clones. Different pieces of software are typically required to perform each of these tasks. We have designed a single piece of software, CATO, the Clone Alignment Tool, that allows a user to align, evaluate, edit, and select clone sequences based on comparisons to reference sequences. The input and output are designed to be compatible with standard data formats, and thus suitable for integration into a clone processing pipeline. CATO provides both sequence alignment and visualizations to facilitate the analysis of cloning experiments. The alignment algorithm matches each of the relevant candidate sequences against each reference sequence. The visualization portion displays three levels of matching: 1) a top-level summary of the top candidate sequences aligned to each reference sequence, 2) a focused alignment view with the nucleotides of matched sequences displayed against one reference sequence, and 3) a pair-wise alignment of a single reference and candidate sequence pair. Users can select the minimum matching criteria for valid clones, edit or swap reference sequences, and export the results to a summary file as part of the high-throughput cloning workflow. PMID:27459605

  18. CATO: The Clone Alignment Tool

    PubMed Central

    Henstock, Peter V.; LaPan, Peter

    2016-01-01

    High-throughput cloning efforts produce large numbers of sequences that need to be aligned, edited, compared with reference sequences, and organized as files and selected clones. Different pieces of software are typically required to perform each of these tasks. We have designed a single piece of software, CATO, the Clone Alignment Tool, that allows a user to align, evaluate, edit, and select clone sequences based on comparisons to reference sequences. The input and output are designed to be compatible with standard data formats, and thus suitable for integration into a clone processing pipeline. CATO provides both sequence alignment and visualizations to facilitate the analysis of cloning experiments. The alignment algorithm matches each of the relevant candidate sequences against each reference sequence. The visualization portion displays three levels of matching: 1) a top-level summary of the top candidate sequences aligned to each reference sequence, 2) a focused alignment view with the nucleotides of matched sequences displayed against one reference sequence, and 3) a pair-wise alignment of a single reference and candidate sequence pair. Users can select the minimum matching criteria for valid clones, edit or swap reference sequences, and export the results to a summary file as part of the high-throughput cloning workflow. PMID:27459605

  19. The complete nucleotide sequence and genome organization of pea streak virus (genus Carlavirus).

    PubMed

    Su, Li; Li, Zhengnan; Bernardy, Mike; Wiersma, Paul A; Cheng, Zhihui; Xiang, Yu

    2015-10-01

    Pea streak virus (PeSV) is a member of the genus Carlavirus in the family Betaflexiviridae. Here, the first complete genome sequence of PeSV was determined by deep sequencing of a cDNA library constructed from dsRNA extracted from a PeSV-infected sample and Rapid Amplification of cDNA Ends (RACE) PCR. The PeSV genome consists of 8041 nucleotides excluding the poly(A) tail and contains six open reading frames (ORFs). The putative peptide encoded by the PeSV ORF6 has an estimated molecular mass of 6.6 kDa and shows no similarity to any known proteins. This differs from typical carlaviruses, whose ORF6 encodes a 12- to 18-kDa cysteine-rich nucleic-acid-binding protein. PMID:26092422

  20. Infectious hepatitis B virus from cloned DNA of known nucleotide sequence.

    PubMed Central

    Will, H; Cattaneo, R; Darai, G; Deinhardt, F; Schellekens, H; Schaller, H

    1985-01-01

    The infectivity of cloned hepatitis B viral DNA (HBV) has been tested in chimpanzees to identify a fully functional HBV genome and to assess the risk associated with its handling. Only one of two HBV DNA sequence variants tested was shown to be infectious. "Clone purified" virus of predicted nucleotide sequence was produced from the infectious HBV DNA, and the cloned viral genome was identical in structure with naturally occurring HBV. Infection could be initiated independent of whether circular monomeric or plasmid integrated dimeric forms of the viral genome were inoculated, but the infectivity of the DNA depended on liver cell transfection or intrahepatic injection. Intravenous injection of high doses of infectious HBV DNA did not induce hepatitis, suggesting that there is virtually no risk associated with routine laboratory handling of cloned HBV DNA. Images PMID:2983320

  1. Nucleotide sequence of the BsuRI restriction-modification system.

    PubMed Central

    Kiss, A; Posfai, G; Keller, C C; Venetianer, P; Roberts, R J

    1985-01-01

    The genes of the 5'-GGCC specific BsuRI restriction-modification system of Bacillus subtilis have been cloned and expressed in E. coli and their nucleotide sequence has been determined. The restriction and modification genes code for polypeptides with calculated molecular weights of 66,314 and 49,642, respectively. Both enzymes are coded by the same DNA strand. The restriction gene is upstream of the methylase gene and the coding regions are separated by 780 bp. Analysis of the RNA transcripts by S1-nuclease mapping indicates that the restriction and modification genes are transcribed from different promoters. Comparison of the amino acid sequences revealed no homology between the BsuRI restriction and modification enzymes. There are, however, regions of homology between the BsuRI methylase and two other GGCC specific modification enzymes, the BspRI and SPR methylases. Images PMID:2997708

  2. Nucleotide sequence and expression of the gene encoding the EcoRII modification enzyme.

    PubMed Central

    Som, S; Bhagwat, A S; Friedman, S

    1987-01-01

    The gene coding for the EcoRII modification enzyme has been cloned and the nucleotide sequence of 1933 base pairs containing the gene has been determined. The gene codes for a protein of 477 amino acids. Two transcriptional start sites have been mapped by S1 mapping. One deletion that removes 34 N-terminal amino acids was found to have partial enzyme activity. Comparison of the EcoRII methylase sequence with other cytosine methylases revealed several domains of partial homology among all cytosine methylases. Cloning the gene in multicopy pUC vectors increased the expression by 6-18 fold. A 40 fold overproduction of the EcoRII methylase was obtained by cloning the gene in the expression vector carrying the lambda PL promoter. Images PMID:3029675

  3. Nucleotide sequence of nifD from Frankia alni strain ArI3: phylogenetic inferences.

    PubMed

    Normand, P; Gouy, M; Cournoyer, B; Simonet, P

    1992-05-01

    The complete nucleotide sequence of the nifD gene encoding the alpha subunit of component I of nitrogenase from Frankia alni strain ArI3 was determined. The coding region is 1,458 bp in length and encodes a polypeptide of 486 residues with a predicted molecular weight of 53,500. Phylogenetic inferences with 12 complete published nifD sequences were drawn using a variety of approaches. Frankia nifD clusters with proteobacteria rather than with Clostridium pasteurianum, the other Gram-positive bacterium studied. Extant eubacterial nif genes seem to have at least three distinct evolutionary origins as a result of ancient gene duplications. Within the Gram-positive bacterial phylum, functional nif genes descend from different duplicates. PMID:1584016

  4. Nucleotide sequence analysis of beta tubulin gene in a wide range of dermatophytes.

    PubMed

    Rezaei-Matehkolaei, Ali; Mirhendi, Hossein; Makimura, Koichi; de Hoog, G Sybren; Satoh, Kazuo; Najafzadeh, Mohammad Javad; Shidfar, Mohammad Reza

    2014-10-01

    We investigated the resolving power of the beta tubulin protein-coding gene (BT2) for systematic study of dermatophyte fungi. Initially, 144 standard and clinical strains belonging to 26 species in the genera Trichophyton, Microsporum, and Epidermophyton were identified by internal transcribe spacer (ITS) sequencing. Subsequently, BT2 was partially amplified in all strains, and sequence analysis performed after construction of a BT2 database that showed length ranged from approximately 723 (T. ajelloi) to 808 nucleotides (M. persicolor) in different species. Intraspecific sequence variation was found in some species, but T. tonsurans, T. equinum, T. concentricum, T. verrucosum, T. rubrum, T. violaceum, T. eriotrephon, E. floccosum, M. canis, M. ferrugineum, and M. audouinii were invariant. The sequences were found to be relatively conserved among different strains of the same species. The species with the closest resemblance were Arthroderma benhamiae and T. concentricum and T. tonsurans and T. equinum with 100% and 99.8% identity, respectively; the most distant species were M. persicolor and M. amazonicum. The dendrogram obtained from BT2 topology was almost compatible with the species concept based on ITS sequencing, and similar clades and species were distinguished in the BT2 tree. Here, beta tubulin was characterized in a wide range of dermatophytes in order to assess intra- and interspecies variation and resolution and was found to be a taxonomically valuable gene. PMID:25079222

  5. Unique nucleotide sequence (UNS)-guided assembly of repetitive DNA parts for synthetic biology applications

    PubMed Central

    Torella, Joseph P.; Lienert, Florian; Boehm, Christian R.; Chen, Jan-Hung; Way, Jeffrey C.; Silver, Pamela A.

    2016-01-01

    Recombination-based DNA construction methods, such as Gibson assembly, have made it possible to easily and simultaneously assemble multiple DNA parts and hold promise for the development and optimization of metabolic pathways and functional genetic circuits. Over time, however, these pathways and circuits have become more complex, and the increasing need for standardization and insulation of genetic parts has resulted in sequence redundancies — for example repeated terminator and insulator sequences — that complicate recombination-based assembly. We and others have recently developed DNA assembly methods that we refer to collectively as unique nucleotide sequence (UNS)-guided assembly, in which individual DNA parts are flanked with UNSs to facilitate the ordered, recombination-based assembly of repetitive sequences. Here we present a detailed protocol for UNS-guided assembly that enables researchers to convert multiple DNA parts into sequenced, correctly-assembled constructs, or into high-quality combinatorial libraries in only 2–3 days. If the DNA parts must be generated from scratch, an additional 2–5 days are necessary. This protocol requires no specialized equipment and can easily be implemented by a student with experience in basic cloning techniques. PMID:25101822

  6. IMGT/LIGM-DB, the IMGT comprehensive database of immunoglobulin and T cell receptor nucleotide sequences.

    PubMed

    Giudicelli, Véronique; Duroux, Patrice; Ginestoux, Chantal; Folch, Géraldine; Jabado-Michaloud, Joumana; Chaume, Denys; Lefranc, Marie-Paule

    2006-01-01

    IMGT/LIGM-DB is the IMGT comprehensive database of immunoglobulin (IG) and T cell receptor (TR) nucleotide sequences from human and other vertebrate species. It was created in 1989 by LIGM, Montpellier, France and is the oldest and the largest database of IMGT. IMGT/LIGM-DB includes all germline (non-rearranged) and rearranged IG and TR genomic DNA (gDNA) and complementary DNA (cDNA) sequences published in generalist databases. IMGT/LIGM-DB allows searches from the Web interface according to biological and immunogenetic criteria through five distinct modules depending on the user interest. For a given entry, nine types of display are available including the IMGT flat file, the translation of the coding regions and the analysis by the IMGT/V-QUEST tool. IMGT/LIGM-DB distributes expertly annotated sequences. The annotations hugely enhance the quality and the accuracy of the distributed detailed information. They include the sequence identification, the gene and allele classification, the constitutive and specific motif description, the codon and amino acid numbering, and the sequence obtaining information, according to the main concepts of IMGT-ONTOLOGY. They represent the main source of IG and TR gene and allele knowledge stored in IMGT/GENE-DB and in the IMGT reference directory. IMGT/LIGM-DB is freely available at http://imgt.cines.fr. PMID:16381979

  7. Unique nucleotide sequence-guided assembly of repetitive DNA parts for synthetic biology applications

    SciTech Connect

    Torella, JP; Lienert, F; Boehm, CR; Chen, JH; Way, JC; Silver, PA

    2014-08-07

    Recombination-based DNA construction methods, such as Gibson assembly, have made it possible to easily and simultaneously assemble multiple DNA parts, and they hold promise for the development and optimization of metabolic pathways and functional genetic circuits. Over time, however, these pathways and circuits have become more complex, and the increasing need for standardization and insulation of genetic parts has resulted in sequence redundancies-for example, repeated terminator and insulator sequences-that complicate recombination-based assembly. We and others have recently developed DNA assembly methods, which we refer to collectively as unique nucleotide sequence (UNS)-guided assembly, in which individual DNA parts are flanked with UNSs to facilitate the ordered, recombination-based assembly of repetitive sequences. Here we present a detailed protocol for UNS-guided assembly that enables researchers to convert multiple DNA parts into sequenced, correctly assembled constructs, or into high-quality combinatorial libraries in only 2-3 d. If the DNA parts must be generated from scratch, an additional 2-5 d are necessary. This protocol requires no specialized equipment and can easily be implemented by a student with experience in basic cloning techniques.

  8. Mapping Nucleotide Sequences that Encode Complex Binary Disease Traits with HapMap

    PubMed Central

    Cui, Yuehua; Fu, Wenjiang; Sun, Kelian; Romero, Roberto; Wu, Rongling

    2007-01-01

    Detecting the patterns of DNA sequence variants across the human genome is a crucial step for unraveling the genetic basis of complex human diseases. The human HapMap constructed by single nucleotide polymorphisms (SNPs) provides efficient sequence variation information that can speed up the discovery of genes related to common diseases. In this article, we present a generalized linear model for identifying specific nucleotide variants that encode complex human diseases. A novel approach is derived to group haplotypes to form composite diplotypes, which largely reduces the model degrees of freedom for an association test and hence increases the power when multiple SNP markers are involved. An efficient two-stage estimation procedure based on the expectation-maximization (EM) algorithm is derived to estimate parameters. Non-genetic environmental or clinical risk factors can also be fitted into the model. Computer simulations show that our model has reasonable power and type I error rate with appropriate sample size. It is also suggested through simulations that a balanced design with approximately equal number of cases and controls should be preferred to maintain small estimation bias and reasonable testing power. To illustrate the utility, we apply the method to a genetic association study of large for gestational age (LGA) neonates. The model provides a powerful tool for elucidating the genetic basis of complex binary diseases. PMID:19384427

  9. Mapping DNA methylation by transverse current sequencing: Reduction of noise from neighboring nucleotides

    NASA Astrophysics Data System (ADS)

    Alvarez, Jose; Massey, Steven; Kalitsov, Alan; Velev, Julian

    Nanopore sequencing via transverse current has emerged as a competitive candidate for mapping DNA methylation without needed bisulfite-treatment, fluorescent tag, or PCR amplification. By eliminating the error producing amplification step, long read lengths become feasible, which greatly simplifies the assembly process and reduces the time and the cost inherent in current technologies. However, due to the large error rates of nanopore sequencing, single base resolution has not been reached. A very important source of noise is the intrinsic structural noise in the electric signature of the nucleotide arising from the influence of neighboring nucleotides. In this work we perform calculations of the tunneling current through DNA molecules in nanopores using the non-equilibrium electron transport method within an effective multi-orbital tight-binding model derived from first-principles calculations. We develop a base-calling algorithm accounting for the correlations of the current through neighboring bases, which in principle can reduce the error rate below any desired precision. Using this method we show that we can clearly distinguish DNA methylation and other base modifications based on the reading of the tunneling current.

  10. Evidence for Balancing Selection from Nucleotide Sequence Analyses of Human G6PD

    PubMed Central

    Verrelli, Brian C.; McDonald, John H.; Argyropoulos, George; Destro-Bisol, Giovanni; Froment, Alain; Drousiotou, Anthi; Lefranc, Gerard; Helal, Ahmed N.; Loiselet, Jacques; Tishkoff, Sarah A.

    2002-01-01

    Glucose-6-phosphate dehydrogenase (G6PD) mutations that result in reduced enzyme activity have been implicated in malarial resistance and constitute one of the best examples of selection in the human genome. In the present study, we characterize the nucleotide diversity across a 5.2-kb region of G6PD in a sample of 160 Africans and 56 non-Africans, to determine how selection has shaped patterns of DNA variation at this gene. Our global sample of enzymatically normal B alleles and A, A−, and Med alleles with reduced enzyme activities reveals many previously uncharacterized silent-site polymorphisms. In comparison with the absence of amino acid divergence between human and chimpanzee G6PD sequences, we find that the number of G6PD amino acid polymorphisms in human populations is significantly high. Unlike many other G6PD-activity alleles with reduced activity, we find that the age of the A variant, which is common in Africa, may not be consistent with the recent emergence of severe malaria and therefore may have originally had a historically different adaptive function. Overall, our observations strongly support previous genotype-phenotype association studies that proposed that balancing selection maintains G6PD deficiencies within human populations. The present study demonstrates that nucleotide sequence analyses can reveal signatures of both historical and recent selection in the genome and may elucidate the impact that infectious disease has had during human evolution. PMID:12378426

  11. The HLA-DRA*0102 allele: correct nucleotide sequence and associated HLA haplotypes.

    PubMed

    Kralovicova, J; Marsh, S G E; Waller, M J; Hammarstrom, L; Vorechovsky, I

    2002-09-01

    Here we correct the nucleotide sequence of a single known variant of the HLA-DRA gene. We show that the coding regions of the HLA-DRA*0101 and HLA-DRA*0102 alleles do not differ at two codons as reported previously, but only in codon 217. Using nucleotide sequencing and DNA samples from individuals homozygous in the major histocompatibility complex, we found that the variant, leucine 217-encoding HLA-DRA*0102 allele was present on the haplotypes HLA-B*0801, DRB1*03011, DQB1*0201 (ancestral haplotype AH8.1), HLA-B*07021, DRB1*15011, DQB1*0602 (AH7.1), HLA-B*1501, DRB1*15011, DQB1*0602, HLA-B*1501, DRB1*1402, DQB1*03011 and HLA-A3, B*07021, DRB1*1301, DQB1*0603. The HLA-DRA*0101 allele coding for valine 217 was observed on the haplotypes HLA-B*5701, DRB1*0701, DQB1*03032 (AH57.1), HLA-DRB1*04011, DQB1*0302, HLA-DRB1*0701, DQB1*0202, and HLA-DRB1*0101, DQB1*05011. PMID:12445311

  12. Complete nucleotide sequence of a Spanish isolate of alfalfa mosaic virus: evidence for additional genetic variability.

    PubMed

    Parrella, Giuseppe; Acanfora, Nadia; Orílio, Anelise F; Navas-Castillo, Jesús

    2011-06-01

    Alfalfa mosaic virus (AMV) is a plant virus that is distributed worldwide and can induce necrosis and/or yellow mosaic on a large variety of plant species, including commercially important crops. It is the only virus of the genus Alfamovirus in the family Bromoviridae. AMV isolates can be clustered into two genetic groups that correlate with their geographic origin. Here, we report for the first time the complete nucleotide sequence of a Spanish isolate of AMV found infecting Cape honeysuckle (Tecoma capensis) and named Tec-1. The tripartite genome of Tec-1 is composed of 3643 nucleotides (nt) for RNA1, 2594 nt for RNA2 and 2037 nt for RNA3. Comparative sequence analysis of the coat protein gene revealed that the isolate Tec-1 is distantly related to subgroup I of AMV and more closely related to subgroup II, although forming a distinct phylogenetic clade. Therefore, we propose to split subgroup II of AMV into two subgroups, namely IIA, comprising isolates previously included in subgroup II, and IIB, including the novel Spanish isolate Tec-1. PMID:21327783

  13. Complete nucleotide sequence and genome organization of Pelargonium flower break virus.

    PubMed

    Rico, P; Hernández, C

    2004-03-01

    The complete nucleotide sequence of Pelargonium flower break virus (PFBV) has been determined. The genomic RNA is 3923 nucleotides (nt) long and contains five open reading frames (ORFs). The 5'-proximal ORF encodes a 27 kDa protein (p27) and terminates with an amber codon which may be read-through into an in-frame p56 ORF to generate a 86 kDa protein (p86) containing the viral RNA dependent-RNA polymerase motifs. Two small ORFs, located in the central part of the viral genome, encode polypeptides of 7 (p7) and 12 kDa (p12), respectively, which are very likely involved in virus movement. Interestingly, p12 presents a leucine zipper motif that has not been previously reported in related proteins. The 3'-proximal ORF encodes a 37 kDa capsid protein (CP). The p12 ORF is in-frame with the p86 ORF and a double read-through protein of 99 kDa (p99) may be produced. Amino acid sequence comparisons revealed that the proteins encoded by ORFs 2, 3 and 4 are more similar to the corresponding gene products of Carnation mottle virus than to those of other carmoviruses, whereas the p27 and the CP show higher identity with the equivalent proteins of Saguaro cactus virus. Phylogenetic analysis conducted with the different viral products confirmed the assignment of PFBV to the genus Carmovirus. PMID:14991450

  14. Alignment of Escherichia coli K12 DNA sequences to a genomic restriction map.

    PubMed Central

    Rudd, K E; Miller, W; Ostell, J; Benson, D A

    1990-01-01

    We use the extensive published information describing the genome of Escherichia coli and new restriction map alignment software to align DNA sequence, genetic, and physical maps. Restriction map alignment software is used which considers restriction maps as strings analogous to DNA or protein sequences except that two values, enzyme name and DNA base address, are associated with each position on the string. The resulting alignments reveal a nearly linear relationship between the physical and genetic maps of the E. coli chromosome. Physical map comparisons with the 1976, 1980, and 1983 genetic maps demonstrate a better fit with the more recent maps. The results of these alignments are genomic kilobase coordinates, orientation and rank of the alignment that best fits the genetic data. A statistical measure based on extreme value distribution is applied to the alignments. Additional computer analyses allow us to estimate the accuracy of the published E. coli genomic restriction map, simulate rearrangements of the bacterial chromosome, and search for repetitive DNA. The procedures we used are general enough to be applicable to other genome mapping projects. PMID:2183179

  15. Self-consistently optimized statistical mechanical energy functions for sequence structure alignment.

    PubMed Central

    Koretke, K. K.; Luthey-Schulten, Z.; Wolynes, P. G.

    1996-01-01

    A quantitative form of the principle of minimal frustration is used to obtain from a database analysis statistical mechanical energy functions and gap parameters for aligning sequences to three-dimensional structures. The analysis that partially takes into account correlations in the energy landscape improves upon the previous approximations of Goldstein et al. (1994, 1995) (Goldstein R, Luthey-Schulten Z, Wolynes P, 1994, Proceedings of the 27th Hawaii International Conference on System Sciences. Los Alamitos, California: IEEE Computer Society Press. pp 306-315; Goldstein R, Luthey-Schulten Z, Wolynes P, 1995, In: Elber R, ed. New developments in theoretical studies of proteins. Singapore: World Scientific). The energy function allows for ordering of alignments based on the compatibility of a sequence to be in a given structure (i.e., lowest energy) and therefore removes the necessity of using percent identity or similarity as scoring parameters. The alignments produced by the energy function on distant homologues with low percent identity (less than 21%) are generally better than those generated with evolutionary information. The lowest energy alignment generated with the energy function for sequences containing prosite signatures but unknown structures is a structure containing the same prosite signature, providing a check on the robustness of the algorithm. Finally, the energy function can make use of known experimental evidence as constraints within the alignment algorithm to aid in finding the correct structural alignment. PMID:8762136

  16. Alignment-Free Sequence Comparison Based on Next-Generation Sequencing Reads

    PubMed Central

    Song, Kai; Ren, Jie; Zhai, Zhiyuan; Liu, Xuemei

    2013-01-01

    Abstract Next-generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, D2, \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\textbf{\\textit{D}}_{\\bf 2}^{\\bf *}$$ \\end{document}, and \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\textbf{\\textit{D}}_{\\bf 2}^S$$ \\end{document}, both theoretically and by simulations. Theoretical formulas for the power of detecting the relationship between two sequences related through a common motif model are derived. It is shown that both \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin{document} $$\\textbf{\\textit{D}}_{\\bf 2}^{\\bf *}$$ \\end{document} and \\documentclass{aastex}\\usepackage{amsbsy}\\usepackage{amsfonts}\\usepackage{amssymb}\\usepackage{bm}\\usepackage{mathrsfs}\\usepackage{pifont}\\usepackage{stmaryrd}\\usepackage{textcomp}\\usepackage{portland, xspace}\\usepackage{amsmath, amsxtra}\\pagestyle{empty}\\DeclareMathSizes{10}{9}{7}{6}\\begin

  17. The bioinformatics of nucleotide sequence coding for proteins requiring metal coenzymes and proteins embedded with metals

    NASA Astrophysics Data System (ADS)

    Tremberger, G.; Dehipawala, Sunil; Cheung, E.; Holden, T.; Sullivan, R.; Nguyen, A.; Lieberman, D.; Cheung, T.

    2015-09-01

    All metallo-proteins need post-translation metal incorporation. In fact, the isotope ratio of Fe, Cu, and Zn in physiology and oncology have emerged as an important tool. The nickel containing F430 is the prosthetic group of the enzyme methyl coenzyme M reductase which catalyzes the release of methane in the final step of methano-genesis, a prime energy metabolism candidate for life exploration space mission in the solar system. The 3.5 Gyr early life sulfite reductase as a life switch energy metabolism had Fe-Mo clusters. The nitrogenase for nitrogen fixation 3 billion years ago had Mo. The early life arsenite oxidase needed for anoxygenic photosynthesis energy metabolism 2.8 billion years ago had Mo and Fe. The selection pressure in metal incorporation inside a protein would be quantifiable in terms of the related nucleotide sequence complexity with fractal dimension and entropy values. Simulation model showed that the studied metal-required energy metabolism sequences had at least ten times more selection pressure relatively in comparison to the horizontal transferred sequences in Mealybug, guided by the outcome histogram of the correlation R-sq values. The metal energy metabolism sequence group was compared to the circadian clock KaiC sequence group using magnesium atomic level bond shifting mechanism in the protein, and the simulation model would suggest a much higher selection pressure for the energy life switch sequence group. The possibility of using Kepler 444 as an example of ancient life in Galaxy with the associated exoplanets has been proposed and is further discussed in this report. Examples of arsenic metal bonding shift probed by Synchrotron-based X-ray spectroscopy data and Zn controlled FOXP2 regulated pathways in human and chimp brain studied tissue samples are studied in relationship to the sequence bioinformatics. The analysis results suggest that relatively large metal bonding shift amount is associated with low probability correlation R

  18. Purification and characterization of Clostridium perfringens 120-kilodalton collagenase and nucleotide sequence of the corresponding gene.

    PubMed Central

    Matsushita, O; Yoshihara, K; Katayama, S; Minami, J; Okabe, A

    1994-01-01

    Clostridium perfringens type C NCIB 10662 produced various gelatinolytic enzymes with molecular masses ranging from approximately 120 to approximately 80 kDa. A 120-kDa gelatinolytic enzyme was present in the largest quantity in the culture supernatant, and this enzyme was purified to homogeneity on the basis of sodium dodecyl sulfate-polyacrylamide gel electrophoresis. The purified enzyme was identified as the major collagenase of the organism, and it cleaved typical collagenase substrates such as azocoll, a synthetic substrate (4-phenylazobenzyloxy-carbonyl-Pro-Leu-Gly-Pro-D-Arg [Pz peptide]), and a type I collagen fibril. In addition, a gene (colA) encoding a 120-kDa collagenase was cloned in Escherichia coli. Nested deletions were used to define the coding region of colA, and this region was sequenced; from the nucleotide sequence, this gene encodes a protein of 1,104 amino acids (M(r), 125,966). Furthermore, from the N-terminal amino acid sequence of the purified enzyme which was found in this reading frame, the molecular mass of the mature enzyme was calculated to be 116,339 Da. Analysis of the primary structure of the gene product showed that the enzyme was produced with a stretch of 86 amino acids containing a putative signal sequence. Within this stretch was found PLGP, the amino acid sequence constituting the Pz peptide. This sequence may be implicated in self-processing of the collagenase. A consensus zinc-binding sequence (HEXXH) suggested for vertebrate Zn collagenases is present in this bacterial collagenase. Vibrio alginolyticus collagenase and Achromobacter lyticus protease I showed significant homology with the 120-kDa collagenase of C. perfringens, suggesting that these three enzymes are evolutionarily related. Images PMID:8282691

  19. Species diagnostic single-nucleotide polymorphism and sequence-tagged site markers for the parasitic WASP Genus Nasonia (Hymenoptera: Ptermalidae)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We developed, identified and evaluated eight single nucleotide polymorphism (SNP) and three sequence-tagged site (STS) markers in nuclear gene sequences of the wasp genus Nasonia (Hymenoptera). We studied variation of these markers in natural populations of the closely related and regionally sympatr...

  20. Cloning and nucleotide sequence of anaerobically induced porin protein E1 (OprE) of Pseudomonas aeruginosa PAO1.

    PubMed

    Yamano, Y; Nishikawa, T; Komatsu, Y

    1993-05-01

    The porin oprE gene of Pseudomonas aeruginosa PAO1 was isolated. Its nucleotide sequence indicated that the structural gene of 1383 nucleotide residues encodes a precursor consisting of 460 amino acid residues with a signal peptide of 29 amino acid residues, which was confirmed by the N-terminal 23-amino-acid sequence and the reaction with anti-OprE polyclonal antiserum. Anaerobiosis induced OprE production at the transcription level. The transcription start site was determined to be 40 nucleotides upstream from the ATG initiation codon. The control region contained an appropriately situated E sigma 54 recognition site and the putative second half of an ANR box. The amino acid sequence of OprE had some clusters of sequence homologous with that of OprD of P. aeruginosa, which might be responsible for the outer membrane permeability of imipenem and basic amino acids. PMID:8394980

  1. Cloning, nucleotide sequence, and transcriptional analysis of the Pediococcus acidilactici L-(+)-lactate dehydrogenase gene.

    PubMed Central

    Garmyn, D; Ferain, T; Bernard, N; Hols, P; Delcour, J

    1995-01-01

    Recombinant plasmids containing the Pediococcus acidilactici L-(+)-lactate dehydrogenase gene (ldhL) were isolated by complementing for growth under anaerobiosis of an Escherichia coli lactate dehydrogenase-pyruvate formate lyase double mutant. The nucleotide sequence of the ldhL gene predicted a protein of 323 amino acids showing significant similarity with other bacterial L-(+)-lactate dehydrogenases and especially with that of Lactobacillus plantarum. The ldhL transcription start points in P. acidilactici were defined by primer extension, and the promoter sequence was identified as TCAAT-(17 bp)-TATAAT. This sequence is closely related to the consensus sequence of vegetative promoters from gram-positive bacteria as well as from E. coli. Northern analysis of P. acidilactici RNA showed a 1.1-kb ldhL transcript whose abundance is growth rate regulated. These data, together with the presence of a putative rho-independent transcriptional terminator, suggest that ldhL is expressed as a monocistronic transcript in P. acidilactici. PMID:7887607

  2. Nucleotide sequence of ompV, the gene for a major Vibrio cholerae outer membrane protein.

    PubMed

    Pohlner, J; Meyer, T F; Jalajakumari, M B; Manning, P A

    1986-12-01

    The nucleotide sequence of the ompV gene of Vibrio cholerae was determined. The product of the gene is a 28,000 dalton protein which, after the removal of a 19 amino acid signal sequence, produces a mature outer membrane protein of 26,000 daltons. The cleavage site was determined by amino-terminal amino acid sequencing of the purified mature protein. The DNA upstream of the gene shows the presence of a typical promoter region as judged from the Escherichia coli consensus information; however, the Shine-Dalgarno sequence is associated with a region capable of forming a secondary structure in the mRNA. The formation of this structure would inhibit binding of the mRNA to the ribosome and reduce translation. It is proposed that this structure is recognized by a positive activator in V. cholerae and because of its absence in E. coli ompV is poorly expressed. The distribution of rare codons within ompV suggests that they may serve to slow down the translation of particular domains such that the nascent polypeptide has an opportunity to take up its conformation without interference from the later formed regions. Such a mechanism could aid localization of the protein if export were by a contranslational secretion system. PMID:3031428

  3. Nucleotide sequence of both genomic RNAs of a North American tobacco rattle virus isolate.

    PubMed

    Sudarshana, M R; Berger, P H

    1998-01-01

    The complete sequence of a North American tobacco rattle virus (TRV) isolate, 'Oregon yellow' (ORY), was determined from cDNA and RT-PCR clones derived from the two genomic RNAs of this isolate. The RNA-1 is 6790 bases and RNA-2 is 3261 bases. The sequence of TRV-ORY RNA-1 was similar to RNA-1 to TRV isolate SYM, and differs in 48 nucleotides. TRV-ORY RNA-1 was one base shorter than--SYM, and had 47 base substitutions resulting in 12 amino acid substitutions of which 4 were conservative. The RNA-2 of TRV-ORY was distinct from RNA-2 of other characterized TRV isolates and contained three open reading frames (ORFs) that could potentially code for proteins of MW 22.4 kDa, 37.6 kDa and 17.9 kDa. Based on the homology of the predicted amino acid sequence with those of other tobraviruses. ORF1 of RNA-2 encodes the coat protein (CP). The protein sequence of ORF2 had regions of limited similarity with those of ORF2 of two other TRV isolates and pea early browning tobravirus. The ORF3 was unique to TRV-ORY. Phylogenetic analysis of tobravirus CPs indicated that TRV-ORY was most closely related to pepper ringspot tobravirus and TRV-TCM. The relationship of tobravirus CPs to other rod-shaped tubular plant viruses is also discussed. PMID:9739332

  4. Power Spectrum and Mutual Information Analyses of DNA Base (Nucleotide) Sequences

    NASA Astrophysics Data System (ADS)

    Isohata, Yasuhiko; Hayashi, Masaki

    2003-03-01

    On the basis of the power spectrum analyses for the base (nucleotide) sequences of various genes, we have studied long-range correlations in total base sequences which are expressed as 1/fα, behaviour of the exponent α for the accumulated base sequences as well as periodicities at short range. In particular from the analysis of content rate distributions of α we have obtained the average value \\barα=0.40± 0.01 and \\barα=0.20± 0.01 for the human genes and S. cerevisiae genes, respectively. We have also performed the analyses using the mutual information function. We show that there exists a clear difference between the content rate distributions of correlation lengths for the sample human genes and the S. cerevisiae genes. We are led to a conjecture that the elongation of the correlation length in the base sequences of genes from the early eukaryote (S. cerevisiae) to the late eukaryote (human) should be the definite reflection of the evolutionary process.

  5. Proteus mirabilis ambient-temperature fimbriae: cloning and nucleotide sequence of the aft gene cluster.

    PubMed Central

    Massad, G; Fulkerson, J F; Watson, D C; Mobley, H L

    1996-01-01

    Uropathogenic Proteus mirabilis produces at least four types of fimbriae. Amino acid sequences from two peptides, derived by tryptic digestion of the structural subunit of one type of these fimbriae, the ambient-temperature fimbriae, were determined: NVVPGQPSSTQ and LIEGENQLNYNA. PCR primers, based on these sequences and that of the N terminus, were used to amplify a 359-bp fragment. A cosmid clone, isolated from a P. mirabilis genomic library by hybridization with the 359-bp PCR product, was used to determine the nucleotide sequence of the atf gene cluster. A 3,903-bp region encodes three polypeptides: AtfA, the structural subunit; AtfB, the chaperone; and AtfC, the outer membrane molecular usher. No fimbria-related genes are evident either 5' or 3' to the three contiguous genes. AtfA demonstrates significant amino acid sequence identity with type 1 major fimbrial subunits of several enteric species. The 359-bp PCR product hybridized strongly with all Proteus isolates (n = 9) and 25% of 355 Escherichia coli isolates but failed to hybridize with any of 26 isolates among nine other uropathogenic species. Ambient-temperature fimbriae of P. mirabilis may represent a novel type of fimbriae of enteric species. PMID:8926119

  6. Increased functional protein expression using nucleotide sequence features enriched in highly expressed genes in zebrafish

    PubMed Central

    Horstick, Eric J.; Jordan, Diana C.; Bergeron, Sadie A.; Tabor, Kathryn M.; Serpe, Mihaela; Feldman, Benjamin; Burgess, Harold A.

    2015-01-01

    Many genetic manipulations are limited by difficulty in obtaining adequate levels of protein expression. Bioinformatic and experimental studies have identified nucleotide sequence features that may increase expression, however it is difficult to assess the relative influence of these features. Zebrafish embryos are rapidly injected with calibrated doses of mRNA, enabling the effects of multiple sequence changes to be compared in vivo. Using RNAseq and microarray data, we identified a set of genes that are highly expressed in zebrafish embryos and systematically analyzed for enrichment of sequence features correlated with levels of protein expression. We then tested enriched features by embryo microinjection and functional tests of multiple protein reporters. Codon selection, releasing factor recognition sequence and specific introns and 3′ untranslated regions each increased protein expression between 1.5- and 3-fold. These results suggested principles for increasing protein yield in zebrafish through biomolecular engineering. We implemented these principles for rational gene design in software for codon selection (CodonZ) and plasmid vectors incorporating the most active non-coding elements. Rational gene design thus significantly boosts expression in zebrafish, and a similar approach will likely elevate expression in other animal models. PMID:25628360

  7. Targeted capture enrichment and sequencing identifies extensive nucleotide variation in the turkey MHC-B.

    PubMed

    Reed, Kent M; Mendoza, Kristelle M; Settlage, Robert E

    2016-03-01

    Variation in the major histocompatibility complex (MHC) is increasingly associated with disease susceptibility and resistance in avian species of agricultural importance. This variation includes sequence polymorphisms but also structural differences (gene rearrangement) and copy number variation (CNV). The MHC has now been described for multiple galliform species including the best defined assemblies of the chicken (Gallus gallus) and domestic turkey (Meleagris gallopavo). Using this sequence resource, this study applied high-throughput sequencing to investigate MHC variation in turkeys of North America (NA turkeys). An MHC-specific SureSelect (Agilent) capture array was developed, and libraries were created for 14 turkeys representing domestic (commercial bred), heritage breed, and wild turkeys. In addition, a representative of the Ocellated turkey (M. ocellata) and chicken (G. gallus) was included to test cross-species applicability of the capture array allowing for identification of new species-specific polymorphisms. Libraries were hybridized to ∼12 K cRNA baits and the resulting pools were sequenced. On average, 98% of processed reads mapped to the turkey whole genome sequence and 53% to the MHC target. In addition to the MHC, capture hybridization recovered sequences corresponding to other MHC regions. Sequence alignment and de novo assembly indicated the presence of several additional BG genes in the turkey with evidence for CNV. Variant detection identified an average of 2245 polymorphisms per individual for the NA turkeys, 3012 for the Ocellated turkey, and 462 variants in the chicken (RJF-256). This study provides an extensive sequence resource for examining MHC variation and its relation to health of this agriculturally important group of birds. PMID:26729471

  8. Aptaligner: Automated Software for Aligning Pseudorandom DNA X-Aptamers from Next-Generation Sequencing Data

    PubMed Central

    2015-01-01

    Next-generation sequencing results from bead-based aptamer libraries have demonstrated that traditional DNA/RNA alignment software is insufficient. This is particularly true for X-aptamers containing specialty bases (W, X, Y, Z, ...) that are identified by special encoding. Thus, we sought an automated program that uses the inherent design scheme of bead-based X-aptamers to create a hypothetical reference library and Markov modeling techniques to provide improved alignments. Aptaligner provides this feature as well as length error and noise level cutoff features, is parallelized to run on multiple central processing units (cores), and sorts sequences from a single chip into projects and subprojects. PMID:24866698

  9. DIALIGN at GOBICS—multiple sequence alignment using various sources of external information

    PubMed Central

    Al Ait, Layal; Yamak, Zaher; Morgenstern, Burkhard

    2013-01-01

    DIALIGN is an established tool for multiple sequence alignment that is particularly useful to detect local homologies in sequences with low overall similarity. In recent years, various versions of the program have been developed, some of which are fully automated, whereas others are able to accept user-specified external information. In this article, we review some versions of the program that are available through ‘Göttingen Bioinformatics Compute Server’. In addition to previously described implementations, we present a new release of DIALIGN called ‘DIALIGN-PFAM’, which uses hits to the PFAM database for improved protein alignment. Our software is available through http://dialign.gobics.de/. PMID:23620293

  10. Aptaligner: automated software for aligning pseudorandom DNA X-aptamers from next-generation sequencing data.

    PubMed

    Lu, Emily; Elizondo-Riojas, Miguel-Angel; Chang, Jeffrey T; Volk, David E

    2014-06-10

    Next-generation sequencing results from bead-based aptamer libraries have demonstrated that traditional DNA/RNA alignment software is insufficient. This is particularly true for X-aptamers containing specialty bases (W, X, Y, Z, ...) that are identified by special encoding. Thus, we sought an automated program that uses the inherent design scheme of bead-based X-aptamers to create a hypothetical reference library and Markov modeling techniques to provide improved alignments. Aptaligner provides this feature as well as length error and noise level cutoff features, is parallelized to run on multiple central processing units (cores), and sorts sequences from a single chip into projects and subprojects. PMID:24866698

  11. Mulan: Multiple-Sequence Local Alignment and Visualization for Studying Function and Evolution

    SciTech Connect

    Ovcharenko, I; Loots, G; Giardine, B; Hou, M; Ma, J; Hardison, R; Stubbs, L; Miller, W

    2004-07-14

    Multiple sequence alignment analysis is a powerful approach for understanding phylogenetic relationships, annotating genes and detecting functional regulatory elements. With a growing number of partly or fully sequenced vertebrate genomes, effective tools for performing multiple comparisons are required to accurately and efficiently assist biological discoveries. Here we introduce Mulan (http://mulan.dcode.org/), a novel method and a network server for comparing multiple draft and finished-quality sequences to identify functional elements conserved over evolutionary time. Mulan brings together several novel algorithms: the tba multi-aligner program for rapid identification of local sequence conservation and the multiTF program for detecting evolutionarily conserved transcription factor binding sites in multiple alignments. In addition, Mulan supports two-way communication with the GALA database; alignments of multiple species dynamically generated in GALA can be viewed in Mulan, and conserved transcription factor binding sites identified with Mulan/multiTF can be integrated and overlaid with extensive genome annotation data using GALA. Local multiple alignments computed by Mulan ensure reliable representation of short-and large-scale genomic rearrangements in distant organisms. Mulan allows for interactive modification of critical conservation parameters to differentially predict conserved regions in comparisons of both closely and distantly related species. We illustrate the uses and applications of the Mulan tool through multi-species comparisons of the GATA3 gene locus and the identification of elements that are conserved differently in avians than in other genomes allowing speculation on the evolution of birds. Source code for the aligners and the aligner-evaluation software can be freely downloaded from http://bio.cse.psu.edu/.

  12. Analysis of the genome sequence of the pathogenic Muscovy duck parvovirus strain YY reveals a 14-nucleotide-pair deletion in the inverted terminal repeats.

    PubMed

    Wang, Jianye; Huang, Yu; Zhou, Mingxu; Zhu, Guoqiang

    2016-09-01

    Genomic information about Muscovy duck parvovirus is still limited. In this study, the genome of the pathogenic MDPV strain YY was sequenced. The full-length genome of YY is 5075 nucleotides (nt) long, 57 nt shorter than that of strain FM. Sequence alignment indicates that the 5' and 3' inverted terminal repeats (ITR) of strain YY contain a 14-nucleotide-pair deletion in the stem of the palindromic hairpin structure in comparison to strain FM and FZ91-30. The deleted region contains one "E-box" site and one repeated motif with the sequence "TTCCGGT" or "ACCGGAA". Phylogenetic trees constructed based the protein coding genes concordantly showed that YY, together with nine other MDPV isolates from various places, clustered in a separate branch, distinct from the branch formed by goose parvovirus (GPV) strains. These results demonstrate that, despite the distinctive deletion, the YY strain still belongs to the classical MDPV group. Moreover, the deletion of ITR may contribute to the genome evolution of MDPV under immunization pressure. PMID:27344160

  13. ALVIS: interactive non-aggregative visualization and explorative analysis of multiple sequence alignments.

    PubMed

    Schwarz, Roland F; Tamuri, Asif U; Kultys, Marek; King, James; Godwin, James; Florescu, Ana M; Schultz, Jörg; Goldman, Nick

    2016-05-01

    Sequence Logos and its variants are the most commonly used method for visualization of multiple sequence alignments (MSAs) and sequence motifs. They provide consensus-based summaries of the sequences in the alignment. Consequently, individual sequences cannot be identified in the visualization and covariant sites are not easily discernible. We recently proposed Sequence Bundles, a motif visualization technique that maintains a one-to-one relationship between sequences and their graphical representation and visualizes covariant sites. We here present Alvis, an open-source platform for the joint explorative analysis of MSAs and phylogenetic trees, employing Sequence Bundles as its main visualization method. Alvis combines the power of the visualization method with an interactive toolkit allowing detection of covariant sites, annotation of trees with synapomorphies and homoplasies, and motif detection. It also offers numerical analysis functionality, such as dimension reduction and classification. Alvis is user-friendly, highly customizable and can export results in publication-quality figures. It is available as a full-featured standalone version (http://www.bitbucket.org/rfs/alvis) and its Sequence Bundles visualization module is further available as a web application (http://science-practice.com/projects/sequence-bundles). PMID:26819408

  14. ALVIS: interactive non-aggregative visualization and explorative analysis of multiple sequence alignments

    PubMed Central

    Schwarz, Roland F.; Tamuri, Asif U.; Kultys, Marek; King, James; Godwin, James; Florescu, Ana M.; Schultz, Jörg; Goldman, Nick

    2016-01-01

    Sequence Logos and its variants are the most commonly used method for visualization of multiple sequence alignments (MSAs) and sequence motifs. They provide consensus-based summaries of the sequences in the alignment. Consequently, individual sequences cannot be identified in the visualization and covariant sites are not easily discernible. We recently proposed Sequence Bundles, a motif visualization technique that maintains a one-to-one relationship between sequences and their graphical representation and visualizes covariant sites. We here present Alvis, an open-source platform for the joint explorative analysis of MSAs and phylogenetic trees, employing Sequence Bundles as its main visualization method. Alvis combines the power of the visualization method with an interactive toolkit allowing detection of covariant sites, annotation of trees with synapomorphies and homoplasies, and motif detection. It also offers numerical analysis functionality, such as dimension reduction and classification. Alvis is user-friendly, highly customizable and can export results in publication-quality figures. It is available as a full-featured standalone version (http://www.bitbucket.org/rfs/alvis) and its Sequence Bundles visualization module is further available as a web application (http://science-practice.com/projects/sequence-bundles). PMID:26819408

  15. Guanine nucleotide-binding proteins that enhance choleragen ADP-ribosyltransferase activity: nucleotide and deduced amino acid sequence of an ADP-ribosylation factor cDNA.

    PubMed Central

    Price, S R; Nightingale, M; Tsai, S C; Williamson, K C; Adamik, R; Chen, H C; Moss, J; Vaughan, M

    1988-01-01

    Three (two soluble and one membrane) guanine nucleotide-binding proteins (G proteins) that enhance ADP-ribosylation of the Gs alpha stimulatory subunit of the adenylyl cyclase (EC 4.6.1.1) complex by choleragen have recently been purified from bovine brain. To further define the structure and function of these ADP-ribosylation factors (ARFs), we isolated a cDNA clone (lambda ARF2B) from a bovine retinal library by screening with a mixed heptadecanucleotide probe whose sequence was based on the partial amino acid sequence of one of the soluble ARFs from bovine brain. Comparison of the deduced amino acid sequence of lambda ARF2B with sequences of peptides from the ARF protein (total of 60 amino acids) revealed only two differences. Whether these are cloning artifacts or reflect the existence of more than one ARF protein remains to be determined. Deduced amino acid sequences of ARF, Go alpha (the alpha subunit of a G protein that may be involved in regulation of ion fluxes), and c-Ha-ras gene product p21 show similarities in regions believed to be involved in guanine nucleotide binding and GTP hydrolysis. ARF apparently lacks a site analogous to that ADP-ribosylated by choleragen in G-protein alpha subunits. Although both the ARF proteins and the alpha subunits bind guanine nucleotides and serve as choleragen substrates, they must interact with the toxin A1 peptide in different ways. In addition to serving as an ADP-ribose acceptor, ARF interacts with the toxin in a manner that modifies its catalytic properties. PMID:3135549

  16. Multiple sequence alignment algorithm based on a dispersion graph and ant colony algorithm.

    PubMed

    Chen, Weiyang; Liao, Bo; Zhu, Wen; Xiang, Xuyu

    2009-10-01

    In this article, we describe a representation for the processes of multiple sequences alignment (MSA) and used it to solve the problem of MSA. By this representation, we took every possible aligning result into account by defining the representation of gap insertion, the value of heuristic information in every optional path and scoring rule. On the basis of the proposed multidimensional graph, we used the ant colony algorithm to find the better path that denotes a better aligning result. In our article, we proposed the instance of three-dimensional graph and four-dimensional graph and advanced a special ichnographic representation to analyze MSA. It is yet only an experimental software, and we gave an example for finding the best aligning result by three-dimensional graph and ant colony algorithm. Experimental results show that our method can improve the solution quality on MSA benchmarks. PMID:19130503

  17. DNAAlignEditor: DNA alignment editor tool

    PubMed Central

    Sanchez-Villeda, Hector; Schroeder, Steven; Flint-Garcia, Sherry; Guill, Katherine E; Yamasaki, Masanori; McMullen, Michael D

    2008-01-01

    Background With advances in DNA re-sequencing methods and Next-Generation parallel sequencing approaches, there has been a large increase in genomic efforts to define and analyze the sequence variability present among individuals within a species. For very polymorphic species such as maize, this has lead to a need for intuitive, user-friendly software that aids the biologist, often with naïve programming capability, in tracking, editing, displaying, and exporting multiple individual sequence alignments. To fill this need we have developed a novel DNA alignment editor. Results We have generated a nucleotide sequence alignment editor (DNAAlignEditor) that provides an intuitive, user-friendly interface for manual editing of multiple sequence alignments with functions for input, editing, and output of sequence alignments. The color-coding of nucleotide identity and the display of associated quality score aids in the manual alignment editing process. DNAAlignEditor works as a client/server tool having two main components: a relational database that collects the processed alignments and a user interface connected to database through universal data access connectivity drivers. DNAAlignEditor can be used either as a stand-alone application or as a network application with multiple users concurrently connected. Conclusion We anticipate that this software will be of general interest to biologists and population genetics in editing DNA sequence alignments and analyzing natural sequence variation regardless of species, and will be particularly useful for manual alignment editing of sequences in species with high levels of polymorphism. PMID:18366684

  18. Filamentous hemagglutinin of Bordetella pertussis: nucleotide sequence and crucial role in adherence.

    PubMed Central

    Relman, D A; Domenighini, M; Tuomanen, E; Rappuoli, R; Falkow, S

    1989-01-01

    Filamentous hemagglutinin is a surface-associated adherence protein of Bordetella pertussis, which is a component of some new acellular pertussis vaccines. The nucleotide sequence of an open reading frame that encompasses the filamentous hemagglutinin structural gene, fhaB, suggests that proteolytic processing is necessary to generate the mature 220-kDa filamentous hemagglutinin product. An Arg-Gly-Asp (RGD) tripeptide is found within filamentous hemagglutinin that may be involved in its adherence properties. An internal in-frame deletion in fhaB, encompassing the RGD region, causes loss of B. pertussis-binding to ciliated eukaryotic cells, confirming a potential role for this protein in host-cell binding and infection. Images PMID:2539596

  19. Nucleotide sequence of a glucosyltransferase gene from Streptococcus sobrinus MFe28.

    PubMed Central

    Ferretti, J J; Gilpin, M L; Russell, R R

    1987-01-01

    The complete nucleotide sequence was determined for the Streptococcus sobrinus MFe28 gtfI gene, which encodes a glucosyltransferase that produces an insoluble glucan product. A single open reading frame encodes a mature glucosyltransferase protein of 1,559 amino acids (Mr, 172,983) and a signal peptide of 38 amino acids. In the C-terminal one-third of the protein there are six repeating units containing 35 amino acids of partial homology and two repeating units containing 48 amino acids of complete homology. The functional role of these repeating units remains to be determined, although truncated forms of glucosyltransferase containing only the first two repeating units of partial homology maintained glucosyltransferase activity and the ability to bind glucan. Regions of homology with alpha-amylase and glycogen phosphorylase were identified in the glucosyltransferase protein and may represent regions involved in functionally similar domains. Images PMID:3040686

  20. High-Throughput Sequencing Reveals Single Nucleotide Variants in Longer-Kernel Bread Wheat

    PubMed Central

    Chen, Feng; Zhu, Zibo; Zhou, Xiaobian; Yan, Yan; Dong, Zhongdong; Cui, Dangqun

    2016-01-01

    The transcriptomes of bread wheat Yunong 201 and its ethyl methanesulfonate derivative Yunong 3114 were obtained by next-sequencing technology. Single nucleotide variants (SNVs) in the wheat strains were explored and compared. A total of 5907 and 6287 non-synonymous SNVs were acquired for Yunong 201 and 3114, respectively. A total of 4021 genes with SNVs were obtained. The genes that underwent non-synonymous SNVs were significantly involved in ATP binding, protein phosphorylation, and cellular protein metabolic process. The heat map analysis also indicated that most of these mutant genes were significantly differentially expressed at different developmental stages. The SNVs in these genes possibly contribute to the longer kernel length of Yunong 3114. Our data provide useful information on wheat transcriptome for future studies on wheat functional genomics. This study could also help in illustrating the gene functions of the non-synonymous SNVs of Yunong 201 and 3114. PMID:27551288

  1. Complete nucleotide sequence of a virus associated with rusty mottle disease of sweet cherry (Prunus avium).

    PubMed

    Villamor, D V; Druffel, K L; Eastwell, K C

    2013-08-01

    Cherry rusty mottle is a disease of sweet cherries first described in 1940 in western North America. Because of the graft-transmissible nature of the disease, a viral nature of the disease was assumed. Here, the complete genomic nucleotide sequences of virus isolates from two trees expressing cherry rusty mottle disease symptoms are characterized; the virus is designated cherry rusty mottle associated virus (CRMaV). The biological and molecular characteristics of this virus in comparison to those of cherry necrotic rusty mottle virus (CNRMV) and cherry green ring mottle virus (CGRMV) are described. CRMaV was subsequently detected in additional sweet cherry trees expressing symptoms of cherry rusty mottle disease. PMID:23525699

  2. Nucleotide sequence of a gene encoding an organophosphorus nerve agent degrading enzyme from Alteromonas haloplanktis.

    PubMed

    Cheng, T; Liu, L; Wang, B; Wu, J; DeFrank, J J; Anderson, D M; Rastogi, V K; Hamilton, A B

    1997-01-01

    Organophosphorus acid anhydrolases (OPAA) catalyzing the hydrolysis of a variety of toxic organophosphorus cholinesterase inhibitors offer potential for decontamination of G-type nerve agents and pesticides. The gene (opa) encoding an OPAA was cloned from the chromosomal DNA of Alteromonas haloplanktis ATCC 23821. The nucleotide sequence of the 1.7 -kb DNA fragment contained the opa gene (1.3 kb) and its flanking region. We report structural and functional similarity of OPAAs from A. haloplanktis and Alteromonas sp JD6.5 with the enzyme prolidase that hydrolyzes dipeptides with a prolyl residue in the carboxyl-terminal position. These results corroborate the earlier conclusion that the OPAA is a type of X-Pro dipeptidase, and that X-Pro could be the native substrate for such an enzyme in Alteromonas cells. PMID:9079288

  3. Nucleotide sequence analysis of the DNA binding region of the chicken fibronectin gene.

    PubMed

    Karasaki, Y; Gotoh, S; Kubomura, S; Higashi, K; Hirano, H

    1988-12-01

    We have determined the nucleotide sequence of 2.0 kb EcoRI segment from the clone lambda FC32 of the genomic chicken fibronectin gene, which is called DNA binding domain. This segment overlapped another clone lambda FC36 and contained three exons which were 16, 17 and 18. They were classified as Type III repeat as originally shown in bovine plasma fibronectin. The average homologies of these three exons among the chicken, rat and human fibronectins in amino acid level are very high (87-98%) compared with that (79-88%) of the exons in the cell binding domain, indicating that this region is highly conservative during the evolution. PMID:3212295

  4. Developing single nucleotide polymorphism (SNP) markers from transcriptome sequences for identification of longan (Dimocarpus longan) germplasm

    PubMed Central

    Wang, Boyi; Tan, Hua-Wei; Fang, Wanping; Meinhardt, Lyndel W; Mischke, Sue; Matsumoto, Tracie; Zhang, Dapeng

    2015-01-01

    Longan (Dimocarpus longan Lour.) is an important tropical fruit tree crop. Accurate varietal identification is essential for germplasm management and breeding. Using longan transcriptome sequences from public databases, we developed single nucleotide polymorphism (SNP) markers; validated 60 SNPs in 50 longan germplasm accessions, including cultivated varieties and wild germplasm; and designated 25 SNP markers that unambiguously identified all tested longan varieties with high statistical rigor (P<0.0001). Multiple trees from the same clone were verified and off-type trees were identified. Diversity analysis revealed genetic relationships among analyzed accessions. Cultivated varieties differed significantly from wild populations (Fst=0.300; P<0.001), demonstrating untapped genetic diversity for germplasm conservation and utilization. Within cultivated varieties, apparent differences between varieties from China and those from Thailand and Hawaii indicated geographic patterns of genetic differentiation. These SNP markers provide a powerful tool to manage longan genetic resources and breeding, with accurate and efficient genotype identification. PMID:26504559

  5. High-Throughput Sequencing Reveals Single Nucleotide Variants in Longer-Kernel Bread Wheat.

    PubMed

    Chen, Feng; Zhu, Zibo; Zhou, Xiaobian; Yan, Yan; Dong, Zhongdong; Cui, Dangqun

    2016-01-01

    The transcriptomes of bread wheat Yunong 201 and its ethyl methanesulfonate derivative Yunong 3114 were obtained by next-sequencing technology. Single nucleotide variants (SNVs) in the wheat strains were explored and compared. A total of 5907 and 6287 non-synonymous SNVs were acquired for Yunong 201 and 3114, respectively. A total of 4021 genes with SNVs were obtained. The genes that underwent non-synonymous SNVs were significantly involved in ATP binding, protein phosphorylation, and cellular protein metabolic process. The heat map analysis also indicated that most of these mutant genes were significantly differentially expressed at different developmental stages. The SNVs in these genes possibly contribute to the longer kernel length of Yunong 3114. Our data provide useful information on wheat transcriptome for future studies on wheat functional genomics. This study could also help in illustrating the gene functions of the non-synonymous SNVs of Yunong 201 and 3114. PMID:27551288

  6. High-throughput nucleotide sequence analysis of diverse bacterial communities in leachates of decomposing pig carcasses

    PubMed Central

    Yang, Seung Hak; Lim, Joung Soo; Khan, Modabber Ahmed; Kim, Bong Soo; Choi, Dong Yoon; Lee, Eun Young; Ahn, Hee Kwon

    2015-01-01

    The leachate generated by the decomposition of animal carcass has been implicated as an environmental contaminant surrounding the burial site. High-throughput nucleotide sequencing was conducted to investigate the bacterial communities in leachates from the decomposition of pig carcasses. We acquired 51,230 reads from six different samples (1, 2, 3, 4, 6 and 14 week-old carcasses) and found that sequences representing the phylum Firmicutes predominated. The diversity of bacterial 16S rRNA gene sequences in the leachate was the highest at 6 weeks, in contrast to those at 2 and 14 weeks. The relative abundance of Firmicutes was reduced, while the proportion of Bacteroidetes and Proteobacteria increased from 3–6 weeks. The representation of phyla was restored after 14 weeks. However, the community structures between the samples taken at 1–2 and 14 weeks differed at the bacterial classification level. The trend in pH was similar to the changes seen in bacterial communities, indicating that the pH of the leachate could be related to the shift in the microbial community. The results indicate that the composition of bacterial communities in leachates of decomposing pig carcasses shifted continuously during the study period and might be influenced by the burial site. PMID:26500442

  7. Nucleotide sequences provide evidence of genetic exchange among distantly related lineages of Trypanosoma cruzi

    PubMed Central

    Machado, Carlos A.; Ayala, Francisco J.

    2001-01-01

    Simple phylogenetic tests were applied to a large data set of nucleotide sequences from two nuclear genes and a region of the mitochondrial genome of Trypanosoma cruzi, the agent of Chagas' disease. Incongruent gene genealogies manifest genetic exchange among distantly related lineages of T. cruzi. Two widely distributed isoenzyme types of T. cruzi are hybrids, their genetic composition being the likely result of genetic exchange between two distantly related lineages. The data show that the reference strain for the T. cruzi genome project (CL Brener) is a hybrid. Well-supported gene genealogies show that mitochondrial and nuclear gene sequences from T. cruzi cluster, respectively, in three or four distinct clades that do not fully correspond to the two previously defined major lineages of T. cruzi. There is clear genetic differentiation among the major groups of sequences, but genetic diversity within each major group is low. We estimate that the major extant lineages of T. cruzi have diverged during the Miocene or early Pliocene (3–16 million years ago). PMID:11416213

  8. Mining for single nucleotide polymorphisms and insertions / deletions in expressed sequence tag libraries of oil palm.

    PubMed

    Riju, Aykkal; Chandrasekar, Arumugam; Arunachalam, Vadivel

    2007-01-01

    The oil palm is a tropical oil bearing tree. Recently EST-derived SNPs and SSRs are a free by-product of the currently expanding EST (Expressed Sequence Tag) data bases. The development of high-throughput methods for the detection of SNPs (Single Nucleotide Polymorphism) and small indels (insertion / deletion) has led to a revolution in their use as molecular markers. Available (5452) Oil palm EST sequences were mined from dbEST of NCBI. CAP3 program was used to assemble EST sequences into contigs. Candidate SNPs and Indel polymorphisms were detected using the perl script auto_snip version 1.0 which has used 576 ESTs for detecting SNPs and Indel sites. We found 1180 SNP sites and 137 indel polymorphisms with frequency 1.36 SNPs / 100 bp. Among the six tissues from which the EST libraries had been generated, mesocarp had high frequency of 2.91 SNPs and indels per 100 bp whereas the zygotic embryos had lowest frequency of 0.15 per 100 bp. We also used the Shannon index to analyze the proportion of ten possible types of SNP/indels. ESTs from tissues of normal apex showed highest values of Shannon index (0.60) whereas abnormal apex had least value (0.02). The present report deals the use of Shannon index for comparing SNP/ indel frequencies mined from ESTlibraries and also confirm that the frequency of SNP occurrence in oil palm to use them as markers for genetic studies. PMID:21670789

  9. Complete nucleotide sequence of the mitochondrial genome of a salamander, Mertensiella luschani.

    PubMed

    Zardoya, Rafael; Malaga-Trillo, Edward; Veith, Michael; Meyer, Axel

    2003-10-23

    The complete nucleotide sequence (16,650 bp) of the mitochondrial genome of the salamander Mertensiella luschani (Caudata, Amphibia) was determined. This molecule conforms to the consensus vertebrate mitochondrial gene order. However, it is characterized by a long non-coding intervening sequence with two 124-bp repeats between the tRNA(Thr) and tRNA(Pro) genes. The new sequence data were used to reconstruct a phylogeny of jawed vertebrates. Phylogenetic analyses of all mitochondrial protein-coding genes at the amino acid level recovered a robust vertebrate tree in which lungfishes are the closest living relatives of tetrapods, salamanders and frogs are grouped together to the exclusion of caecilians (the Batrachia hypothesis) in a monophyletic amphibian clade, turtles show diapsid affinities and are placed as sister group of crocodiles+birds, and the marsupials are grouped together with monotremes and basal to placental mammals. The deduced phylogeny was used to characterize the molecular evolution of vertebrate mitochondrial proteins. Amino acid frequencies were analyzed across the main lineages of jawed vertebrates, and leucine and cysteine were found to be the most and least abundant amino acids in mitochondrial proteins, respectively. Patterns of amino acid replacements were conserved among vertebrates. Overall, cartilaginous fishes showed the least variation in amino acid frequencies and replacements. Constancy of rates of evolution among the main lineages of jawed vertebrates was rejected. PMID:14604788

  10. Whole genome sequencing of a single Bos taurus animal for single nucleotide polymorphism discovery

    PubMed Central

    Eck, Sebastian H; Benet-Pagès, Anna; Flisikowski, Krzysztof; Meitinger, Thomas; Fries, Ruedi; Strom, Tim M

    2009-01-01

    Background The majority of the 2 million bovine single nucleotide polymorphisms (SNPs) currently available in dbSNP have been identified in a single breed, Hereford cattle, during the bovine genome project. In an attempt to evaluate the variance of a second breed, we have produced a whole genome sequence at low coverage of a single Fleckvieh bull. Results We generated 24 gigabases of sequence, mainly using 36-bp paired-end reads, resulting in an average 7.4-fold sequence depth. This coverage was sufficient to identify 2.44 million SNPs, 82% of which were previously unknown, and 115,000 small indels. A comparison with the genotypes of the same animal, generated on a 50 k oligonucleotide chip, revealed a detection rate of 74% and 30% for homozygous and heterozygous SNPs, respectively. The false positive rate, as determined by comparison with genotypes determined for 196 randomly selected SNPs, was approximately 1.1%. We further determined the allele frequencies of the 196 SNPs in 48 Fleckvieh and 48 Braunvieh bulls. 95% of the SNPs were polymorphic with an average minor allele frequency of 24.5% and with 83% of the SNPs having a minor allele frequency larger than 5%. Conclusions This work provides the first single cattle genome by next-generation sequencing. The chosen approach - low to medium coverage re-sequencing - added more than 2 million novel SNPs to the currently publicly available SNP resource, providing a valuable resource for the construction of high density oligonucleotide arrays in the context of genome-wide association studies. PMID:19660108

  11. Detection and quantitation of single nucleotide polymorphisms, DNA sequence variations, DNA mutations, DNA damage and DNA mismatches

    DOEpatents

    McCutchen-Maloney, Sandra L.

    2002-01-01

    DNA mutation binding proteins alone and as chimeric proteins with nucleases are used with solid supports to detect DNA sequence variations, DNA mutations and single nucleotide polymorphisms. The solid supports may be flow cytometry beads, DNA chips, glass slides or DNA dips sticks. DNA molecules are coupled to solid supports to form DNA-support complexes. Labeled DNA is used with unlabeled DNA mutation binding proteins such at TthMutS to detect DNA sequence variations, DNA mutations and single nucleotide length polymorphisms by binding which gives an increase in signal. Unlabeled DNA is utilized with labeled chimeras to detect DNA sequence variations, DNA mutations and single nucleotide length polymorphisms by nuclease activity of the chimera which gives a decrease in signal.

  12. A Parallel Non-Alignment Based Approach to Efficient Sequence Comparison using Longest Common Subsequences

    NASA Astrophysics Data System (ADS)

    Bhowmick, S.; Shafiullah, M.; Rai, H.; Bastola, D.

    2010-11-01

    Biological sequence comparison programs have revolutionized the practice of biochemistry, and molecular and evolutionary biology. Pairwise comparison of genomic sequences is a popular method of choice for analyzing genetic sequence data. However the quality of results from most sequence comparison methods are significantly affected by small perturbations in the data and furthermore, there is a dearth of computational tools to compare sequences beyond a certain length. In this paper, we describe a parallel algorithm for comparing genetic sequences using an alignment free-method based on computing the Longest Common Subsequence (LCS) between genetic sequences. We validate the quality of our results by comparing the phylogenetic tress obtained from ClustalW and LCS. We also show through complexity analysis of the isoefficiency and by empirical measurement of the running time that our algorithm is very scalable.

  13. Ultra-large alignments using phylogeny-aware profiles.

    PubMed

    Nguyen, Nam-Phuong D; Mirarab, Siavash; Kumar, Keerthana; Warnow, Tandy

    2015-01-01

    Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp . PMID:26076734

  14. Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages.

    PubMed

    Jayaswal, Vivek; Wong, Thomas K F; Robinson, John; Poladian, Leon; Jermiin, Lars S

    2014-09-01

    Molecular phylogenetic studies of homologous sequences of nucleotides often assume that the underlying evolutionary process was globally stationary, reversible, and homogeneous (SRH), and that a model of evolution with one or more site-specific and time-reversible rate matrices (e.g., the GTR rate matrix) is enough to accurately model the evolution of data over the whole tree. However, an increasing body of data suggests that evolution under these conditions is an exception, rather than the norm. To address this issue, several non-SRH models of molecular evolution have been proposed, but they either ignore heterogeneity in the substitution process across sites (HAS) or assume it can be modeled accurately using the distribution. As an alternative to these models of evolution, we introduce a family of mixture models that approximate HAS without the assumption of an underlying predefined statistical distribution. This family of mixture models is combined with non-SRH models of evolution that account for heterogeneity in the substitution process across lineages (HAL). We also present two algorithms for searching model space and identifying an optimal model of evolution that is less likely to over- or underparameterize the data. The performance of the two new algorithms was evaluated using alignments of nucleotides with 10 000 sites simulated under complex non-SRH conditions on a 25-tipped tree. The algorithms were found to be very successful, identifying the correct HAL model with a 75% success rate (the average success rate for assigning rate matrices to the tree's 48 edges was 99.25%) and, for the correct HAL model, identifying the correct HAS model with a 98% success rate. Finally, parameter estimates obtained under the correct HAL-HAS model were found to be accurate and precise. The merits of our new algorithms were illustrated with an analysis of 42 337 second codon sites extracted from a concatenation of 106 alignments of orthologous genes encoded by the nuclear

  15. Simultaneous Bayesian Estimation of Alignment and Phylogeny under a Joint Model of Protein Sequence and Structure

    PubMed Central

    Herman, Joseph L.; Challis, Christopher J.; Novák, Ádám; Hein, Jotun; Schmidler, Scott C.

    2014-01-01

    For sequences that are highly divergent, there is often insufficient information to infer accurate alignments, and phylogenetic uncertainty may be high. One way to address this issue is to make use of protein structural information, since structures generally diverge more slowly than sequences. In this work, we extend a recently developed stochastic model of pairwise structural evolution to multiple structures on a tree, analytically integrating over ancestral structures to permit efficient likelihood computations under the resulting joint sequence–structure model. We observe that the inclusion of structural information significantly reduces alignment and topology uncertainty, and reduces the number of topology and alignment errors in cases where the true trees and alignments are known. In some cases, the inclusion of structure results in changes to the consensus topology, indicating that structure may contain additional information beyond that which can be obtained from sequences. We use the model to investigate the order of divergence of cytoglobins, myoglobins, and hemoglobins and observe a stabilization of phylogenetic inference: although a sequence-based inference assigns significant posterior probability to several different topologies, the structural model strongly favors one of these over the others and is more robust to the choice of data set. PMID:24899668

  16. Assessing Activity Pattern Similarity with Multidimensional Sequence Alignment based on a Multiobjective Optimization Evolutionary Algorithm

    PubMed Central

    Kwan, Mei-Po; Xiao, Ningchuan; Ding, Guoxiang

    2015-01-01

    Due to the complexity and multidimensional characteristics of human activities, assessing the similarity of human activity patterns and classifying individuals with similar patterns remains highly challenging. This paper presents a new and unique methodology for evaluating the similarity among individual activity patterns. It conceptualizes multidimensional sequence alignment (MDSA) as a multiobjective optimization problem, and solves this problem with an evolutionary algorithm. The study utilizes sequence alignment to code multiple facets of human activities into multidimensional sequences, and to treat similarity assessment as a multiobjective optimization problem that aims to minimize the alignment cost for all dimensions simultaneously. A multiobjective optimization evolutionary algorithm (MOEA) is used to generate a diverse set of optimal or near-optimal alignment solutions. Evolutionary operators are specifically designed for this problem, and a local search method also is incorporated to improve the search ability of the algorithm. We demonstrate the effectiveness of our method by comparing it with a popular existing method called ClustalG using a set of 50 sequences. The results indicate that our method outperforms the existing method for most of our selected cases. The multiobjective evolutionary algorithm presented in this paper provides an effective approach for assessing activity pattern similarity, and a foundation for identifying distinctive groups of individuals with similar activity patterns. PMID:26190858

  17. Real-time single-molecule electronic DNA sequencing by synthesis using polymer-tagged nucleotides on a nanopore array

    PubMed Central

    Fuller, Carl W.; Kumar, Shiv; Porel, Mintu; Chien, Minchen; Bibillo, Arek; Stranges, P. Benjamin; Dorwart, Michael; Tao, Chuanjuan; Li, Zengmin; Guo, Wenjing; Shi, Shundi; Korenblum, Daniel; Trans, Andrew; Aguirre, Anne; Liu, Edward; Harada, Eric T.; Pollard, James; Bhat, Ashwini; Cech, Cynthia; Yang, Alexander; Arnold, Cleoma; Palla, Mirkó; Hovis, Jennifer; Chen, Roger; Morozova, Irina; Kalachikov, Sergey; Russo, James J.; Kasianowicz, John J.; Davis, Randy; Roever, Stefan; Church, George M.; Ju, Jingyue

    2016-01-01

    DNA sequencing by synthesis (SBS) offers a robust platform to decipher nucleic acid sequences. Recently, we reported a single-molecule nanopore-based SBS strategy that accurately distinguishes four bases by electronically detecting and differentiating four different polymer tags attached to the 5′-phosphate of the nucleotides during their incorporation into a growing DNA strand catalyzed by DNA polymerase. Further developing this approach, we report here the use of nucleotides tagged at the terminal phosphate with oligonucleotide-based polymers to perform nanopore SBS on an α-hemolysin nanopore array platform. We designed and synthesized several polymer-tagged nucleotides using tags that produce different electrical current blockade levels and verified they are active substrates for DNA polymerase. A highly processive DNA polymerase was conjugated to the nanopore, and the conjugates were complexed with primer/template DNA and inserted into lipid bilayers over individually addressable electrodes of the nanopore chip. When an incoming complementary-tagged nucleotide forms a tight ternary complex with the primer/template and polymerase, the tag enters the pore, and the current blockade level is measured. The levels displayed by the four nucleotides tagged with four different polymers captured in the nanopore in such ternary complexes were clearly distinguishable and sequence-specific, enabling continuous sequence determination during the polymerase reaction. Thus, real-time single-molecule electronic DNA sequencing data with single-base resolution were obtained. The use of these polymer-tagged nucleotides, combined with polymerase tethering to nanopores and multiplexed nanopore sensors, should lead to new high-throughput sequencing methods. PMID:27091962

  18. Real-time single-molecule electronic DNA sequencing by synthesis using polymer-tagged nucleotides on a nanopore array.

    PubMed

    Fuller, Carl W; Kumar, Shiv; Porel, Mintu; Chien, Minchen; Bibillo, Arek; Stranges, P Benjamin; Dorwart, Michael; Tao, Chuanjuan; Li, Zengmin; Guo, Wenjing; Shi, Shundi; Korenblum, Daniel; Trans, Andrew; Aguirre, Anne; Liu, Edward; Harada, Eric T; Pollard, James; Bhat, Ashwini; Cech, Cynthia; Yang, Alexander; Arnold, Cleoma; Palla, Mirkó; Hovis, Jennifer; Chen, Roger; Morozova, Irina; Kalachikov, Sergey; Russo, James J; Kasianowicz, John J; Davis, Randy; Roever, Stefan; Church, George M; Ju, Jingyue

    2016-05-10

    DNA sequencing by synthesis (SBS) offers a robust platform to decipher nucleic acid sequences. Recently, we reported a single-molecule nanopore-based SBS strategy that accurately distinguishes four bases by electronically detecting and differentiating four different polymer tags attached to the 5'-phosphate of the nucleotides during their incorporation into a growing DNA strand catalyzed by DNA polymerase. Further developing this approach, we report here the use of nucleotides tagged at the terminal phosphate with oligonucleotide-based polymers to perform nanopore SBS on an α-hemolysin nanopore array platform. We designed and synthesized several polymer-tagged nucleotides using tags that produce different electrical current blockade levels and verified they are active substrates for DNA polymerase. A highly processive DNA polymerase was conjugated to the nanopore, and the conjugates were complexed with primer/template DNA and inserted into lipid bilayers over individually addressable electrodes of the nanopore chip. When an incoming complementary-tagged nucleotide forms a tight ternary complex with the primer/template and polymerase, the tag enters the pore, and the current blockade level is measured. The levels displayed by the four nucleotides tagged with four different polymers captured in the nanopore in such ternary complexes were clearly distinguishable and sequence-specific, enabling continuous sequence determination during the polymerase reaction. Thus, real-time single-molecule electronic DNA sequencing data with single-base resolution were obtained. The use of these polymer-tagged nucleotides, combined with polymerase tethering to nanopores and multiplexed nanopore sensors, should lead to new high-throughput sequencing methods. PMID:27091962

  19. Recombination-Independent Recognition of DNA Homology for Repeat-Induced Point Mutation (RIP) Is Modulated by the Underlying Nucleotide Sequence.

    PubMed

    Gladyshev, Eugene; Kleckner, Nancy

    2016-05-01

    Haploid germline nuclei of many filamentous fungi have the capacity to detect homologous nucleotide sequences present on the same or different chromosomes. Once recognized, such sequences can undergo cytosine methylation or cytosine-to-thymine mutation specifically over the extent of shared homology. In Neurospora crassa this process is known as Repeat-Induced Point mutation (RIP). Previously, we showed that RIP did not require MEI-3, the only RecA homolog in Neurospora, and that it could detect homologous trinucleotides interspersed with a matching periodicity of 11 or 12 base-pairs along participating chromosomal segments. This pattern was consistent with a mechanism of homology recognition that involved direct interactions between co-aligned double-stranded (ds) DNA molecules, where sequence-specific dsDNA/dsDNA contacts could be established using no more than one triplet per turn. In the present study we have further explored the DNA sequence requirements for RIP. In our previous work, interspersed homologies were always examined in the context of a relatively long adjoining region of perfect homology. Using a new repeat system lacking this strong interaction, we now show that interspersed homologies with overall sequence identity of only 36% can be efficiently detected by RIP in the absence of any perfect homology. Furthermore, in this new system, where the total amount of homology is near the critical threshold required for RIP, the nucleotide composition of participating DNA molecules is identified as an important factor. Our results specifically pinpoint the triplet 5'-GAC-3' as a particularly efficient unit of homology recognition. Finally, we present experimental evidence that the process of homology sensing can be uncoupled from the downstream mutation. Taken together, our results advance the notion that sequence information can be compared directly between double-stranded DNA molecules during RIP and, potentially, in other processes where homologous

  20. Recombination-Independent Recognition of DNA Homology for Repeat-Induced Point Mutation (RIP) Is Modulated by the Underlying Nucleotide Sequence

    PubMed Central

    Kleckner, Nancy

    2016-01-01

    Haploid germline nuclei of many filamentous fungi have the capacity to detect homologous nucleotide sequences present on the same or different chromosomes. Once recognized, such sequences can undergo cytosine methylation or cytosine-to-thymine mutation specifically over the extent of shared homology. In Neurospora crassa this process is known as Repeat-Induced Point mutation (RIP). Previously, we showed that RIP did not require MEI-3, the only RecA homolog in Neurospora, and that it could detect homologous trinucleotides interspersed with a matching periodicity of 11 or 12 base-pairs along participating chromosomal segments. This pattern was consistent with a mechanism of homology recognition that involved direct interactions between co-aligned double-stranded (ds) DNA molecules, where sequence-specific dsDNA/dsDNA contacts could be established using no more than one triplet per turn. In the present study we have further explored the DNA sequence requirements for RIP. In our previous work, interspersed homologies were always examined in the context of a relatively long adjoining region of perfect homology. Using a new repeat system lacking this strong interaction, we now show that interspersed homologies with overall sequence identity of only 36% can be efficiently detected by RIP in the absence of any perfect homology. Furthermore, in this new system, where the total amount of homology is near the critical threshold required for RIP, the nucleotide composition of participating DNA molecules is identified as an important factor. Our results specifically pinpoint the triplet 5'-GAC-3' as a particularly efficient unit of homology recognition. Finally, we present experimental evidence that the process of homology sensing can be uncoupled from the downstream mutation. Taken together, our results advance the notion that sequence information can be compared directly between double-stranded DNA molecules during RIP and, potentially, in other processes where homologous

  1. Cloning and nucleotide sequence of the gene coding for citrate synthase from a thermotolerant Bacillus sp

    SciTech Connect

    Schendel, F.J.; August, P.R.; Anderson, C.R.; Flickinger, M.C. ); Hanson, R.S. )

    1992-01-01

    Acetate salts are emerging as potentially attractive bulk chemicals for a variety of environmental applications, for example, as catalysts to facilitate combustion of high-sulfur coal by electrical utilities and as the biodegradable noncorrosive highway deicing salt calcium magnesium acetate. The structural gene coding for citrate synthase from the gram-positive soil isolate Bacillus sp. strain C4 (ATCC 55182) capable of secreting acetic acid at pH 5.0 to 7.0 in the presence of dolime has been cloned from a genomic library by complementation of an Escherichia coli auxotrophic mutant lacking citrate synthase. The nucleotide sequence of the entire 3.1-kb HindIII fragment has been determined, and one major open reading frame was found coding for citrate synthase (ctsA). Citrate synthase from Bacillus sp. strain C4 was found to be a dimer (M{sub r}, 84,500) with a sub unit with an M{sub r} of 42,000. The N-terminal sequence was found to be identical with that predicted from the gene sequence. The kinetics were best fit to a bisubstrate enzyme with an ordered mechanism. Bacillus sp. strain C4 citrate synthase was not activated by potassium chloride and was not inhibited by NADH, ATP, ADP, or AMP at levels up to 1 mM. The predicted amino acid sequence was compared with that of the E. coli, Acinetobacter anitratum, Pseudomonas aeruginosa, Rickettsia prowazekii, porcine heart, and Saccharomyces cerevisiae cytoplasmic and mitochondrial enzymes.

  2. Human secreted carbonic anhydrase: cDNA cloning, nucleotide sequence, and hybridization histochemistry

    SciTech Connect

    Aldred, P.; Fu, Ping; Barrett, G.; Penschow, J.D.; Wright, R.D.; Coghlan, J.P.; Fernley, R.T. )

    1991-01-01

    Complementary DNA clones coding for the human secreted carbonic anhydrase isozyme (CAVI) have been isolated and their nucleotide sequences determined. These clones identify a 1.45-kb mRNA that is present in high levels in parotid submandibular salivary glands but absent in other tissues such as the sublingual gland, kidney, liver, and prostate gland. Hybridization histochemistry of human salivary glands shows mRNA for CA VI located in the acinar cells of these glands. The cDNA clones encode a protein of 308 amino acids that includes a 17 amino acid leader sequence typical of secreted proteins. The mature protein has 291 amino acids compared to 259 or 260 for the cytoplasmic isozymes, with most of the extra amino acids present as a carboxyl terminal extension. In comparison, sheep CA VI has a 45 amino acid extension. Overall the human CA VI protein has a sequence identity of 35 {percent} with human CA II, while residues involved in the active site of the enzymes have been conserved. The human and sheep secreted carbonic anhydrases have a sequence identity of 72 {percent}. This includes the two cysteine residues that are known to be involved in an intramolecular disulfide bond in the sheep CA VI. The enzyme is known to be glycosylated and three potential N-glycosylation sites (Asn-X-Thr/Ser) have been identified. Two of these are known to be glycosylated in sheep CA VI. Southern analysis of human DNA indicates that there is only one gene coding for CA VI.

  3. Complete nucleotide sequence and experimental host range of Okra mosaic virus.

    PubMed

    Stephan, Dirk; Siddiqua, Mahbuba; Ta Hoang, Anh; Engelmann, Jill; Winter, Stephan; Maiss, Edgar

    2008-02-01

    Okra mosaic virus (OkMV) is a tymovirus infecting members of the family Malvaceae. Early infections in okra (Abelmoschus esculentus) lead to yield losses of 12-19.5%. Besides intensive biological characterizations of OkMV only minor molecular data were available. Therefore, we determined the complete nucleotide sequence of a Nigerian isolate of OkMV. The complete genomic RNA (gRNA) comprises 6,223 nt and its genome organization showed three major ORFs coding for a putative movement protein (MP) of M r 73.1 kDa, a large replication-associated protein (RP) of M r 202.4 kDa and a coat protein (CP) of M r 19.6 kDa. Prediction of secondary RNA structures showed three hairpin structures with internal loops in the 5'-untranslated region (UTR) and a 3'-terminal tRNA-like structure (TLS) which comprises the anticodon for valine, typical for a member of the genus Tymovirus. Phylogenetic comparisons based on the RP, MP and CP amino acid sequences showed the close relationship of OkMV not only to other completely sequenced tymoviruses like Kennedya yellow mosaic virus (KYMV), Turnip yellow mosaic virus (TYMV) and Erysimum latent virus (ErLV), but also to Calopogonium yellow vein virus (CalYVV), Clitoria yellow vein virus (CYVV) and Desmodium yellow mottle virus (DYMoV). This is the first report of a complete OkMV genome sequence from one of the various OkMV isolates originating from West Africa described so far. Additionally, the experimental host range of OkMV including several Nicotiana species was determined. PMID:18049886

  4. The qa repressor gene of Neurospora crassa: wild-type and mutant nucleotide sequences.

    PubMed Central

    Huiet, L; Giles, N H

    1986-01-01

    The qa-1S gene, one of two regulatory genes in the qa gene cluster of Neurospora crassa, encodes the qa repressor. The qa-1S gene together with the qa-1F gene, which encodes the qa activator protein, control the expression of all seven qa genes, including those encoding the inducible enzymes responsible for the utilization of quinic acid as a carbon source. The nucleotide sequence of the qa-1S gene and its flanking regions has been determined. The deduced coding sequence for the qa-1S protein encodes 918 amino acids with a calculated molecular weight of 100,650 and is interrupted by a single 66-base-pair intervening sequence. Both constitutive and noninducible mutants occur in the qa-1S gene and two different mutations of each type have been cloned and sequenced. All four mutations occur within the predicted coding region of the qa-1S gene. This result strongly supports the hypothesis that the qa-1S gene encodes a repressor. All four mutations are located within codons for the last 300 amino acids of the qa-1S protein. The mutations in three of the mutants involve amino acid substitutions, while the fourth mutant, which has a constitutive phenotype, contains a frameshift mutation. The two constitutive mutations occur in the most distal region of the gene, possibly implicating the COOH-terminal region of the qa repressor in binding to its target. The two noninducible mutations occur in a region proximal to the constitutive mutations, possibly implicating this region of the qa repressor in binding the inducer. Images PMID:3010294

  5. Nucleotide sequence and expression of the capsid protein gene of feline calicivirus.

    PubMed Central

    Neill, J D; Reardon, I M; Heinrikson, R L

    1991-01-01

    The sequence of the 3'-terminal 2,486 bases of the feline calicivirus (FCV) genome was determined. This region of the FCV genome, from which the 2.4-kb subgenomic RNA is derived, contained two open reading frames. The larger open reading frame, found in the 5' end of the subgenomic mRNA, contained 2,004 bases encoding a polypeptide of 73,467 Da. The smaller open reading frame, encoded in the 3' end of the mRNA, was composed of 318 bases, encoding a polypeptide of 12,185 Da. The AUG initiation codon of the second open reading frame overlapped the UGA termination codon of the first, with the sequence AUGA. The nucleotide sequence of the region containing this overlap resembles the -1 frameshift sequences of the retroviruses. The 5' end of the 2.4-kb subgenomic RNA was mapped by primer extension analysis. There were two apparent transcription initiation points, both of which were 5' to the AUG initiation codon of the large open reading frame. Transcription from these sites yielded RNA transcripts with 5' nontranslated leader regions of 17 and 18 bases. The total length of the 2.4-kb subgenomic RNA was 2,375 bases (from the 5'-most start site) excluding the poly(A) tail. Edman degradation of the purified capsid protein of FCV showed that the capsid protein was encoded by the large open reading frame. Western immunoblot analysis of FCV-infected cells using a feline anti-FCV antiserum demonstrated that translation of the capsid protein was detectable at 3 h postinfection and continued to accumulate until 8 h postinfection, the last time examined. Images PMID:1716692

  6. Probabilistic sequence alignment of Late Pleistocene benthic δ18O data

    NASA Astrophysics Data System (ADS)

    Lawrence, C.; Lin, L.; Lisiecki, L. E.; Stern, J.

    2013-12-01

    The stratigraphic alignment of ocean sediment cores plays a vital role in paleoceanographic research because it is used to develop mutually consistent age models for climate proxies measured in these cores. The most common proxy used for alignment is the The stratigraphic alignment of ocean sediment cores plays a vital role in paleoceanographic research because it is used to develop mutually consistent age models for climate proxies measured in these cores. The most common proxy used for alignment is the δ18O of calcite from benthic or planktonic foraminifera because a large fraction of δ18O variance derives from the global signal of ice volume. To date, alignment has been performed either by manual, qualitative comparison or by deterministic algorithms (Martinson, Pisias et al. Quat. Res. 27 1987; Lisiecki and Lisiecki Paleoceanography 17, 2002; Huybers and Wunsch, Paleoceanography 19, 2004). Here we present a probabilistic sequence alignment algorithm which provides 95% confidence bands for the alignment of pairs of benthic δ18O records. The probabilistic algorithm presented here is based on a hidden Markov model (HMM) (Levinson, Rabiner et al. Bell Systems Technical Journal, 62,1983) similar to those that have been used extensively to align DNA and protein sequences (Durbin, Eddy et al. Biological Sequence Analysis, Ch. 4, 1998). However, here the need to the alignment of sequences stems from expansion and/or contraction in the records due to changes in sedimentation rates rather than the insertion or deletion of residues. Transition probabilities that are used in this HMM to model changes in sedimentation rates are based on radiocarbon estimates of sedimentation rates. The probabilistic algorithm considers all possible alignments with these predefined sedimentation rates. Exact calculations are completed using dynamic programming recursions. The algorithm yields the probability distributions of the age at each point in the record, which are probabilistically

  7. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer

    PubMed Central

    Bernard, Guillaume; Chan, Cheong Xin; Ragan, Mark A.

    2016-01-01

    Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution. PMID:27363362

  8. Phylo-VISTA: An Interactive Visualization Tool for Multiple DNA Sequence Alignments

    SciTech Connect

    Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.; Brudno, Michael; Batzoglou, Serafim; Bethel, E. Wes; Rubin, Edward M.; Hamann, Bernd; Dubchak, Inna

    2004-04-01

    We have developed Phylo-VISTA (Shah et al., 2003), an interactive software tool for analyzing multiple alignments by visualizing a similarity measure for DNA sequences of multiple species. The complexity of visual presentation is effectively organized using a framework based upon inter-species phylogenetic relationships. The phylogenetic organization supports rapid, user-guided inter-species comparison. To aid in navigation through large sequence datasets, Phylo-VISTA provides a user with the ability to select and view data at varying resolutions. The combination of multi-resolution data visualization and analysis, combined with the phylogenetic framework for inter-species comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments.

  9. Nucleotide sequence of the 3'-noncoding region of alfalfa mosaic virus RNA 4 and its homology with the genomic RNAs.

    PubMed Central

    Koper-Zwarthoff, E C; Brederode, F T; Walstra, P; Bol, J F

    1979-01-01

    A 226-nucleotide fragment was derived from alfalfa mosaic virus RNA 4 (ALMV RNA 4), the subgenomic messenger for viral coat protein, and its sequence was deduced by in vitro labeling with polynucleotide kinase and application of RNA sequencing techniques. The fragment contains the 3'-terminal 45 nucleotides of the coat protein cistron and the complete 3'-noncoding region of 182 nucleotides. The total length of RNA 4 was calculated to be 881 nucleotides. AlMV RNAs 1, 2 and 3 were elongated with a 3'-terminal poly(A) stretch and subjected to sequence analysis by using a specific primer, reverse transcriptase and chain terminators. This revealed and extensive homology between the 3'-terminal 140 to 150 nucleotides of all four ALMV RNAs. Despite a number of base substitutions, the secondary structure of the homologous region is highly conserved. The observed homology indicates that, as with RNA 4, the sites with a high affinity for the viral coat protein are located at the 3'-termini of the genomic RNAs. Images PMID:537914

  10. Nucleotide sequence of the Kaposi sarcoma-associated herpesvirus (HHV8)

    PubMed Central

    Russo, James J.; Bohenzky, Roy A.; Chien, Ming-Cheng; Chen, Jing; Yan, Ming; Maddalena, Dawn; Parry, J. Preston; Peruzzi, Daniela; Edelman, Isidore S.; Chang, Yuan; Moore, Patrick S.

    1996-01-01

    The genome of the Kaposi sarcoma-associated herpesvirus (KSHV or HHV8) was mapped with cosmid and phage genomic libraries from the BC-1 cell line. Its nucleotide sequence was determined except for a 3-kb region at the right end of the genome that was refractory to cloning. The BC-1 KSHV genome consists of a 140.5-kb-long unique coding region flanked by multiple G+C-rich 801-bp terminal repeat sequences. A genomic duplication that apparently arose in the parental tumor is present in this cell culture-derived strain. At least 81 ORFs, including 66 with homology to herpesvirus saimiri ORFs, and 5 internal repeat regions are present in the long unique region. The virus encodes homologs to complement-binding proteins, three cytokines (two macrophage inflammatory proteins and interleukin 6), dihydrofolate reductase, bcl-2, interferon regulatory factors, interleukin 8 receptor, neural cell adhesion molecule-like adhesin, and a D-type cyclin, as well as viral structural and metabolic proteins. Terminal repeat analysis of virus DNA from a KS lesion suggests a monoclonal expansion of KSHV in the KS tumor. PMID:8962146

  11. Complete nucleotide sequence of rose yellow leaf virus, a new member of the family Tombusviridae.

    PubMed

    Mollov, Dimitre; Lockhart, Ben; Zlesak, David C

    2014-10-01

    The genome of the rose yellow leaf virus (RYLV) has been determined to be 3918 nucleotides long and to contain seven open reading frames (ORFs). ORF1 encodes a 27-kDa peptide (p27). ORF2 shares a common start codon with ORF1 and continues through the amber stop codon of p27 to encode an 87-kDa (p87) protein that has amino acid similarity to the RNA-dependent RNA polymerase (RdRp) of members of the family Tombusviridae. ORFs 3 and 4 have no significant amino acid similarity to known functional viral ORFs. ORF5 encodes a 6-kDa (p6) protein that has similarity to movement proteins of members of the Tombusviridae. ORF5A has no conventional start codon and overlaps with p6. A putative +1 frameshift mechanism allows p6 translation to continue through the stop codon and results in a 12-kDa protein that has high homology to the carmovirus p13 movement protein. The 37-kDa protein encoded by ORF6 has amino acid sequence similarity to coat proteins (CP) of members of the Tombusviridae. ORF7 has no significant amino acid similarity to known viral ORFs. Phylogenetic analysis of the RdRp amino acid sequences grouped RYLV together with the unclassified Rosa rugosa leaf distortion virus (RrLDV), pelargonium line pattern virus (PLPV), and pelargonium chlorotic ring pattern virus (PCRPV) in a distinct subgroup of the family Tombusviridae. PMID:24838852

  12. Use of nucleotide sequence data to identify a microsporidian pathogen of Pieris rapae (Lepidoptera, Pieridae).

    PubMed

    Malone, L A; McIvor, C A

    1996-11-01

    Nucleotide sequence was determined for a portion of genomic DNA which spans the V4 variable region of the small subunit ribosomal RNA gene of an unidentified microsporidium from the cabbage white butterfly, Pieris rapae (174 base pairs). Comparison with equivalent sequence data obtained here for two other microsporidian species, Nosema bombycis (240 base pairs) and Nosema bombi (200 base pairs), and from the GenBank database for 11 other microsporidian species suggests that the unidentified species from P. rapae is most closely related to some Vairimorpha species. Light and electron microscopic observations of the developmental stages of this parasite were in accord with this. Infection experiments conducted at 20 and 26 degrees C demonstrated temperature-dependent dimorphism, with the production of both binucleate free spores (mean dimensions: 3.8 x 1.8 microns; 10-13 polar filament coils) and membrane-bound uninucleate octospores (mean dimensions: 3.1 x 1.9 microns). Macrospores (mean dimensions 8.0 x 2.1 microns) were also observed. Sites of infection were the gut epithelium, the Malpighian tubules, the salivary glands, and the fat body. Infections were found in all insect life stages, including the egg. This microsporidium was found to be indistinguishable from both Nosema mesnili (Paillot) and Microsporidium (Thelohania) mesnili (Paillot) and we propose that these species be combined and transferred to the genus Vairimorpha Pilley. PMID:8931362

  13. Predicting Mendelian Disease-Causing Non-Synonymous Single Nucleotide Variants in Exome Sequencing Studies

    PubMed Central

    Bao, Su-Ying; Yang, Wanling; Ho, Shu-Leong; Song, Yong-Qiang; Sham, Pak C.

    2013-01-01

    Exome sequencing is becoming a standard tool for mapping Mendelian disease-causing (or pathogenic) non-synonymous single nucleotide variants (nsSNVs). Minor allele frequency (MAF) filtering approach and functional prediction methods are commonly used to identify candidate pathogenic mutations in these studies. Combining multiple functional prediction methods may increase accuracy in prediction. Here, we propose to use a logit model to combine multiple prediction methods and compute an unbiased probability of a rare variant being pathogenic. Also, for the first time we assess the predictive power of seven prediction methods (including SIFT, PolyPhen2, CONDEL, and logit) in predicting pathogenic nsSNVs from other rare variants, which reflects the situation after MAF filtering is done in exome-sequencing studies. We found that a logit model combining all or some original prediction methods outperforms other methods examined, but is unable to discriminate between autosomal dominant and autosomal recessive disease mutations. Finally, based on the predictions of the logit model, we estimate that an individual has around 5% of rare nsSNVs that are pathogenic and carries ∼22 pathogenic derived alleles at least, which if made homozygous by consanguineous marriages may lead to recessive diseases. PMID:23341771

  14. Predicting mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies.

    PubMed

    Li, Miao-Xin; Kwan, Johnny S H; Bao, Su-Ying; Yang, Wanling; Ho, Shu-Leong; Song, Yong-Qiang; Sham, Pak C

    2013-01-01

    Exome sequencing is becoming a standard tool for mapping Mendelian disease-causing (or pathogenic) non-synonymous single nucleotide variants (nsSNVs). Minor allele frequency (MAF) filtering approach and functional prediction methods are commonly used to identify candidate pathogenic mutations in these studies. Combining multiple functional prediction methods may increase accuracy in prediction. Here, we propose to use a logit model to combine multiple prediction methods and compute an unbiased probability of a rare variant being pathogenic. Also, for the first time we assess the predictive power of seven prediction methods (including SIFT, PolyPhen2, CONDEL, and logit) in predicting pathogenic nsSNVs from other rare variants, which reflects the situation after MAF filtering is done in exome-sequencing studies. We found that a logit model combining all or some original prediction methods outperforms other methods examined, but is unable to discriminate between autosomal dominant and autosomal recessive disease mutations. Finally, based on the predictions of the logit model, we estimate that an individual has around 5% of rare nsSNVs that are pathogenic and carries ~22 pathogenic derived alleles at least, which if made homozygous by consanguineous marriages may lead to recessive diseases. PMID:23341771

  15. Mutations in core nucleotide sequence of hepatitis B virus correlate with fulminant and severe hepatitis.

    PubMed Central

    Ehata, T; Omata, M; Chuang, W L; Yokosuka, O; Ito, Y; Hosoda, K; Ohto, M

    1993-01-01

    Infection with hepatitis B virus leads to a wide spectrum of liver injury, including self-limited acute hepatitis, fulminant hepatitis, and chronic hepatitis with progression to cirrhosis or acute exacerbation to liver failure, as well as an asymptomatic chronic carrier state. Several studies have suggested that the hepatitis B core antigen could be an immunological target of cytotoxic T lymphocytes. To investigate the reason why the extreme immunological attack occurred in fulminant hepatitis and severe exacerbation patients, the entire precore and core region of hepatitis B virus DNA was sequenced in 24 subjects (5 fulminant, 10 severe fatal exacerbation, and 9 self-limited acute hepatitis patients). No significant change in the nucleotide sequence and deduced amino acid residue was noted in the nine self-limited acute hepatitis patients. In contrast, clustering changes in a small segment of 16 amino acids (codon 84-99 from the start of the core gene) in all seven adr subtype infected fulminant and severe exacerbation patients was found. A different segment with clustering substitutions (codon 48-60) was also found in seven of eight adw subtype infected fulminant and severe exacerbation patients. Of the 15 patients, 2 lacked precore stop mutation which was previously reported to be associated with fulminant hepatitis. These data suggest that these core regions with mutations may play an important role in the pathogenesis of hepatitis B viral disease, and such mutations are related to severe liver damage. Images PMID:8450049

  16. BIND - an algorithm for loss-less compression of nucleotide sequence data.

    PubMed

    Bose, Tungadri; Mohammed, Monzoorul Haque; Dutta, Anirban; Mande, Sharmila S

    2012-09-01

    Recent advances in DNA sequencing technologies have enabled the current generation of life science researchers to probe deeper into the genomic blueprint. The amount of data generated by these technologies has been increasing exponentially since the last decade. Storage, archival and dissemination of such huge data sets require efficient solutions, both from the hardware as well as software perspective. The present paper describes BIND-an algorithm specialized for compressing nucleotide sequence data. By adopting a unique 'block-length' encoding for representing binary data (as a key step), BIND achieves significant compression gains as compared to the widely used general purpose compression algorithms (gzip, bzip2 and lzma). Moreover, in contrast to implementations of existing specialized genomic compression approaches, the implementation of BIND is enabled to handle non-ATGC and lowercase characters. This makes BIND a loss-less compression approach that is suitable for practical use. More importantly, validation results of BIND (with real-world data sets) indicate reasonable speeds of compression and decompression that can be achieved with minimal processor/ memory usage. BIND is available for download at http://metagenomics.atc.tcs.com/compression/BIND. No license is required for academic or non-profit use. PMID:22922203

  17. Nucleotide sequence and phylogenetic analysis of a new potexvirus: Malva mosaic virus.

    PubMed

    Côté, Fabien; Paré, Christine; Majeau, Nathalie; Bolduc, Marilène; Leblanc, Eric; Bergeron, Michel G; Bernardy, Michael G; Leclerc, Denis

    2008-01-01

    A filamentous virus isolated from Malva neglecta Wallr. (common mallow) and propagated in Chenopodium quinoa was grown, cloned and the complete nucleotide sequence was determined (GenBank accession # DQ660333). The genomic RNA is 6858 nt in length and contains five major open reading frames (ORFs). The genomic organization is similar to members and the viral encoded proteins shared homology with the group of the Potexvirus genus in the Flexiviridae family. Phylogenetic analysis revealed a close relationship with narcissus mosaic virus (NMV), scallion virus X (ScaVX) and, to a lesser extent, to Alstroemeria virus X (AlsVX) and pepino mosaic virus (PepMV). A novel putative pseudoknot structure is predicted in the 3'-UTR of a subgroup of potexviruses, including this newly described virus. The consensus GAAAA sequence is detected at the 5'-end of the genomic RNA and experimental data strongly suggest that this motif could be a distinctive hallmark of this genus. The name Malva mosaic virus is proposed. PMID:18054524

  18. Nucleotide sequence and expression of alpha-glucosidase-encoding gene (agdA) from Aspergillus oryzae.

    PubMed

    Minetoki, T; Gomi, K; Kitamoto, K; Kumagai, C; Tamura, G

    1995-08-01

    We have isolated an alpha-glucosidase(AGL)-encoding gene (agdA) from Aspergillus oryzae by heterologous hybridization using the corresponding Aspergillus niger gene as a probe. Southern hybridization analysis showed that the agdA gene is on a 5.0-kb ScaI fragment and there is a single copy in the A. oryzae chromosome. Comparison with the A. niger agdA gene indicated that the agdA gene contains three putative introns from 52 to 59 nucleotides long, and that it encodes 985 amino acid residues. The deduced amino acid sequence of A. oryzae AGL is 78% homologous with the A. niger AGL. The high degree of homology with the amino acid sequence bordering the putative catalytic residue of a number of AGL enzymes, and this enzyme suggests that Asp492 is a catalytic residue of A. oryzae AGL. The cloned gene was functional. Transformants of A. oryzae containing multiple copies of the cloned agdA gene showed a 6-16 fold increase in AGL activity. Like the Taka-amylase A and glucoamylase genes of A. oryzae, expression of the agdA gene was induced when maltose was provided as a carbon source, but expression was not induced by glucose. This result suggested that cis-element(s) involved in maltose induction may be also present in the agdA promoter region. PMID:7549103

  19. A parallel approach of COFFEE objective function to multiple sequence alignment

    NASA Astrophysics Data System (ADS)

    Zafalon, G. F. D.; Visotaky, J. M. V.; Amorim, A. R.; Valêncio, C. R.; Neves, L. A.; de Souza, R. C. G.; Machado, J. M.

    2015-09-01

    The computational tools to assist genomic analyzes show even more necessary due to fast increasing of data amount available. With high computational costs of deterministic algorithms for sequence alignments, many works concentrate their efforts in the development of heuristic approaches to multiple sequence alignments. However, the selection of an approach, which offers solutions with good biological significance and feasible execution time, is a great challenge. Thus, this work aims to show the parallelization of the processing steps of MSA-GA tool using multithread paradigm in the execution of COFFEE objective function. The standard objective function implemented in the tool is the Weighted Sum of Pairs (WSP), which produces some distortions in the final alignments when sequences sets with low similarity are aligned. Then, in studies previously performed we implemented the COFFEE objective function in the tool to smooth these distortions. Although the nature of COFFEE objective function implies in the increasing of execution time, this approach presents points, which can be executed in parallel. With the improvements implemented in this work, we can verify the execution time of new approach is 24% faster than the sequential approach with COFFEE. Moreover, the COFFEE multithreaded approach is more efficient than WSP, because besides it is slightly fast, its biological results are better.

  20. SVM-BALSA: Remote Homology Detection based on Bayesian Sequence Alignment

    SciTech Connect

    Webb-Robertson, Bobbie-Jo M.; Oehmen, Chris S.; Matzke, Melissa M.

    2005-11-10

    Using biopolymer sequence comparison methods to identify evolutionarily related proteins is one of the most common tasks in bioinformatics. Recently, support vector machines (SVMs) utilizing statistical learning theory have been employed in the problem of remote homology detection and shown to outperform iterative profile methods such as PSI-BLAST. In this study we demonstrate the utilization of a Bayesian alignment score, which accounts for the uncertainty of all possible alignments, in the SVM construction improves sensitivity compared to the traditional dynamic programming implementation.

  1. Comparative Topological Analysis of Neuronal Arbors via Sequence Representation and Alignment

    NASA Astrophysics Data System (ADS)

    Gillette, Todd Aaron

    Neuronal morphology is a key mediator of neuronal function, defining the profile of connectivity and shaping signal integration and propagation. Reconstructing neurite processes is technically challenging and thus data has historically been relatively sparse. Data collection and curation along with more efficient and reliable data production methods provide opportunities for the application of informatics to find new relationships and more effectively explore the field. This dissertation presents a method for aiding the development of data production as well as a novel representation and set of analyses for extracting morphological patterns. The DIADEM Challenge was organized for the purposes of determining the state of the art in automated neuronal reconstruction and what existing challenges remained. As one of the co-organizers of the Challenge, I developed the DIADEM metric, a tool designed to measure the effectiveness of automated reconstruction algorithms by comparing resulting reconstructions to expert-produced gold standards and identifying errors of various types. It has been used in the DIADEM Challenge and in the testing of several algorithms since. Further, this dissertation describes a topological sequence representation of neuronal trees amenable to various forms of sequence analysis, notably motif analysis, global pairwise alignment, clustering, and multiple sequence alignment. Motif analysis of neuronal arbors shows a large difference in bifurcation type proportions between axons and dendrites, but that relatively simple growth mechanisms account for most higher order motifs. Pairwise global alignment of topological sequences, modified from traditional sequence alignment to preserve tree relationships, enabled cluster analysis which displayed strong correspondence with known cell classes by cell type, species, and brain region. Multiple alignment of sequences in selected clusters enabled the extraction of conserved features, revealing mouse

  2. Alignment of Red-Sequence Cluster Dwarf Galaxies: From the Frontier Fields to the Local Universe

    NASA Astrophysics Data System (ADS)

    Barkhouse, Wayne Alan; Archer, Haylee; Burgad, Jaford; Foote, Gregory; Rude, Cody; Lopez-Cruz, Omar

    2015-08-01

    Galaxy clusters are the largest virialized structures in the universe. Due to their high density and mass, they are an excellent laboratory for studying the environmental effects on galaxy evolution. Numerical simulations have predicted that tidal torques acting on dwarf galaxies as they fall into the cluster environment will cause the major axis of the galaxies to align with their radial position vector (a line that extends from the cluster center to the galaxy's center). We have undertaken a study to measure the redshift evolution of the alignment of red-sequence cluster dwarf galaxies based on a sample of 57 low-redshift Abell clusters imaged at KPNO using the 0.9-meter telescope, and 64 clusters from the WINGS dataset. To supplement our low-redshift sample, we have included galaxies selected from the Hubble Space Telescope Frontier fields. Leveraging the HST data allows us to look for evolutionary changes in the alignment of red-sequence cluster dwarf galaxies over a redshift range of 0 < z < 0.35. The alignment of the major axis of the dwarf galaxies is measured by fitting a Sersic function to each red-sequence galaxy using GALFIT. The quality of each model is checked visually after subtracting the model from the galaxy. The cluster sample is then combined by scaling each cluster by r200. We present our preliminary results based on the alignment of the red-sequence dwarf galaxies with: 1) the major axis of the brightest cluster galaxy, 2) the major axis of the cluster defined by the position of cluster members, and 3) a radius vector pointing from the cluster center to individual dwarf galaxies. Our combined cluster sample is sub-divided into different radial regions and redshift bins.

  3. Proteus mirabilis urease: nucleotide sequence determination and comparison with jack bean urease.

    PubMed

    Jones, B D; Mobley, H L

    1989-12-01

    Proteus mirabilis, a common cause of urinary tract infection, produces a potent urease that hydrolyzes urea to NH3 and CO2, initiating kidney stone formation. Urease genes, which were localized to a 7.6-kilobase-pair region of DNA, were sequenced by using the dideoxy method. Six open reading frames were found within a region of 4,952 base pairs which were predicted to encode polypeptides of 31.0 (ureD), 11.0 (ureA), 12.2 (ureB), 61.0 (ureC), 17.9 (ureE), and 23.0 (ureF) kilodaltons (kDa). Each open reading frame was preceded by a ribosome-binding site, with the exception of ureE. Putative promoterlike sequences were identified upstream of ureD, ureA, and ureF. Possible termination sites were found downstream of ureD, ureC, and ureF. Structural subunits of the enzyme were encoded by ureA, ureB, and ureC and were translated from a single transcript in the order of 11.0, 12.2, and 61.0 kDa. When the deduced amino acid sequences of the P. mirabilis urease subunits were compared with the amino acid sequence of the jack bean urease, significant amino acid similarity was observed (58% exact matches; 73% exact plus conservative replacements). The 11.0-kDa polypeptide aligned with the N-terminal residues of the plant enzyme, the 12.2-kDa polypeptide lined up with internal residues, and the 61.0-kDa polypeptide matched with the C-terminal residues, suggesting an evolutionary relationship of the urease genes of jack bean and P. mirabilis. PMID:2687233

  4. Nucleotide sequence encoding the flavoprotein and hydrophobic subunits of the succinate dehydrogenase of Escherichia coli.

    PubMed Central

    Wood, D; Darlison, M G; Wilde, R J; Guest, J R

    1984-01-01

    The nucleotide sequence of a 3614 base-pair segment of DNA containing the sdhA gene, encoding the flavoprotein subunit of succinate dehydrogenase of Escherichia coli, and two genes sdhC and sdhD, encoding small hydrophobic subunits, has been determined. Together with the iron-sulphur protein gene (sdhB) these genes form an operon (sdhCDAB) situated between the citrate synthase gene (gltA) and the 2-oxoglutarate dehydrogenase complex genes (sucAB): gltA-sdhCDAB-sucAB. Transcription of the gltA and sdhCDAB gene appears to diverge from a single intergenic region that contains two pairs of potential promoter sequences and two putative CRP (cyclic AMP receptor protein)-binding sites. The sdhA structural gene comprises 1761 base-pairs (587 codons, excluding the initiation codon, AUG) and it encodes a polypeptide of Mr 64268 that is strikingly homologous with the flavoprotein subunit of fumarate reductase (frdA gene product). The FAD-binding region, including the histidine residue at the FAD-attachment site, has been identified by its homology with other flavoproteins and with the flavopeptide of the bovine heart mitochondrial succinate dehydrogenase. Potential active-site cysteine and histidine residues have also been indicated by the comparisons. The sdhC (384 base-pairs) and sdhD (342 base-pairs) structural genes encode two strongly hydrophobic proteins of Mr 14167 and 12792 respectively. These proteins resemble in size and composition, but not sequence, the membrane anchor proteins of fumarate reductase (the frdC and frdD gene products). PMID:6383359

  5. Complete Nucleotide Sequences and Genome Organization of Two Pepper Mild Mottle Virus Isolates from Capsicum annuum in South Korea

    PubMed Central

    Choi, Seung-Kook; Choi, Gug-Seoun; Kwon, Sun-Jung

    2016-01-01

    The complete genome sequences of pepper mild mottle virus (PMMoV)-P2 and -P3 were determined by the Sanger sequencing method. Although PMMoV-P2 and PMMoV-P3 have different pathogenicity in some pepper cultivars, the complete genome sequences of PMMoV-P2 and -P3 are composed of 6,356 nucleotides (nt). In this study, we report the complete genome sequences and genome organization of PMMoV-P2 and -P3 isolates from pepper species in South Korea. PMID:27198033

  6. Complete Nucleotide Sequences and Genome Organization of Two Pepper Mild Mottle Virus Isolates from Capsicum annuum in South Korea.

    PubMed

    Choi, Seung-Kook; Choi, Gug-Seoun; Kwon, Sun-Jung; Yoon, Ju-Yeon

    2016-01-01

    The complete genome sequences of pepper mild mottle virus (PMMoV)-P2 and -P3 were determined by the Sanger sequencing method. Although PMMoV-P2 and PMMoV-P3 have different pathogenicity in some pepper cultivars, the complete genome sequences of PMMoV-P2 and -P3 are composed of 6,356 nucleotides (nt). In this study, we report the complete genome sequences and genome organization of PMMoV-P2 and -P3 isolates from pepper species in South Korea. PMID:27198033

  7. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  8. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  9. Empirical Comparison of Simple Sequence Repeats and Single Nucleotide Polymorphisms in Assessment of Maize Diversity and Relatedness

    Technology Transfer Automated Retrieval System (TEKTRAN)

    While Simple Sequence Repeats (SSRs) are extremely useful genetic markers, recent advances in technology have produced a shift toward use of single nucleotide polymorphisms (SNPs). The different mutational properties of these two classes of markers result in differences in heterozygosities and allel...

  10. A high-density simple sequence repeat and single nucleotide polymorphism genetic map of the tetraploid cotton genome

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Cotton genome complexity was investigated with a saturated molecular genetic map that combined several sets of microsatellites or simple sequence repeats (SSR) and the first major public set of single nucleotide polymorphism (SNP) markers in cotton genomes (Gossypium spp.), and that was constructed ...

  11. Nucleotide sequence of a predicted diguanylate cyclase unique to egg contaminating Salmonella enteritidis that does not form biofilm.

    Technology Transfer Automated Retrieval System (TEKTRAN)

    This is a nucleotide sequence submitted as bankit1052494 and given the GenBank accession number EU375808A post-review. It will be released to the public in June 2008 in coordination with an ASM abstract presentation. A paper will also be submitted that refers to this accession number. A diguanyla...

  12. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  13. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  14. 37 CFR 1.823 - Requirements for nucleotide and/or amino acid sequences as part of the application.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ... is DNA, RNA, or PRT (protein). If a nucleotide sequence contains both DNA and RNA fragments, the type shall be “DNA.” In addition, the combined DNA/RNA molecule shall be further described in the to feature... combined DNA/RNA” Name/Key Provide appropriate identifier for feature, preferably from WIPO Standard...

  15. Molecular cloning and nucleotide sequence of a transforming gene detected by transfection of chicken B-cell lymphoma DNA

    NASA Astrophysics Data System (ADS)

    Goubin, Gerard; Goldman, Debra S.; Luce, Judith; Neiman, Paul E.; Cooper, Geoffrey M.

    1983-03-01

    A transforming gene detected by transfection of chicken B-cell lymphoma DNA has been isolated by molecular cloning. It is homologous to a conserved family of sequences present in normal chicken and human DNAs but is not related to transforming genes of acutely transforming retroviruses. The nucleotide sequence of the cloned transforming gene suggests that it encodes a protein that is partially homologous to the amino terminus of transferrin and related proteins although only about one tenth the size of transferrin.

  16. Nucleotide sequence of the gene encoding the major subunit of CS3 fimbriae of enterotoxigenic Escherichia coli.

    PubMed Central

    Boylan, M; Smyth, C J; Scott, J R

    1988-01-01

    The complete nucleotide sequence of a 612-base-pair DNA fragment containing the gene for the major fimbrial subunit of CS3 of enterotoxigenic Escherichia coli is presented. A possible promoter region, a ribosome-binding site, and two potential signal peptidase cleavage sites are indicated. Unlike the best-studied fimbrial proteins, the predicted CS3 sequence has no Cys residues. PMID:2903130

  17. Nucleotide sequences of 5S rRNAs from sponge Halichondria japonica and tunicate Halocynthia roretzi and their phylogenetic positions

    PubMed Central

    Komiya, Hiroyuki; Hasegawa, Masami; Takemura, Shosuke

    1983-01-01

    The nucleotide sequences of 5S rRNAs from sponge Halichondria japonica and tunicate Halocynthia roretzi were determined by chemical and enzymatic gel methods. Their phylogenetic positions among metazoans were derived from the 5S rRNA sequences by a computer analysis based on the maximum parsimony principle. It was suggested that the sponge is closely related to several invertebrates and the tunicate has affinity to vertebrates rather than invertebrates. PMID:6835845

  18. Accuracy of sequence alignment and fold assessment using reduced amino acid alphabets.

    PubMed

    Melo, Francisco; Marti-Renom, Marc A

    2006-06-01

    Reduced or simplified amino acid alphabets group the 20 naturally occurring amino acids into a smaller number of representative protein residues. To date, several reduced amino acid alphabets have been proposed, which have been derived and optimized by a variety of methods. The resulting reduced amino acid alphabets have been applied to pattern recognition, generation of consensus sequences from multiple alignments, protein folding, and protein structure prediction. In this work, amino acid substitution matrices and statistical potentials were derived based on several reduced amino acid alphabets and their performance assessed in a large benchmark for the tasks of sequence alignment and fold assessment of protein structure models, using as a reference frame the standard alphabet of 20 amino acids. The results showed that a large reduction in the total number of residue types does not necessarily translate into a significant loss of discriminative power for sequence alignment and fold assessment. Therefore, some definitions of a few residue types are able to encode most of the relevant sequence/structure information that is present in the 20 standard amino acids. Based on these results, we suggest that the use of reduced amino acid alphabets may allow to increasing the accuracy of current substitution matrices and statistical potentials for the prediction of protein structure of remote homologs. PMID:16506243

  19. TCS: a web server for multiple sequence alignment evaluation and phylogenetic reconstruction

    PubMed Central

    Chang, Jia-Ming; Di Tommaso, Paolo; Lefort, Vincent; Gascuel, Olivier; Notredame, Cedric

    2015-01-01

    This article introduces the Transitive Consistency Score (TCS) web server; a service making it possible to estimate the local reliability of protein multiple sequence alignments (MSAs) using the TCS index. The evaluation can be used to identify the aligned positions most likely to contain structurally analogous residues and also most likely to support an accurate phylogenetic reconstruction. The TCS scoring scheme has been shown to be accurate predictor of structural alignment correctness among commonly used methods. It has also been shown to outperform common filtering schemes like Gblocks or trimAl when doing MSA post-processing prior to phylogenetic tree reconstruction. The web server is available from http://tcoffee.crg.cat/tcs. PMID:25855806

  20. Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library

    PubMed Central

    2009-01-01

    Background To enhance capabilities for genomic analyses in rainbow trout, such as genomic selection, a large suite of polymorphic markers that are amenable to high-throughput genotyping protocols must be identified. Expressed Sequence Tags (ESTs) have been used for single nucleotide polymorphism (SNP) discovery in salmonids. In those strategies, the salmonid semi-tetraploid genomes often led to assemblies of paralogous sequences and therefore resulted in a high rate of false positive SNP identification. Sequencing genomic DNA using primers identified from ESTs proved to be an effective but time consuming methodology of SNP identification in rainbow trout, therefore not suitable for high throughput SNP discovery. In this study, we employed a high-throughput strategy that used pyrosequencing technology to generate data from a reduced representation library constructed with genomic DNA pooled from 96 unrelated rainbow trout that represent the National Center for Cool and Cold Water Aquaculture (NCCCWA) broodstock population. Results The reduced representation library consisted of 440 bp fragments resulting from complete digestion with the restriction enzyme HaeIII; sequencing produced 2,000,000 reads providing an average 6 fold coverage of the estimated 150,000 unique genomic restriction fragments (300,000 fragment ends). Three independent data analyses identified 22,022 to 47,128 putative SNPs on 13,140 to 24,627 independent contigs. A set of 384 putative SNPs, randomly selected from the sets produced by the three analyses were genotyped on individual fish to determine the validation rate of putative SNPs among analyses, distinguish apparent SNPs that actually represent paralogous loci in the tetraploid genome, examine Mendelian segregation, and place the validated SNPs on the rainbow trout linkage map. Approximately 48% (183) of the putative SNPs were validated; 167 markers were successfully incorporated into the rainbow trout linkage map. In addition, 2% of the

  1. Gene-based single nucleotide polymorphism discovery in bovine muscle using next-generation transcriptomic sequencing

    PubMed Central

    2013-01-01

    Background Genetic information based on molecular markers has increasingly being used in cattle breeding improvement programmes, as a mean to improve conventionally phenotypic selection. Advances in molecular genetics have led to the identification of several genetic markers associated with genes affecting economic traits. Until recently, the identification of the causative genetic variants involved in the phenotypes of interest has remained a difficult task. The advent of novel sequencing technologies now offers a new opportunity for the identification of such variants. Despite sequencing costs plummeting, sequencing whole-genomes or large targeted regions is still too expensive for most laboratories. A transcriptomic-based sequencing approach offers a cheaper alternative to identify a large number of polymorphisms and possibly to discover causative variants. In the present study, we performed a gene-based single nucleotide polymorphism (SNP) discovery analysis in bovine Longissimus thoraci, using RNA-Seq. To our knowledge, this represents the first study done in bovine muscle. Results Messenger RNAs from Longissimus thoraci from three Limousin bull calves were subjected to high-throughput sequencing. Approximately 36–46 million paired-end reads were obtained per library. A total of 19,752 transcripts were identified and 34,376 different SNPs were detected. Fifty-five percent of the SNPs were found in coding regions and ~22% resulted in an amino acid change. Applying a very stringent SNP quality threshold, we detected 8,407 different high-confidence SNPs, 18% of which are non synonymous coding SNPs. To analyse the accuracy of RNA-Seq technology for SNP detection, 48 SNPs were selected for validation by genotyping. No discrepancies were observed when using the highest SNP probability threshold. To test the usefulness of the identified SNPs, the 48 selected SNPs were assessed by genotyping 93 bovine samples, representing mostly the nine major breeds used in France

  2. Comparison of alignment software for genome-wide bisulphite sequence data

    PubMed Central

    Chatterjee, Aniruddha; Stockwell, Peter A.; Rodger, Euan J.; Morison, Ian M.

    2012-01-01

    Recent advances in next generation sequencing (NGS) technology now provide the opportunity to rapidly interrogate the methylation status of the genome. However, there are challenges in handling and interpretation of the methylation sequence data because of its large volume and the consequences of bisulphite modification. We sequenced reduced representation human genomes on the Illumina platform and efficiently mapped and visualized the data with different pipelines and software packages. We examined three pipelines for aligning bisulphite converted sequencing reads and compared their performance. We also comment on pre-processing and quality control of Illumina data. This comparison highlights differences in methods for NGS data processing and provides guidance to advance sequence-based methylation data analysis for molecular biologists. PMID:22344695

  3. Characteristic features of the nucleotide sequences of yeast mitochondrial ribosomal protein genes as analyzed by computer program GeneMark.

    PubMed

    Isono, K; McIninch, J D; Borodovsky, M

    1994-01-01

    The nucleotide sequence data for yeast mitochondrial ribosomal protein (MRP) genes were analyzed by the computer program GeneMark which predicts the presence of likely genes in sequence data by calculating statistical biases in the appearance of consecutive nucleotides. The program uses a set of standard sequence data for this calculation. We used this program for the analysis of yeast nucleotide sequence data containing MRP genes, hoping to obtain information as to whether they share features in common that are different from other yeast genes. Sequence data sets for ordinary yeast genes and for 27 known MRP genes were used. The MRP genes were nicely predicted as likely genes regardless of the data sets used, whereas other yeast genes were predicted to be likely genes only when the data set for ordinary yeast genes was used. The assembled sequence data for chromosomes II, III, VIII and XI as well as the segmented data for chromosome V were analyzed in a similar manner. In addition to the known MRP genes, eleven ORF's were predicted to be likely MRP genes. Thus, the method seems very powerful in analyzing genes of heterologous origins. PMID:7719921

  4. Enzyme sequence similarity improves the reaction alignment method for cross-species pathway comparison

    SciTech Connect

    Ovacik, Meric A.; Androulakis, Ioannis P.

    2013-09-15

    Pathway-based information has become an important source of information for both establishing evolutionary relationships and understanding the mode of action of a chemical or pharmaceutical among species. Cross-species comparison of pathways can address two broad questions: comparison in order to inform evolutionary relationships and to extrapolate species differences used in a number of different applications including drug and toxicity testing. Cross-species comparison of metabolic pathways is complex as there are multiple features of a pathway that can be modeled and compared. Among the various methods that have been proposed, reaction alignment has emerged as the most successful at predicting phylogenetic relationships based on NCBI taxonomy. We propose an improvement of the reaction alignment method by accounting for sequence similarity in addition to reaction alignment method. Using nine species, including human and some model organisms and test species, we evaluate the standard and improved comparison methods by analyzing glycolysis and citrate cycle pathways conservation. In addition, we demonstrate how organism comparison can be conducted by accounting for the cumulative information retrieved from nine pathways in central metabolism as well as a more complete study involving 36 pathways common in all nine species. Our results indicate that reaction alignment with enzyme sequence similarity results in a more accurate representation of pathway specific cross-species similarities and differences based on NCBI taxonomy.

  5. A Comprehensive Approach to Clustering of Expressed Human Gene Sequence: The Sequence Tag Alignment and Consensus Knowledge Base

    PubMed Central

    Miller, Robert T.; Christoffels, Alan G.; Gopalakrishnan, Chella; Burke, John; Ptitsyn, Andrey A.; Broveak, Tania R.; Hide, Winston A.

    1999-01-01

    The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. The majority of the identified coding portion is in the form of expressed sequence tags (ESTs). The need to discover exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable quality nature of this data delivery. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensi reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs by a factor of 1.86 with respect to the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered release whole body index build of STACK v.2.0 are presented. PMID:10568754

  6. Complete nucleotide sequence of the gene for the specific glycoprotein (gp55) of Friend spleen focus-forming virus.

    PubMed Central

    Amanuma, H; Katori, A; Obata, M; Sagata, N; Ikawa, Y

    1983-01-01

    The complete nucleotide sequence of the gene for the specific glycoprotein (gp55) of the polycythemic strain of Friend spleen focus-forming virus (SFFV) was derived from the cloned SFFV DNA intermediate. The gp55 gene is present within 1.4 kilobases of the 5' side of the 3'long terminal repeat sequence. The open reading frame predicts the primary translation product has a total of 409 amino acids with a Mr of 44,752. Comparisons of the deduced amino acid sequence of gp55 with those of the envelope (env) gene products of murine leukemia viruses (MuLVs) revealed that gp55 is composed of three distinct regions. The amino-terminal 80% of the molecule has a high degree of sequence homology with the amino-terminal portion of the gp70 of the Moloney mink cell focus-forming virus (BALB/Mo-MCFV). This portion of the BALB/Mo-MCFV gp70 is known to be coded for by the acquired xenotropic env-like sequence. The sequence of the following 66 amino acids of gp55 is highly homologous to that of the middle portion of the p15E of Moloney MuLV (Mo-MuLV). The sequence of the Carboxyl-terminal 12 amino acids is specific to gp55 and a comparison of the nucleotide sequence showed that this specific amino acid sequence is due to the presence of seven extra nucleotides compared with the sequence of the Mo-MuLV. PMID:6306650

  7. KMAD: knowledge-based multiple sequence alignment for intrinsically disordered proteins

    PubMed Central

    Lange, Joanna; Wyrwicz, Lucjan S.; Vriend, Gert

    2016-01-01

    Summary: Intrinsically disordered proteins (IDPs) lack tertiary structure and thus differ from globular proteins in terms of their sequence–structure–function relations. IDPs have lower sequence conservation, different types of active sites and a different distribution of functionally important regions, which altogether make their multiple sequence alignment (MSA) difficult. The KMAD MSA software has been written specifically for the alignment and annotation of IDPs. It augments the substitution matrix with knowledge about post-translational modifications, functional domains and short linear motifs. Results: MSAs produced with KMAD describe well-conserved features among IDPs, tend to agree well with biological intuition, and are a good basis for designing new experiments to shed light on this large, understudied class of proteins. Availability and implementation: KMAD web server is accessible at http://www.cmbi.ru.nl/kmad/. A standalone version is freely available. Contact: vriend@cmbi.ru.nl PMID:26568635

  8. A Convex Atomic-Norm Approach to Multiple Sequence Alignment and Motif Discovery

    PubMed Central

    Yen, Ian E. H.; Lin, Xin; Zhang, Jiong; Ravikumar, Pradeep; Dhillon, Inderjit S.

    2016-01-01

    Multiple Sequence Alignment and Motif Discovery, known as NP-hard problems, are two fundamental tasks in Bioinformatics. Existing approaches to these two problems are based on either local search methods such as Expectation Maximization (EM), Gibbs Sampling or greedy heuristic methods. In this work, we develop a convex relaxation approach to both problems based on the recent concept of atomic norm and develop a new algorithm, termed Greedy Direction Method of Multiplier, for solving the convex relaxation with two convex atomic constraints. Experiments show that our convex relaxation approach produces solutions of higher quality than those standard tools widely-used in Bioinformatics community on the Multiple Sequence Alignment and Motif Discovery problems. PMID:27559428

  9. seqphase: a web tool for interconverting phase input/output files and fasta sequence alignments.

    PubMed

    Flot, J-F

    2010-01-01

    The program phase is widely used for Bayesian inference of haplotypes from diploid genotypes; however, manually creating phase input files from sequence alignments is an error-prone and time-consuming process, especially when dealing with numerous variable sites and/or individuals. Here, a web tool called seqphase is presented that generates phase input files from fasta sequence alignments and converts phase output files back into fasta. During the production of the phase input file, several consistency checks are performed on the dataset and suitable command line options to be used for the actual phase data analysis are suggested. seqphase was written in perl and is freely accessible over the Internet at the address http://www.mnhn.fr/jfflot/seqphase. PMID:21565002

  10. Nucleotide sequence and genetic organization of the Bacillus subtilis comG operon.

    PubMed Central

    Albano, M; Breitling, R; Dubnau, D A

    1989-01-01

    A series of Tn917lac insertions define the comG region of the Bacillus subtilis chromosome. comG mutants are deficient in competence and specifically in the binding of exogenous DNA. The genes included in the comG region are first expressed during the transition from the exponential to the stationary growth phase. From nucleotide sequence information, it was concluded that the comG locus contains seven open reading frames (ORFs), several of which overlap at their termini. High-resolution S1 nuclease mapping and primer extension were used to identify the 5' terminus of the comG mRNA. The sequence upstream from the comG start site closely resembled the consensus recognition sequence for the major B. subtilis vegetative RNA polymerase holoenzyme. Complementation analysis confirmed that the comG ORF1 protein is required for the ability of competent cultures to resolve into two populations with different cell densities on Renografin (E. R. Squibb & Sons, Princeton, N.J.) gradients, as well as for full expression of comE, another late competence locus. The predicted comG ORF1 protein showed significant similarity to the virB ORF11 protein from Agrobacterium tumefaciens, which is probably involved in T-DNA transfer. The N-terminal sequences of comG ORF3 and, to a lesser extent, the comG ORF4 and ORF5 proteins were similar to a class of pilin proteins from members of the genera Bacteroides, Pseudomonas, Neisseria, and Moraxella. All of the comG proteins except comG ORF1 possessed hydrophobic domains that were potentially capable of spanning the bacterial membrane. It is likely that these proteins are membrane associated, and they may comprise part of the DNA transport machinery. When present in multiple copies, a DNA fragment carrying the comG promoter was capable of inhibiting the development of competence as well as the expression of several late com genes, suggesting a role for a transcriptional activator in the expression of those genes. Images PMID:2507524

  11. Proteus mirabilis MR/P fimbrial operon: genetic organization, nucleotide sequence, and conditions for expression.

    PubMed Central

    Bahrani, F K; Mobley, H L

    1994-01-01

    Proteus mirabilis, an agent of urinary tract infection, expresses at least four fimbrial types. Among these are the MR/P (mannose-resistant/Proteus-like) fimbriae. MrpA, the structural subunit, is optimally expressed at 37 degrees C in Luria broth cultured statically for 48 h by each of seven strains examined. Genes encoding this fimbria were isolated, and the complete nucleotide sequence was determined. The mrp gene cluster encoded by 7,293 bp predicts eight polypeptides: MrpI (22,133 Da), MrpA (17,909 Da), MrpB (19,632 Da), MrpC (96,823 Da), MrpD (27,886 Da), MrpE (19,470 Da), MrpF (17,363 Da), and MrpG (13,169 Da). mrpI is upstream of the gene encoding the major structural subunit gene mrpA and is transcribed in the direction opposite to that of the rest of the operon. All predicted polypeptides share > or = 25% amino acid identity with at least one other enteric fimbrial gene product encoded by the pap, fim, smf, fan, or mrk gene clusters. Images PMID:7910820

  12. Pediococcus acidilactici ldhD gene: cloning, nucleotide sequence, and transcriptional analysis.

    PubMed Central

    Garmyn, D; Ferain, T; Bernard, N; Hols, P; Delplace, B; Delcour, J

    1995-01-01

    The gene encoding D-lactate dehydrogenase was isolated on a 2.9-kb insert from a library of Pediococcus acidilactici DNA by complementation for growth under anaerobiosis of an Escherichia coli lactate dehydrogenase and pyruvate-formate lyase double mutant. The nucleotide sequence of ldhD encodes a protein of 331 amino acids (predicted molecular mass of 37,210 Da) which shows similarity to the family of D-2-hydroxyacid dehydrogenases. The enzyme encoded by the cloned fragment is equally active on pyruvate and hydroxypyruvate, indicating that the enzyme has both D-lactate and D-glycerate dehydrogenase activities. Three other open reading frames were found in the 2.9-kb insert, one of which (rpsB) is highly similar to bacterial genes coding for ribosomal protein S2. Northern (RNA) blotting analyses indicated the presence of a 2-kb dicistronic transcript of ldhD (a metabolic gene) and rpsB (a putative ribosomal protein gene) together with a 1-kb monocistronic rpsB mRNA. These transcripts are abundant in the early phase of exponential growth but steadily fade away to disappear in the stationary phase. Primer extension analysis identified two distinct promoters driving either cotranscription of ldhD and rpsB or transcription of rpsB alone. PMID:7539419

  13. Whole-genome sequencing identifies genomic heterogeneity at a nucleotide and chromosomal level in bladder cancer

    PubMed Central

    Morrison, Carl D.; Liu, Pengyuan; Woloszynska-Read, Anna; Zhang, Jianmin; Luo, Wei; Qin, Maochun; Bshara, Wiam; Conroy, Jeffrey M.; Sabatini, Linda; Vedell, Peter; Xiong, Donghai; Liu, Song; Wang, Jianmin; Shen, He; Li, Yinwei; Omilian, Angela R.; Hill, Annette; Head, Karen; Guru, Khurshid; Kunnev, Dimiter; Leach, Robert; Eng, Kevin H.; Darlak, Christopher; Hoeflich, Christopher; Veeranki, Srividya; Glenn, Sean; You, Ming; Pruitt, Steven C.; Johnson, Candace S.; Trump, Donald L.

    2014-01-01

    Using complete genome analysis, we sequenced five bladder tumors accrued from patients with muscle-invasive transitional cell carcinoma of the urinary bladder (TCC-UB) and identified a spectrum of genomic aberrations. In three tumors, complex genotype changes were noted. All three had tumor protein p53 mutations and a relatively large number of single-nucleotide variants (SNVs; average of 11.2 per megabase), structural variants (SVs; average of 46), or both. This group was best characterized by chromothripsis and the presence of subclonal populations of neoplastic cells or intratumoral mutational heterogeneity. Here, we provide evidence that the process of chromothripsis in TCC-UB is mediated by nonhomologous end-joining using kilobase, rather than megabase, fragments of DNA, which we refer to as “stitchers,” to repair this process. We postulate that a potential unifying theme among tumors with the more complex genotype group is a defective replication–licensing complex. A second group (two bladder tumors) had no chromothripsis, and a simpler genotype, WT tumor protein p53, had relatively few SNVs (average of 5.9 per megabase) and only a single SV. There was no evidence of a subclonal population of neoplastic cells. In this group, we used a preclinical model of bladder carcinoma cell lines to study a unique SV (translocation and amplification) of the gene glutamate receptor ionotropic N-methyl D-aspertate as a potential new therapeutic target in bladder cancer. PMID:24469795

  14. New features in the genus Ilarvirus revealed by the nucleotide sequence of Fragaria chiloensis latent virus.

    PubMed

    Tzanetakis, Ioannis E; Martin, Robert R

    2005-09-01

    Fragaria chiloensis latent virus (FClLV), a member of the genus Ilarvirus was first identified in the early 1990s. Double-stranded RNA was extracted from FClLV infected plants and cloned. The complete nucleotide sequence of the virus has been elucidated. RNA 1 encodes a protein with methyltransferase and helicase enzymatic motifs while RNA 2 encodes the viral RNA dependent RNA polymerase and an ORF, that shares no homology with other Ilarvirus genes. RNA 3 codes for movement and coat proteins and an additional ORF, making FClLV possibly the first Ilarvirus encoding a third protein in RNA 3. Phylogenetic analysis reveals that FClLV is most closely related to Prune dwarf virus, the type member of subgroup 4 of the Ilarvirus genus. FClLV is also closely related to Alfalfa mosaic virus (AlMV), a virus that shares many properties with Ilarviruses . We propose the reclassification of AlMV as a member of the Ilarvirus genus instead of being a member of a distinct genus. PMID:15878214

  15. Postzygotic single-nucleotide mosaicisms in whole-genome sequences of clinically unremarkable individuals

    PubMed Central

    Huang, August Y; Xu, Xiaojing; Ye, Adam Y; Wu, Qixi; Yan, Linlin; Zhao, Boxun; Yang, Xiaoxu; He, Yao; Wang, Sheng; Zhang, Zheng; Gu, Bowen; Zhao, Han-Qing; Wang, Meng; Gao, Hua; Gao, Ge; Zhang, Zhichao; Yang, Xiaoling; Wu, Xiru; Zhang, Yuehua; Wei, Liping

    2014-01-01

    Postzygotic single-nucleotide mutations (pSNMs) have been studied in cancer and a few other overgrowth human disorders at whole-genome scale and found to play critical roles. However, in clinically unremarkable individuals, pSNMs have never been identified at whole-genome scale largely due to technical difficulties and lack of matched control tissue samples, and thus the genome-wide characteristics of pSNMs remain unknown. We developed a new Bayesian-based mosaic genotyper and a series of effective error filters, using which we were able to identify 17 SNM sites from ∼80× whole-genome sequencing of peripheral blood DNAs from three clinically unremarkable adults. The pSNMs were thoroughly validated using pyrosequencing, Sanger sequencing of individual cloned fragments, and multiplex ligation-dependent probe amplification. The mutant allele fraction ranged from 5%-31%. We found that C→T and C→A were the predominant types of postzygotic mutations, similar to the somatic mutation profile in tumor tissues. Simulation data showed that the overall mutation rate was an order of magnitude lower than that in cancer. We detected varied allele fractions of the pSNMs among multiple samples obtained from the same individuals, including blood, saliva, hair follicle, buccal mucosa, urine, and semen samples, indicating that pSNMs could affect multiple sources of somatic cells as well as germ cells. Two of the adults have children who were diagnosed with Dravet syndrome. We identified two non-synonymous pSNMs in SCN1A, a causal gene for Dravet syndrome, from these two unrelated adults and found that the mutant alleles were transmitted to their children, highlighting the clinical importance of detecting pSNMs in genetic counseling. PMID:25312340

  16. Nucleotide sequences and genetic analysis of hydrogen oxidation (hox) genes in Azotobacter vinelandii.

    PubMed Central

    Menon, A L; Mortenson, L E; Robson, R L

    1992-01-01

    Azotobacter vinelandii contains a heterodimeric, membrane-bound [NiFe]hydrogenase capable of catalyzing the reversible oxidation of H2. The beta and alpha subunits of the enzyme are encoded by the structural genes hoxK and hoxG, respectively, which appear to form part of an operon that contains at least one further potential gene (open reading frame 3 [ORF3]). In this study, determination of the nucleotide sequence of a region of 2,344 bp downstream of ORF3 revealed four additional closely spaced or overlapping ORFs. These ORFs, ORF4 through ORF7, potentially encode polypeptides with predicted masses of 22.8, 11.4, 16.3, and 31 kDa, respectively. Mutagenesis of the chromosome of A. vinelandii in the area sequenced was carried out by introduction of antibiotic resistance gene cassettes. Disruption of hoxK and hoxG by a kanamycin resistance gene abolished whole-cell hydrogenase activity coupled to O2 and led to loss of the hydrogenase alpha subunit. Insertional mutagenesis of ORF3 through ORF7 with a promoterless lacZ-Kmr cassette established that the region is transcriptionally active and involved in H2 oxidation. We propose to call ORF3 through ORF7 hoxZ, hoxM, hoxL, hoxO, and hoxQ, respectively. The predicted hox gene products resemble those encoded by genes from hydrogenase-related operons in other bacteria, including Escherichia coli and Alcaligenes eutrophus. Images PMID:1624446

  17. RBT-GA: a novel metaheuristic for solving the multiple sequence alignment problem

    PubMed Central

    Taheri, Javid; Zomaya, Albert Y

    2009-01-01

    Background Multiple Sequence Alignment (MSA) has always been an active area of research in Bioinformatics. MSA is mainly focused on discovering biologically meaningful relationships among different sequences or proteins in order to investigate the underlying main characteristics/functions. This information is also used to generate phylogenetic trees. Results This paper presents a novel approach, namely RBT-GA, to solve the MSA problem using a hybrid solution methodology combining the Rubber Band Technique (RBT) and the Genetic Algorithm (GA) metaheuristic. RBT is inspired by the behavior of an elastic Rubber Band (RB) on a plate with several poles, which is analogues to locations in the input sequences that could potentially be biologically related. A GA attempts to mimic the evolutionary processes of life in order to locate optimal solutions in an often very complex landscape. RBT-GA is a population based optimization algorithm designed to find the optimal alignment for a set of input protein sequences. In this novel technique, each alignment answer is modeled as a chromosome consisting of several poles in the RBT framework. These poles resemble locations in the input sequences that are most likely to be correlated and/or biologically related. A GA-based optimization process improves these chromosomes gradually yielding a set of mostly optimal answers for the MSA problem. Conclusion RBT-GA is tested with one of the well-known benchmarks suites (BALiBASE 2.0) in this area. The obtained results show that the superiority of the proposed technique even in the case of formidable sequences. PMID:19594869

  18. Nucleotide sequence of the fadR gene, a multifunctional regulator of fatty acid metabolism in Escherichia coli.

    PubMed Central

    DiRusso, C C

    1988-01-01

    The Escherichia coli fadR gene is a multifunctional regulator of fatty acid and acetate metabolism. In the present work the nucleotide sequence of the 1.3 kb DNA fragment which encodes FadR has been determined. The coding sequence of the fadR gene is 714 nucleotides long and is preceded by a typical E. coli ribosome binding site and is followed by a sequence predicted to be sufficient for factor-independent chain termination. Primer extension experiments demonstrated that the transcription of the fadR gene initiates with an adenine nucleotide 33 nucleotides upstream from the predicted start of translation. The derived fadR peptide has a calculated molecular weight of 26,972. This is in reasonable agreement with the apparent molecular weight of 29,000 previously estimated on the basis of maxi-cell analysis of plasmid encoded proteins. There is a segment of twenty amino acids within the predicted peptide which resembles the DNA recognition and binding site of many transcriptional regulatory proteins. Images PMID:2843809

  19. T box transcription antitermination riboswitch: Influence of nucleotide sequence and orientation on tRNA binding by the antiterminator element

    PubMed Central

    Fauzi, Hamid; Agyeman, Akwasi; Hines, Jennifer V.

    2008-01-01

    Many bacteria utilize riboswitch transcription regulation to monitor and appropriately respond to cellular levels of important metabolites or effector molecules. The T box transcription antitermination riboswitch responds to cognate uncharged tRNA by specifically stabilizing an antiterminator element in the 5′-untranslated mRNA leader region and precluding formation of a thermodynamically more stable terminator element. Stabilization occurs when the tRNA acceptor end base pairs with the first four nucleotides in the seven nucleotide bulge of the highly conserved antiterminator element. The significance of the conservation of the antiterminator bulge nucleotides that do not base pair with the tRNA is unknown, but they are required for optimal function. In vitro selection was used to determine if the isolated antiterminator bulge context alone dictates the mode in which the tRNA acceptor end binds the bulge nucleotides. No sequence conservation beyond complementarity was observed and the location was not constrained to the first four bases of the bulge. The results indicate that formation of a structure that recognizes the tRNA acceptor end in isolation is not the determinant driving force for the high phylogenetic sequence conservation observed within the antiterminator bulge. Additional factors or T box leader features more likely influenced the phylogenetic sequence conservation. PMID:19152843

  20. JAR3D Webserver: Scoring and aligning RNA loop sequences to known 3D motifs.

    PubMed

    Roll, James; Zirbel, Craig L; Sweeney, Blake; Petrov, Anton I; Leontis, Neocles

    2016-07-01

    Many non-coding RNAs have been identified and may function by forming 2D and 3D structures. RNA hairpin and internal loops are often represented as unstructured on secondary structure diagrams, but RNA 3D structures show that most such loops are structured by non-Watson-Crick basepairs and base stacking. Moreover, different RNA sequences can form the same RNA 3D motif. JAR3D finds possible 3D geometries for hairpin and internal loops by matching loop sequences to motif groups from the RNA 3D Motif Atlas, by exact sequence match when possible, and by probabilistic scoring and edit distance for novel sequences. The scoring gauges the ability of the sequences to form the same pattern of interactions observed in 3D structures of the motif. The JAR3D webserver at http://rna.bgsu.edu/jar3d/ takes one or many sequences of a single loop as input, or else one or many sequences of longer RNAs with multiple loops. Each sequence is scored against all current motif groups. The output shows the ten best-matching motif groups. Users can align input sequences to each of the motif groups found by JAR3D. JAR3D will be updated with every release of the RNA 3D Motif Atlas, and so its performance is expected to improve over time. PMID:27235417

  1. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.

    PubMed

    Bonham-Carter, Oliver; Steele, Joe; Bastola, Dhundy

    2014-11-01

    Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression. PMID:23904502

  2. Finite-temperature local protein sequence alignment: percolation and free-energy distribution.

    PubMed

    Wolfsheimer, S; Melchert, O; Hartmann, A K

    2009-12-01

    Sequence alignment is a tool in bioinformatics that is used to find homological relationships in large molecular databases. It can be mapped on the physical model of directed polymers in random media. We consider the finite-temperature version of local sequence alignment for proteins and study the transition between the linear phase and the biologically relevant logarithmic phase, where the free energy grows linearly or logarithmically with the sequence length. By means of numerical simulations and finite-size-scaling analysis, we determine the phase diagram in the plane that is spanned by the gap costs and the temperature. We use the most frequently used parameter set for protein alignment. The critical exponents that describe the parameter-driven transition are found to be explicitly temperature dependent. Furthermore, we study the shape of the (free-) energy distribution close to the transition by rare-event simulations down to probabilities on the order 10(-64). It is well known that in the logarithmic region, the optimal score distribution (T=0) is described by a modified Gumbel distribution. We confirm that this also applies for the free-energy distribution (T>0). However, in the linear phase, the distribution crosses over to a modified Gaussian distribution. PMID:20365196

  3. Nucleotide Sequence Evolution at the κ-Casein Locus: Evidence for Positive Selection within the Family Bovidae

    PubMed Central

    Ward, T. J.; Honeycutt, R. L.; Derr, J. N.

    1997-01-01

    κ-Casein is a mammalian milk protein involved in a number of important physiological processes. In the gut, the ingested protein is split into an insoluble peptide (para κ-casein) and a soluble hydrophilic glycopeptide (caseinomacropeptide). Caseinomacropeptide is responsible for increased efficiency of digestion, prevention of neonate hypersensitivity to ingested proteins, and inhibition of gastric pathogens. Variation within this peptide has significant effects associated with important traits such as milk production. The nucleotide sequences for regions of κ-casein exon and intron four were determined for representatives of the artiodactyl family Bovidae. The pattern of nucleotide substitution in κ-casein sequences for distantly related bovid taxa demonstrates that positive selection has accelerated their divergence at the amino acid sequence level. This selection has differentially influenced the molecular evolution of the two κ-casein split peptides and is focused within a 34-codon region of caseinomacropeptide. PMID:9409842

  4. Nucleotide sequence analysis of Aleutian mink disease parvovirus shows that multiple virus types are present in infected mink.

    PubMed Central

    Gottschalck, E; Alexandersen, S; Cohn, A; Poulsen, L A; Bloom, M E; Aasted, B

    1991-01-01

    Different isolates of Aleutian mink disease parvovirus (ADV) were cloned and nucleotide sequenced. Analysis of individual clones from two in vivo-derived isolates of high virulence indicated that more than one type of ADV DNA were present in each of these isolates. Analysis of several clones from two preparations of a cell culture-adapted isolate of low virulence showed the presence of only one type of ADV DNA. We also describe the nucleotide sequence from map units 44 to 88 of a new type of ADV DNA. The new type of ADV DNA is compared with the previously published ADV sequences, to which it shows 95% homology. These findings indicate that ADV, a single-stranded DNA virus, has a considerable degree of variability and that several virus types can be present simultaneously in an infected animal. PMID:1649336

  5. Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé

    PubMed Central

    Liu, Kevin; Warnow, Tandy

    2016-01-01

    SATé is a method for estimating multiple sequence alignments and trees that has been shown to produce highly accurate results for datasets with large numbers of sequences. Running SATé using its default settings is very simple, but improved accuracy can be obtained by modifying its algorithmic parameters. We provide a detailed introduction to the algorithmic approach used by SATé, and instructions for running a SATé analysis using the GUI under default settings. We also provide a discussion of how to modify these settings to obtain improved results, and how to use SATé in a phylogenetic analysis pipeline. PMID:24170406

  6. Identification of Anoectochilus based on rDNA ITS sequences alignment and SELDI-TOF-MS.

    PubMed

    Gao, Chuan; Zhang, Fusheng; Zhang, Jun; Guo, Shunxing; Shao, Hongbo

    2009-01-01

    The internal transcribed spacer (ITS) sequences alignment and proteomic difference of Anoectochilus interspecies have been studied by means of ITS molecular identification and surface enhanced laser desorption ionization time of flight mass spectrography. Results showed that variety certification on Anoectochilus by ITS sequences can not determine species, and there is proteomic difference among Anoectochilus interspecies. Moreover, proteomic finger printings of five Anoectochilus species have been established for identifying species, and genetic relationships of five species within Anoectochilus have been deduced according to proteomic differences among five species. PMID:20016748

  7. Identification of Anoectochilus based on rDNA ITS sequences alignment and SELDI-TOF-MS

    PubMed Central

    Gao, Chuan; Zhang, Fusheng; Zhang, Jun; Guo, Shunxing; Shao, Hongbo

    2009-01-01

    The internal transcribed spacer (ITS) sequences alignment and proteomic difference of Anoectochilus interspecies have been studied by means of ITS molecular identification and surface enhanced laser desorption ionization time of flight mass spectrography. Results showed that variety certification on Anoectochilus by ITS sequences can not determine species, and there is proteomic difference among Anoectochilus interspecies. Moreover, proteomic finger printings of five Anoectochilus species have been established for identifying species, and genetic relationships of five species within Anoectochilus have been deduced according to proteomic differences among five species. PMID:20016748

  8. Genomic signal processing methods for computation of alignment-free distances from DNA sequences.

    PubMed

    Borrayo, Ernesto; Mendizabal-Ruiz, E Gerardo; Vélez-Pérez, Hugo; Romo-Vázquez, Rebeca; Mendizabal, Adriana P; Morales, J Alejandro

    2014-01-01

    Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments. PMID:25393409

  9. Genomic Signal Processing Methods for Computation of Alignment-Free Distances from DNA Sequences

    PubMed Central

    Borrayo, Ernesto; Mendizabal-Ruiz, E. Gerardo; Vélez-Pérez, Hugo; Romo-Vázquez, Rebeca; Mendizabal, Adriana P.; Morales, J. Alejandro

    2014-01-01

    Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments. PMID:25393409

  10. Nucleotide sequence of a chickpea chlorotic stunt virus relative that infects pea and faba bean in China.

    PubMed

    Zhou, Cui-Ji; Xiang, Hai-Ying; Zhuo, Tao; Li, Da-Wei; Yu, Jia-Lin; Han, Cheng-Gui

    2012-07-01

    We determined the genome sequence of a new polerovirus that infects field pea and faba bean in China. Its entire nucleotide sequence (6021 nt) was most closely related (83.3% identity) to that of an Ethiopian isolate of chickpea chlorotic stunt virus (CpCSV-Eth). With the exception of the coat protein (encoded by ORF3), amino acid sequence identities of all gene products of this virus to those of CpCSV-Eth and other poleroviruses were <90%. This suggests that it is a new member of the genus Polerovirus, and the name pea mild chlorosis virus is proposed. PMID:22476900

  11. Nucleotide sequence of a cluster of early and late genes in a conserved segment of the vaccinia virus genome.

    PubMed Central

    Plucienniczak, A; Schroeder, E; Zettlmeissl, G; Streeck, R E

    1985-01-01

    The nucleotide sequence of a 7.6 kb vaccinia DNA segment from a genomic region conserved among different orthopox virus has been determined. This segment contains a tight cluster of 12 partly overlapping open reading frames most of which can be correlated with previously identified early and late proteins and mRNAs. Regulatory signals used by vaccinia virus have been studied. Presumptive promoter regions are rich in A, T and carry the consensus sequences TATA and AATAA spaced at 20-24 base pairs. Tandem repeats of a CTATTC consensus sequence are proposed to be involved in the termination of early transcription. PMID:2987815

  12. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... Director of the Federal Register in accordance with 5 U.S.C. 552(a) and 1 CFR part 51. Copies of WIPO... 37 Patents, Trademarks, and Copyrights 1 2010-07-01 2010-07-01 false Nucleotide and/or amino acid... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid...

  13. 37 CFR 1.821 - Nucleotide and/or amino acid sequence disclosures in patent applications.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... Director of the Federal Register in accordance with 5 U.S.C. 552(a) and 1 CFR part 51. Copies of WIPO... 37 Patents, Trademarks, and Copyrights 1 2011-07-01 2011-07-01 false Nucleotide and/or amino acid... Biotechnology Invention Disclosures Application Disclosures Containing Nucleotide And/or Amino Acid...

  14. Nucleotide sequence of a complementary DNA encoding pea cytosolic copper/zinc superoxide dismutase. [Pisum sativum L

    SciTech Connect

    White, D.A.; Zilinskas, B.A. )

    1991-08-01

    The authors now report the nucleotide sequence of the cytosolic Cu/Zn SOD cloned from a {lambda}gt11 cDNA library constructed from mRNA extracted from leaves of 7- to 10-d pea seedlings (Pisum sativum L.). The clone was isolated using a 22-base synthetic oligonucleotide complementary to the amino acid sequence CGIIGLQG. This sequence, found at the protein's carboxy terminus, is highly conserved among plant cytosolic Cu/Zn SODs but not chloroplastic Cu/Zn SODs. The 738-base pair sequence contains an open reading frame specifying 152 codons and a predicted M{sub r} of 18,024 D. The deduced amino acid sequence is highly homologous (79-82% identity) with the sequences of other known plant cytosolic Cu/Zn SODs but less highly conserved (63-65%) when compared with several chloroplastic Cu/Zn SODs including pea (10).

  15. Multiple Amino Acid Sequence Alignment Nitrogenase Component 1: Insights into Phylogenetics and Structure-Function Relationships

    PubMed Central

    Howard, James B.; Kechris, Katerina J.; Rees, Douglas C.; Glazer, Alexander N.

    2013-01-01

    Amino acid residues critical for a protein's structure-function are retained by natural selection and these residues are identified by the level of variance in co-aligned homologous protein sequences. The relevant residues in the nitrogen fixation Component 1 α- and β-subunits were identified by the alignment of 95 protein sequences. Proteins were included from species encompassing multiple microbial phyla and diverse ecological niches as well as the nitrogen fixation genotypes, anf, nif, and vnf, which encode proteins associated with cofactors differing at one metal site. After adjusting for differences in sequence length, insertions, and deletions, the remaining >85% of the sequence co-aligned the subunits from the three genotypes. Six Groups, designated Anf, Vnf , and Nif I-IV, were assigned based upon genetic origin, sequence adjustments, and conserved residues. Both subunits subdivided into the same groups. Invariant and single variant residues were identified and were defined as “core” for nitrogenase function. Three species in Group Nif-III, Candidatus Desulforudis audaxviator, Desulfotomaculum kuznetsovii, and Thermodesulfatator indicus, were found to have a seleno-cysteine that replaces one cysteinyl ligand of the 8Fe:7S, P-cluster. Subsets of invariant residues, limited to individual groups, were identified; these unique residues help identify the gene of origin (anf, nif, or vnf) yet should not be considered diagnostic of the metal content of associated cofactors. Fourteen of the 19 residues that compose the cofactor pocket are invariant or single variant; the other five residues are highly variable but do not correlate with the putative metal content of the cofactor. The variable residues are clustered on one side of the cofactor, away from other functional centers in the three dimensional structure. Many of the invariant and single variant residues were not previously recognized as potentially critical and their identification provides the bases

  16. Sequence-Specific Incorporation of Enzyme-Nucleotide Chimera by DNA Polymerases.

    PubMed

    Welter, Moritz; Verga, Daniela; Marx, Andreas

    2016-08-16

    DNA polymerases select the right nucleotide for the growing polynucleotide chain based on the shape and geometry of the nascent nucleotide pairs and thereby ensure high DNA replication selectivity. High-fidelity DNA polymerases are believed to possess tight active sites that allow little deviation from the canonical structures. However, DNA polymerases are known to use nucleotides with small modifications as substrates, which is key for numerous core biotechnology applications. We show that even high-fidelity DNA polymerases are capable of efficiently using nucleotide chimera modified with a large protein like horseradish peroxidase as substrates for template-dependent DNA synthesis, despite this "cargo" being more than 100-fold larger than the natural substrates. We exploited this capability for the development of systems that enable naked-eye detection of DNA and RNA at single nucleotide resolution. PMID:27392211

  17. Alignment-free analysis of barcode sequences by means of compression-based methods

    PubMed Central

    2013-01-01

    Background The key idea of DNA barcode initiative is to identify, for each group of species belonging to different kingdoms of life, a short DNA sequence that can act as a true taxon barcode. DNA barcode represents a valuable type of information that can be integrated with ecological, genetic, and morphological data in order to obtain a more consistent taxonomy. Recent studies have shown that, for the animal kingdom, the mitochondrial gene cytochrome c oxidase I (COI), about 650 bp long, can be used as a barcode sequence for identification and taxonomic purposes of animals. In the present work we aims at introducing the use of an alignment-free approach in order to make taxonomic analysis of barcode sequences. Our approach is based on the use of two compression-based versions of non-computable Universal Similarity Metric (USM) class of distances. Our purpose is to justify the employ of USM also for the analysis of short DNA barcode sequences, showing how USM is able to correctly extract taxonomic information among those kind of sequences. Results We downloaded from Barcode of Life Data System (BOLD) database 30 datasets of barcode sequences belonging to different animal species. We built phylogenetic trees of every dataset, according to compression-based and classic evolutionary methods, and compared them in terms of topology preservation. In the experimental tests, we obtained scores with a percentage of similarity between evolutionary and compression-based trees between 80% and 100% for the most of datasets (94%). Moreover we carried out experimental tests using simulated barcode datasets composed of 100, 150, 200 and 500 sequences, each simulation replicated 25-fold. In this case, mean similarity scores between evolutionary and compression-based trees span between 83% and 99% for all simulated datasets. Conclusions In the present work we aims at introducing the use of an alignment-free approach in order to make taxonomic analysis of barcode sequences. Our

  18. Nucleotide Sequence and Genetic Structure of a Novel Carbaryl Hydrolase Gene (cehA) from Rhizobium sp. Strain AC100

    PubMed Central

    Hashimoto, Masayuki; Fukui, Mitsuru; Hayano, Kouichi; Hayatsu, Masahito

    2002-01-01

    Rhizobium sp. strain AC100, which is capable of degrading carbaryl (1-naphthyl-N-methylcarbamate), was isolated from soil treated with carbaryl. This bacterium hydrolyzed carbaryl to 1-naphthol and methylamine. Carbaryl hydrolase from the strain was purified to homogeneity, and its N-terminal sequence, molecular mass (82 kDa), and enzymatic properties were determined. The purified enzyme hydrolyzed 1-naphthyl acetate and 4-nitrophenyl acetate indicating that the enzyme is an esterase. We then cloned the carbaryl hydrolase gene (cehA) from the plasmid DNA of the strain and determined the nucleotide sequence of the 10-kb region containing cehA. No homologous sequences were found by a database homology search using the nucleotide and deduced amino acid sequences of the cehA gene. Six open reading frames including the cehA gene were found in the 10-kb region, and sequencing analysis shows that the cehA gene is flanked by two copies of insertion sequence-like sequence, suggesting that it makes part of a composite transposon. PMID:11872471

  19. DNA sequencing by synthesis using 3′-O-azidomethyl nucleotide reversible terminators and surface-enhanced Raman spectroscopic detection

    PubMed Central

    Palla, Mirkó; Guo, Wenjing; Shi, Shundi; Li, Zengmin; Wu, Jian; Jockusch, Steffen; Guo, Cheng; Russo, James J.; Turro, Nicholas J.; Ju, Jingyue

    2014-01-01

    As an alternative to fluorescence-based DNA sequencing by synthesis (SBS), we report here an approach using an azido moiety (N3) that has an intense, narrow and unique Raman shift at 2125 cm−1, where virtually all biological molecules are transparent, as a label for SBS. We first demonstrated that the four 3′-O-azidomethyl nucleotide reversible terminators (3′-O-azidomethyl-dNTPs) displayed surface enhanced Raman scattering (SERS) at 2125 cm−1. Using these 4 nucleotide analogues as substrates, we then performed a complete 4-step SBS reaction. We used SERS to monitor the appearance of the azide-specific Raman peak at 2125 cm−1 as a result of polymerase extension by a single 3′-O-azidomethyl-dNTP into the growing DNA strand and disappearance of this Raman peak with cleavage of the azido label to permit the next nucleotide incorporation, thereby continuously determining the DNA sequence. Due to the small size of the azido label, the 3′-O-azidomethyl-dNTPs are efficient substrates for the DNA polymerase. In the SBS cycles, the natural nucleotides are restored after each incorporation and cleavage, producing a growing DNA strand that bears no modifications and will not impede further polymerase reactions. Thus, with further improvements in SERS for the azido moiety, this approach has the potential to provide an attractive alternative to fluorescence-based SBS. PMID:25396047

  20. Estimates of Gene Flow in Drosophila Pseudoobscura Determined from Nucleotide Sequence Analysis of the Alcohol Dehydrogenase Region

    PubMed Central

    Schaeffer, S. W.; Miller, E. L.

    1992-01-01

    The genetic structure of Drosophila pseudoobscura populations was inferred from a nucleotide sequence analysis of a 3.4-kb segment of the alcohol dehydrogenase (Adh) region. A total of 99 isochromosomal strains collected from 13 populations in North and South America were used to determine if any population departed from a neutral model and to estimate levels of gene flow between populations. This study also included the nucleotide sequences from two sibling species, D. persimilis and D. miranda. We estimated the neutral mutation parameter, 4Nμ, in synonymous and noncoding sites for 17 subregions of Adh in each of nine populations with sample sizes greater than three. The nucleotide diversity data in the nine populations was tested for departures from an equilibrium neutral model with two statistical tests. The Tajima and the Hudson, Kreitman, Aguade tests showed that each population fails to reject a neutral model. Tests for genetic differentiation between populations fail to show any population substructure among the North American populations of D. pseudoobscura. The nucleotide diversity data is consistent with direct and indirect measures of gene flow that show extensive dispersal between populations of D. pseudoobscura. PMID:1427038

  1. Characterization and Nucleotide Sequence of the Cryptic Cel Operon of Escherichia Coli K12

    PubMed Central

    Parker, L. L.; Hall, B. G.

    1990-01-01

    Wild-type Escherichia coli are not able to utilize β-glucoside sugars because the genes for utilization of these sugars are cryptic. Spontaneous mutations in the cel operon allow its expression and enable the organism to ferment cellobiose, arbutin and salicin. In this report we describe the structure and nucleotide sequence of the cel operon. The cel operon consists of five genes: celA, whose function is unknown; celB and celC which encode phosphoenolpyruvate-dependent phosphotransferase system enzyme II(cel) and enzyme III(cel), respectively, for the transport and phosphorylation of β-glucoside sugars; celD, which encodes a negative regulatory protein; and celF, which encodes a phospho-β-glucosidase that acts on phosphorylated cellobiose, arbutin and salicin. The mutationally activated cel operon is induced in the presence of its substrates, and is repressed in their absence. A comparison of proteins encoded by the cel operon with functionally equivalent proteins of the bgl operon, another cryptic E. coli gene system responsible for the catabolism of β-glucoside sugars, revealed no significant homology between these two systems despite common functional characteristics. The celD and celF encoded repressor and phospho-β-glucosidase proteins are homologous to the melibiose regulatory protein and to the melA encoded α-galactosidase of E. coli, respectively. Furthermore, the celC encoded PEP-dependent phosphotransferase system enzyme III(cel) is strikingly homologous to an enzyme III(lac) of the Gram-positive organism Staphylococcus aureus. We conclude that the genes for these two enzyme IIIs diverged much more recently than did their hosts, indicating that E. coli and S. aureus have undergone relatively recent exchange of chromosomal genes. PMID:2179047

  2. Nucleotide sequence and mutational analysis of the vnfENX region of Azotobacter vinelandii.

    PubMed Central

    Wolfinger, E D; Bishop, P E

    1991-01-01

    The nucleotide sequence (3,600 bp) of a second copy of nifENX-like genes in Azotobacter vinelandii has been determined. These genes are located immediately downstream from vnfA and have been designated vnfENX. The vnfENX genes appear to be organized as a single transcriptional unit that is preceded by a potential RpoN-dependent promoter. While the nifEN genes are thought to be evolutionarily related to nifDK, the vnfEN genes appear to be more closely related to nifEN than to either nifDK, vnfDK, or anfDK. Mutant strains (CA47 and CA48) carrying insertions in vnfE and vnfN, respectively, are able to grow diazotrophically in molybdenum (Mo)-deficient medium containing vanadium (V) (Vnf+) and in medium lacking both Mo and V (Anf+). However, a double mutant (strain DJ42.48) which contains a nifEN deletion and an insertion in vnfE is unable to grow diazotrophically in Mo-sufficient medium or in Mo-deficient medium with or without V. This suggests that NifE and NifN substitute for VnfE and VnfN when the vnfEN genes are mutationally inactivated. AnfA is not required for the expression of a vnfN-lacZ transcriptional fusion, even though this fusion is expressed under Mo- and V-deficient diazotrophic growth conditions. PMID:1938952

  3. Studies on structure-based sequence alignment and phylogenies of beta-lactamases.

    PubMed

    Salahuddin, Parveen; Khan, Asad U

    2014-01-01

    The β-lactamases enzymes cleave the amide bond in β-lactam ring, rendering β-lactam antibiotics harmless to bacteria. In this communication we have studied structure-function relationship and phylogenies of class A, B and D beta-lactamases using structure-based sequence alignment and phylip programs respectively. The data of structure-based sequence alignment suggests that in different isolates of TEM-1, mutations did not occur at or near sequence motifs. Since deletions are reported to be lethal to structure and function of enzyme. Therefore, in these variants antibiotic hydrolysis profile and specificity will be affected. The alignment data of class A enzyme SHV-1, CTX-M-15, class D enzyme, OXA-10, and class B enzyme VIM-2 and SIM-1 show sequence motifs along with other part of polypeptide are essentially conserved. These results imply that conformations of betalactamases are close to native state and possess normal hydrolytic activities towards beta-lactam antibiotics. However, class B enzyme such as IMP-1 and NDM-1 are less conserved than other class A and D studied here because mutation and deletions occurred at critically important region such as active site. Therefore, the structure of these beta-lactamases will be altered and antibiotic hydrolysis profile will be affected. Phylogenetic studies suggest that class A and D beta-lactamases including TOHO-1 and OXA-10 respectively evolved by horizontal gene transfer (HGT) whereas other member of class A such as TEM-1 evolved by gene duplication mechanism. Taken together, these studies justify structure-function relationship of beta-lactamases and phylogenetic studies suggest these enzymes evolved by different mechanisms. PMID:24966539

  4. Studies on structure-based sequence alignment and phylogenies of beta-lactamases

    PubMed Central

    Salahuddin, Parveen; Khan, Asad U

    2014-01-01

    The β-lactamases enzymes cleave the amide bond in β-lactam ring, rendering β-lactam antibiotics harmless to bacteria. In this communication we have studied structure-function relationship and phylogenies of class A, B and D beta-lactamases using structure-based sequence alignment and phylip programs respectively. The data of structure-based sequence alignment suggests that in different isolates of TEM-1, mutations did not occur at or near sequence motifs. Since deletions are reported to be lethal to structure and function of enzyme. Therefore, in these variants antibiotic hydrolysis profile and specificity will be affected. The alignment data of class A enzyme SHV-1, CTX-M-15, class D enzyme, OXA-10, and class B enzyme VIM-2 and SIM-1 show sequence motifs along with other part of polypeptide are essentially conserved. These results imply that conformations of betalactamases are close to native state and possess normal hydrolytic activities towards beta-lactam antibiotics. However, class B enzyme such as IMP-1 and NDM-1 are less conserved than other class A and D studied here because mutation and deletions occurred at critically important region such as active site. Therefore, the structure of these beta-lactamases will be altered and antibiotic hydrolysis profile will be affected. Phylogenetic studies suggest that class A and D beta-lactamases including TOHO-1 and OXA-10 respectively evolved by horizontal gene transfer (HGT) whereas other member of class A such as TEM-1 evolved by gene duplication mechanism. Taken together, these studies justify structure-function relationship of beta-lactamases and phylogenetic studies suggest these enzymes evolved by different mechanisms. PMID:24966539

  5. Identification and Evaluation of Single-Nucleotide Polymorphisms in Allotetraploid Peanut (Arachis hypogaea L.) Based on Amplicon Sequencing Combined with High Resolution Melting (HRM) Analysis

    PubMed Central

    Hong, Yanbin; Pandey, Manish K.; Liu, Ying; Chen, Xiaoping; Liu, Hong; Varshney, Rajeev K.; Liang, Xuanqiang; Huang, Shangzhi

    2015-01-01

    The cultivated peanut (Arachis hypogaea L.) is an allotetraploid (AABB) species derived from the A-genome (Arachis duranensis) and B-genome (Arachis ipaensis) progenitors. Presence of two versions of a DNA sequence based on the two progenitor genomes poses a serious technical and analytical problem during single nucleotide polymorphism (SNP) marker identification and analysis. In this context, we have analyzed 200 amplicons derived from expressed sequence tags (ESTs) and genome survey sequences (GSS) to identify SNPs in a panel of genotypes consisting of 12 cultivated peanut varieties and two diploid progenitors representing the ancestral genomes. A total of 18 EST-SNPs and 44 genomic-SNPs were identified in 12 peanut varieties by aligning the sequence of A. hypogaea with diploid progenitors. The average frequency of sequence polymorphism was higher for genomic-SNPs than the EST-SNPs with one genomic-SNP every 1011 bp as compared to one EST-SNP every 2557 bp. In order to estimate the potential and further applicability of these identified SNPs, 96 peanut varieties were genotyped using high resolution melting (HRM) method. Polymorphism information content (PIC) values for EST-SNPs ranged between 0.021 and 0.413 with a mean of 0.172 in the set of peanut varieties, while genomic-SNPs ranged between 0.080 and 0.478 with a mean of 0.249. Total 33 SNPs were used for polymorphism detection among the parents and 10 selected lines from mapping population Y13Zh (Zhenzhuhei × Yueyou13). Of the total 33 SNPs, nine SNPs showed polymorphism in the mapping population Y13Zh, and seven SNPs were successfully mapped into five linkage groups. Our results showed that SNPs can be identified in allotetraploid peanut with high accuracy through amplicon sequencing and HRM assay. The identified SNPs were very informative and can be used for different genetic and breeding applications in peanut. PMID:26697032

  6. Nucleotide sequence of the Salmonella typhimurium mutS gene required for mismatch repair: homology of MutS and HexA of Streptococcus pneumoniae.

    PubMed Central

    Haber, L T; Pang, P P; Sobell, D I; Mankovich, J A; Walker, G C

    1988-01-01

    The mutS gene product of Escherichia coli and Salmonella typhimurium is one of at least four proteins required for methyl-directed mismatch repair in these organisms. A functionally similar repair system in Streptococcus pneumoniae requires the hex genes. We have sequenced the S. typhimurium mutS gene, showing that it encodes a 96-kilodalton protein. Amino-terminal amino acid sequencing of purified S. typhimurium MutS protein confirmed the initial portion of the deduced amino acid sequence. The S. typhimurium MutS protein is homologous to the S. pneumoniae HexA protein, suggesting that they arose from a common ancestor before the gram-negative and gram-positive bacteria diverged. Overall, approximately 36% of the amino acids of the two proteins are identical when the sequences are optimally aligned, including regions of stronger homology which are of particular interest. One such region is close to the amino terminus. Another, located closer to the carboxy terminus, includes homology to a consensus sequence thought to be diagnostic of nucleotide-binding sites. A third one, adjacent to the second, is homologous to the consensus sequence for the helix-turn-helix motif found in many DNA-binding proteins. We found that the S. typhimurium MutS protein can substitute for the E. coli MutS protein in vitro as it can in vivo, but we have not yet been able to demonstrate a similar in vitro complementation by the S. pneumoniae HexA protein. PMID:3275609

  7. Identification and Evaluation of Single-Nucleotide Polymorphisms in Allotetraploid Peanut (Arachis hypogaea L.) Based on Amplicon Sequencing Combined with High Resolution Melting (HRM) Analysis.

    PubMed

    Hong, Yanbin; Pandey, Manish K; Liu, Ying; Chen, Xiaoping; Liu, Hong; Varshney, Rajeev K; Liang, Xuanqiang; Huang, Shangzhi

    2015-01-01

    The cultivated peanut (Arachis hypogaea L.) is an allotetraploid (AABB) species derived from the A-genome (Arachis duranensis) and B-genome (Arachis ipaensis) progenitors. Presence of two versions of a DNA sequence based on the two progenitor genomes poses a serious technical and analytical problem during single nucleotide polymorphism (SNP) marker identification and analysis. In this context, we have analyzed 200 amplicons derived from expressed sequence tags (ESTs) and genome survey sequences (GSS) to identify SNPs in a panel of genotypes consisting of 12 cultivated peanut varieties and two diploid progenitors representing the ancestral genomes. A total of 18 EST-SNPs and 44 genomic-SNPs were identified in 12 peanut varieties by aligning the sequence of A. hypogaea with diploid progenitors. The average frequency of sequence polymorphism was higher for genomic-SNPs than the EST-SNPs with one genomic-SNP every 1011 bp as compared to one EST-SNP every 2557 bp. In order to estimate the potential and further applicability of these identified SNPs, 96 peanut varieties were genotyped using high resolution melting (HRM) method. Polymorphism information content (PIC) values for EST-SNPs ranged between 0.021 and 0.413 with a mean of 0.172 in the set of peanut varieties, while genomic-SNPs ranged between 0.080 and 0.478 with a mean of 0.249. Total 33 SNPs were used for polymorphism detection among the parents and 10 selected lines from mapping population Y13Zh (Zhenzhuhei × Yueyou13). Of the total 33 SNPs, nine SNPs showed polymorphism in the mapping population Y13Zh, and seven SNPs were successfully mapped into five linkage groups. Our results showed that SNPs can be identified in allotetraploid peanut with high accuracy through amplicon sequencing and HRM assay. The identified SNPs were very informative and can be used for different genetic and breeding applications in peanut. PMID:26697032

  8. Feature-based multiexposure image-sequence fusion with guided filter and image alignment

    NASA Astrophysics Data System (ADS)

    Xu, Liang; Du, Junping; Zhang, Zhenhong

    2015-01-01

    Multiexposure fusion images have a higher dynamic range and reveal more details than a single captured image of a real-world scene. A clear and intuitive feature-based fusion technique for multiexposure image sequences is conceptually proposed. The main idea of the proposed method is to combine three image features [phase congruency (PC), local contrast, and color saturation] to obtain weight maps of the images. Then, the weight maps are further refined using a guided filter which can improve their accuracy. The final fusion result is constructed using the weighted sum of the source image sequence. In addition, for multiexposure image-sequence fusion involving dynamic scenes containing moving objects, ghost artifacts can easily occur if fusion is directly performed. Therefore, an image-alignment method is first used to adjust the input images to correspond to a reference image, after which fusion is performed. Experimental results demonstrate that the proposed method has a superior performance compared to the existing methods.

  9. CUDA ClustalW: An efficient parallel algorithm for progressive multiple sequence alignment on Multi-GPUs.

    PubMed

    Hung, Che-Lun; Lin, Yu-Shiang; Lin, Chun-Yuan; Chung, Yeh-Ching; Chung, Yi-Fang

    2015-10-01

    For biological applications, sequence alignment is an important strategy to analyze DNA and protein sequences. Multiple sequence alignment is an essential methodology to study biological data, such as homology modeling, phylogenetic reconstruction and etc. However, multiple sequence alignment is a NP-hard problem. In the past decades, progressive approach has been proposed to successfully align multiple sequences by adopting iterative pairwise alignments. Due to rapid growth of the next generation sequencing technologies, a large number of sequences can be produced in a short period of time. When the problem instance is large, progressive alignment will be time consuming. Parallel computing is a suitable solution for such applications, and GPU is one of the important architectures for contemporary parallel computing researches. Therefore, we proposed a GPU version of ClustalW v2.0.11, called CUDA ClustalW v1.0, in this work. From the experiment results, it can be seen that the CUDA ClustalW v1.0 can achieve more than 33× speedups for overall execution time by comparing to ClustalW v2.0.11. PMID:26052076

  10. A comparison of nucleotide sequences of measles virus L genes derived from wild-type viruses and SSPE brain tissues.

    PubMed

    Komase, K; Rima, B K; Pardowitz, I; Kunz, C; Billeter, M A; ter Meulen, V; Baczko, K

    1995-04-20

    The nucleotide sequences of the large protein (L) gene derived from two wild-type measles viruses (MV) and two SSPE brain-derived viruses have been determined. All sequences have single large open reading frames encoding 2183 amino acid residues. The deduced L proteins are well conserved and the proposed functional domains which have been identified for rhabdo- and paramyxoviruses are completely conserved in all strains. The degree of variability of L proteins is the lowest of all structural proteins of MV, reflecting its role in virus reproduction and persistence. Biased hypermutation was not observed in the L genes derived from SSPE brain tissue. None of the nucleotide changes can be associated with the attenuated phenotype of the Edmonston vaccine viruses. PMID:7747453

  11. Phylogeny of Bipolaris inferred from nucleotide sequences of Brn1, a reductase gene involved in melanin biosynthesis.

    PubMed

    Shimizu, Kiminori; Tanaka, Chihiro; Peng, You-Liang; Tsuda, Mitsuya

    1998-08-01

    The Brn1 reductase melanin biosynthesis gene in the fungal genus Bipolaris was sequenced in 74 strains of 22 species. The Brn1 region was highly conserved among the species examined at the nucleotide and the amino acid levels. To elucidate the phylogenetic relationships among Bipolaris species, trees were inferred from nucleotide sequences of this region. Species in these trees formed exclusive clusters clearly separated from one another, except for B. panici-miliacei and B. setariae, and B. victoriae and B. zeicola. When unidentified strains were added to this tree, they fell within known species or formed independent clusters. These data indicated that the Brn1 gene region was suitable for species-level systematics within the genus. The results also suggest that Bipolaris consists of two or more clades that may reflect teleomorphic connections. PMID:12501419

  12. Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine

    PubMed Central

    Ye, Hao; Meehan, Joe; Tong, Weida; Hong, Huixiao

    2015-01-01

    Precision medicine or personalized medicine has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to make a decision on which alignment algorithm to use in their studies. Selection of the right alignment algorithm determines not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review current available algorithms and their major strategies such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long reads alignment and alignment facilitated with known genetic variants. PMID:26610555

  13. A tabu search algorithm for post-processing multiple sequence alignment.

    PubMed

    Riaz, Tariq; Yi, Wang; Li, Kuo-Bin

    2005-02-01

    Tabu search is a meta-heuristic approach that is proven to be useful in solving combinatorial optimization problems. We implement the adaptive memory features of tabu search to refine a multiple sequence alignment. Adaptive memory helps the search process to avoid local optima and explores the solution space economically and effectively without getting trapped into cycles. The algorithm is further enhanced by introducing extended tabu search features such as intensification and diversification. The neighborhoods of a solution are generated stochastically and a consistency-based objective function is employed to measure its quality. The algorithm is tested with the datasets from BAliBASE benchmarking database. We have observed through experiments that tabu search is able to improve the quality of multiple alignments generated by other software such as ClustalW and T-Coffee. The source code of our algorithm is available at http://www.bii.a-star.edu.sg/~tariq/tabu/. PMID:15751117

  14. Characterization of Newcastle disease virus isolates by reverse transcription PCR coupled to direct nucleotide sequencing and development of sequence database for pathotype prediction and molecular epidemiological analysis.

    PubMed Central

    Seal, B S; King, D J; Bennett, J D

    1995-01-01

    Degenerate oligonucleotide primers were synthesized to amplify nucleotide sequences from portions of the fusion protein and matrix protein genes of Newcastle disease virus (NDV) genomic RNA that could be used diagnostically. These primers were used in a single-tube reverse transcription PCR of NDV genomic RNA coupled to direct nucleotide sequencing of the amplified product to characterize more than 30 NDV isolates. In agreement with previous reports, differences in the fusion protein cleavage sequence that correlated genotypically with virulence among various NDV pathotypes were detected. By using sequences generated from the matrix protein gene coding for the nuclear localization signal, lentogenic viruses were again grouped phylogenetically separate from other pathotypes. These techniques were applied to compare neurotropic velogenic viruses isolated from an outbreak of Newcastle disease in cormorants and turkeys. Cormorant NDV isolates and an NDV isolate from an infected turkey flock in North Dakota had the fusion protein cleavage sequence 109SRGRRQKRFVG119. The R-for-G substitution at position 110 may be unique for the cormorant-type isolates. Although the amino acid sequences from the fusion protein cleavage site were identical, nucleotide sequence data correlate the outbreak in turkeys to a cormorant virus isolate from Minnesota and not to a cormorant virus isolate from Michigan. On the basis of sequence information, the cormorant isolates are virulent viruses related to isolates of psittacine origin, possibly genotypically distinct from other velogenic NDV isolates. These techniques can be used reliably for Newcastle disease epidemiology and for prediction of pathotypes of NDV isolates without traditional live-bird inoculations. PMID:8567895

  15. T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension

    PubMed Central

    Di Tommaso, Paolo; Moretti, Sebastien; Xenarios, Ioannis; Orobitg, Miquel; Montanyola, Alberto; Chang, Jia-Ming; Taly, Jean-François; Notredame, Cedric

    2011-01-01

    This article introduces a new interface for T-Coffee, a consistency-based multiple sequence alignment program. This interface provides an easy and intuitive access to the most popular functionality of the package. These include the default T-Coffee mode for protein and nucleic acid sequences, the M-Coffee mode that allows combining the output of any other aligners, and template-based modes of T-Coffee that deliver high accuracy alignments while using structural or homology derived templates. These three available template modes are Expresso for the alignment of protein with a known 3D-Structure, R-Coffee to align RNA sequences with conserved secondary structures and PSI-Coffee to accurately align distantly related sequences using homology extension. The new server benefits from recent improvements of the T-Coffee algorithm and can align up to 150 sequences as long as 10 000 residues and is available from both http://www.tcoffee.org and its main mirror http://tcoffee.crg.cat. PMID:21558174

  16. Complete nucleotide sequence and gene rearrangement of the mitochondrial genome of the bell-ring frog, Buergeria buergeri (family Rhacophoridae).

    PubMed

    Sano, Naomi; Kurabayashi, Atsushi; Fujii, Tamotsu; Yonekawa, Hiromichi; Sumida, Masayuki

    2004-06-01

    In this study we determined the complete nucleotide sequence (19,959 bp) of the mitochondrial DNA of the rhacophorid frog Buergeria buergeri. The gene content, nucleotide composition, and codon usage of B. buergeri conformed to those of typical vertebrate patterns. However, due to an accumulation of lengthy repetitive sequences in the D-loop region, this species possesses the largest mitochondrial genome among all the vertebrates examined so far. Comparison of the gene organizations among amphibian species (Rana, Xenopus, salamanders and caecilians) revealed that the positioning of four tRNA genes and the ND5 gene in the mtDNA of B. buergeri diverged from the common vertebrate gene arrangement shared by Xenopus, salamanders and caecilians. The unique positions of the tRNA genes in B. buergeri are shared by ranid frogs, indicating that the rearrangements of the tRNA genes occurred in a common ancestral lineage of ranids and rhacophorids. On the other hand, the novel position of the ND5 gene seems to have arisen in a lineage leading to rhacophorids (and other closely related taxa) after ranid divergence. Phylogenetic analysis based on nucleotide sequence data of all mitochondrial genes also supported the gene rearrangement pathway. PMID:15329496

  17. Nucleotide sequence of the cDNA encoding the precursor of the beta subunit of rat lutropin.

    PubMed Central

    Chin, W W; Godine, J E; Klein, D R; Chang, A S; Tan, L K; Habener, J F

    1983-01-01

    We have determined the nucleotide sequences of cDNAs encoding the precursor of the beta subunit of rat lutropin, a polypeptide hormone that regulates gonadal function, including the development of gametes and the production of steroid sex hormones. The cDNAs were prepared from poly(A)+ RNA derived from the pituitary glands of rats 4 weeks after ovariectomy and were cloned in bacterial plasmids. Bacterial colonies containing transfected plasmids were screened by hybridization with a 32P-labeled cDNA encoding the beta subunit of human chorionic gonadotropin, a protein that is related in structure to lutropin. Several recombinant plasmids were detected that by nucleotide sequence analyses contained coding sequences for the precursor of the beta subunit of lutropin. Complete determination of the nucleotide sequences of these cDNAs, as well as of cDNA reverse-transcribed from pituitary poly(A)+ RNA by using a synthetic pentadecanucleotide as a primer of RNA, provided the entire 141-codon sequence of the precursor of the beta subunit of rat lutropin. The precursor consists of a 20 amino acid leader (signal) peptide and an apoprotein of 121 amino acids. The amino acid sequence of the rat lutropin beta subunit shows similarity to the beta subunits of the ovine/bovine, porcine, and human lutropins (81, 86, and 74% of amino acids identical, respectively). Blot hybridization of pituitary RNAs separated by electrophoresis on agarose gels showed that the mRNA encoding the lutropin beta subunit consists of approximately 700 bases. The availability of cDNAs for both the alpha and beta subunits of lutropin will facilitate studies of the regulation of lutropin expression. Images PMID:6192440

  18. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. PMID:25625550

  19. Alignment of high-throughput sequencing data inside in-memory databases.

    PubMed

    Firnkorn, Daniel; Knaup-Gregori, Petra; Lorenzo Bermejo, Justo; Ganzinger, Matthias

    2014-01-01

    In times of high-throughput DNA sequencing techniques, performance-capable analysis of DNA sequences is of high importance. Computer supported DNA analysis is still an intensive time-consuming task. In this paper we explore the potential of a new In-Memory database technology by using SAP's High Performance Analytic Appliance (HANA). We focus on read alignment as one of the first steps in DNA sequence analysis. In particular, we examined the widely used Burrows-Wheeler Aligner (BWA) and implemented stored procedures in both, HANA and the free database system MySQL, to compare execution time and memory management. To ensure that the results are comparable, MySQL has been running in memory as well, utilizing its integrated memory engine for database table creation. We implemented stored procedures, containing exact and inexact searching of DNA reads within the reference genome GRCh37. Due to technical restrictions in SAP HANA concerning recursion, the inexact matching problem could not be implemented on this platform. Hence, performance analysis between HANA and MySQL was made by comparing the execution time of the exact search procedures. Here, HANA was approximately 27 times faster than MySQL which means, that there is a high potential within the new In-Memory concepts, leading to further developments of DNA analysis procedures in the future. PMID:25160230

  20. Nucleotides critical for the interaction of the Streptococcus pyogenes Mga virulence regulator with Mga-regulated promoter sequences.

    PubMed

    Hause, Lara L; McIver, Kevin S

    2012-09-01

    The Mga regulator of Streptococcus pyogenes directly activates the transcription of a core regulon that encodes virulence factors such as M protein (emm), C5a peptidase (scpA), and streptococcal inhibitor of complement (sic) by directly binding to a 45-bp binding site as determined by an electrophoretic mobility shift assay (EMSA) and DNase I protection. However, by comparing the nucleotide sequences of all established Mga binding sites, we found that they exhibit only 13.4% identity with no discernible symmetry. To determine the core nucleotides involved in functional Mga-DNA interactions, the M1T1 Pemm1 binding site was altered and screened for nucleotides important for DNA binding in vitro and for transcriptional activation using a plasmid-based luciferase reporter in vivo. Following this analysis, 34 nucleotides within the Pemm1 binding site that had an effect on Mga binding, Mga-dependent transcriptional activation, or both were identified. Of these critical nucleotides, guanines and cytosines within the major groove were disproportionately identified clustered at the 5' and 3' ends of the binding site and with runs of nonessential adenines between the critical nucleotides. On the basis of these results, a Pemm1 minimal binding site of 35 bp bound Mga at a level comparable to the level of binding of the larger 45-bp site. Comparison of Pemm with directed mutagenesis performed in the M1T1 Mga-regulated PscpA and Psic promoters, as well as methylation interference analysis of PscpA, establish that Mga binds to DNA in a promoter-specific manner. PMID:22773785

  1. Nucleotide sequence analysis and seroreactivities of the 65K heat shock protein from Mycobacterium paratuberculosis.

    PubMed Central

    el-Zaatari, F A; Naser, S A; Engstrand, L; Burch, P E; Hachem, C Y; Whipple, D L; Graham, D Y

    1995-01-01

    Mycobacterium paratuberculosis is the causative agent of Johne's disease, a chronic enteritis in ruminants. It has also been implicated as a possible cause of Crohn's disease, an inflammatory bowel disease of unknown etiology. The mycobacterial 65K heat shock proteins (hsp-65K) are among the most extensively studied mycobacterial proteins, and their immunogenic characteristics have been suggested to be the basis for autoimmunization in chronic inflammatory diseases. In this context, we isolated and sequenced the hsp-65K-encoding gene from our M. paratuberculosis PTB65K genomic library. A high degree of identity was found between the open reading frame (ORF) of the PTB65K gene and those of Mycobacterium tuberculosis (89.6%), Mycobacterium leprae (86.6%), and Mycobacterium avium 18 (98.8%). The amino acid sequence alignment of the PTB65K protein with the hsp-65K homologs revealed that the M. tuberculosis and M. leprae proteins each differed by 36 amino acid residues and that the M. avium 18 protein differed by 8 residues. We also investigated the humoral immune responses of animals with Johne's disease and patients with Crohn's disease against the recombinant PTB65K antigen. Immunoblot analysis showed that sera from only 3 of 10 clinically ill and 5 of 25 subclinically ill cows reacted with PTB65K. In addition, sera from two of two sheep and one of two goats with clinical symptoms of Johne's disease also reacted with PTB65K; 0 samples from 10 normal cows reacted. In humans, sera from 7 of 13 patients with Crohn's disease, 3 of 4 with tuberculosis, 5 of 6 with leprosy, 5 of 12 with non-inflammatory bowel disease, and 0 of 4 with ulcerative colitis reacted with the recombinant PTB65K antigen. These results indicate that this PTB65K heat shock protein is uninformative when used for serodiagnosis of Johne's disease in animals. However, in humans, the high intensity of antibody reactions of some sera from Crohn's disease patients compared with that from noninflammatory

  2. Open-Phylo: a customizable crowd-computing platform for multiple sequence alignment.

    PubMed

    Kwak, Daniel; Kam, Alfred; Becerra, David; Zhou, Qikuan; Hops, Adam; Zarour, Eleyine; Kam, Arthur; Sarmenta, Luis; Blanchette, Mathieu; Waldispühl, Jérôme

    2013-01-01

    Citizen science games such as Galaxy Zoo, Foldit, and Phylo aim to harness the intelligence and processing power generated by crowds of online gamers to solve scientific problems. However, the selection of the data to be analyzed through these games is under the exclusive control of the game designers, and so are the results produced by gamers. Here, we introduce Open-Phylo, a freely accessible crowd-computing platform that enables any scientist to enter our system and use crowds of gamers to assist computer programs in solving one of the most fundamental problems in genomics: the multiple sequence alignment problem. PMID:24148814

  3. Nucleotide sequence and organization of the human S-protein gene: repeating peptide motifs in the pexin family and a model for their evolution

    SciTech Connect

    Jenne, D.; Stanley, K.K.

    1987-10-20

    The S-protein/vitronectin gene was isolated from a human genomic DNA library, and its sequence of about 5.3 kilobases including the adjacent 5' and 3' flanking regions was established. Alignment of the genomic DNA nucleotide sequence and the cDNA sequence indicated that the gene consisted of eight exons and seven introns. The intron positions in the S-protein gene and their phase type were compared to those in the hemopexin gene which shares amino acid sequence homologies with transin and the S-protein. Three introns have been found at equivalent positions; two other introns are very close to these positions and are interpreted as cases of intron sliding. Introns 3-7 occur at a conserved glycine residue within repeating peptide segments, whereas introns 1 and 2 are at the boundaries of the Somatomedin B domain of S-protein. The analysis of the exon structure in relations to repeating peptide motifs within the S-protein strongly suggest that it contains only seven repeats, one less than the hemopexin molecule. A very similar repeat pattern like that in hemopexin is shown to be present also in two other related proteins, transin and interstitial collagenase. An evolutionary model for the generation of the repeat pattern in the S-protein and the other members of this novel pexin gene family is proposed, and the sequence modifications for some of the repeats during divergent evolution are discussed in relation to know unique functional properties of hemopexin and S-protein.

  4. Two Simple and Efficient Algorithms to Compute the SP-Score Objective Function of a Multiple Sequence Alignment

    PubMed Central

    Ranwez, Vincent

    2016-01-01

    Background Multiple sequence alignment (MSA) is a crucial step in many molecular analyses and many MSA tools have been developed. Most of them use a greedy approach to construct a first alignment that is then refined by optimizing the sum of pair score (SP-score). The SP-score estimation is thus a bottleneck for most MSA tools since it is repeatedly required and is time consuming. Results Given an alignment of n sequences and L sites, I introduce here optimized solutions reaching O(nL) time complexity for affine gap cost, instead of O(n2L), which are easy to implement. PMID:27505054

  5. SMRT Sequencing of Long Tandem Nucleotide Repeats in SCA10 Reveals Unique Insight of Repeat Expansion Structure

    PubMed Central

    Landrian, Ivette; Godiska, Ronald; Shanker, Savita; Yu, Fahong; Farmerie, William G.; Ashizawa, Tetsuo

    2015-01-01

    A large, non-coding ATTCT repeat expansion causes the neurodegenerative disorder, spinocerebellar ataxia type 10 (SCA10). In a subset of SCA10 patients, interruption motifs are present at the 5’ end of the expansion and strongly correlate with epileptic seizures. Thus, interruption motifs are a predictor of the epileptic phenotype and are hypothesized to act as a phenotypic modifier in SCA10. Yet, the exact internal sequence structure of SCA10 expansions remains unknown due to limitations in current technologies for sequencing across long extended tracts of tandem nucleotide repeats. We used the third generation sequencing technology, Single Molecule Real Time (SMRT) sequencing, to obtain full-length contiguous expansion sequences, ranging from 2.5 to 4.4 kb in length, from three SCA10 patients with different clinical presentations. We obtained sequence spanning the entire length of the expansion and identified the structure of known and novel interruption motifs within the SCA10 expansion. The exact interruption patterns in expanded SCA10 alleles will allow us to further investigate the potential contributions of these interrupting sequences to the pathogenic modification leading to the epilepsy phenotype in SCA10. Our results also demonstrate that SMRT sequencing is useful for deciphering long tandem repeats that pose as “gaps” in the human genome sequence. PMID:26295943

  6. Human adult T-cell leukemia virus: complete nucleotide sequence of the provirus genome integrated in leukemia cell DNA.

    PubMed Central

    Seiki, M; Hattori, S; Hirayama, Y; Yoshida, M

    1983-01-01

    Human retrovirus adult T-cell leukemia virus (ATLV) has been shown to be closely associated with human adult T-cell leukemia (ATL) [Yoshida, M., Miyoshi, I. & Hinuma, Y. (1982) Proc. Natl. Acad. Sci. USA 79, 2031-2035]. The provirus of ATLV integrated in DNA of leukemia T cells from a patient with ATL was molecularly cloned and the complete nucleotide sequence of 9,032 bases of the proviral genome was determined. The provirus DNA contains two long terminal repeats (LTRs) consisting of 755 bases, one at each end, which are flanked by a 6-base direct repeat of the cellular DNA sequence. The nucleotides in the LTR could be arranged into a unique secondary structure, which could explain transcriptional termination within the 3' LTR but not in the 5' LTR. The nucleotide sequence of the provirus contains three large open reading frames, which are capable of coding for proteins of 48,000, 99,000, and 54,000 daltons. The three open frames are in this order from the 5' end of the viral genome and the predicted 48,000-dalton polypeptide is a precursor of gag proteins, because it has an identical amino acid sequence to that of the NH2 terminus of human T-cell leukemia virus (HTLV) p24. The open frames coding for 99,000- and 54,000-dalton polypeptides are thought to be the pol and env genes, respectively. On the 3' side of these three open frames, the ATLV sequence has four smaller open frames in various phases; these frames may code for 10,000-, 11,000-, 12,000-, and 27,000-dalton polypeptides. Although one or some of these open frames could be the transforming gene of this virus, in preliminary analysis, DNA of this region has no homology with the normal human genome. PMID:6304725

  7. EzEditor: a versatile sequence alignment editor for both rRNA- and protein-coding genes.

    PubMed

    Jeon, Yoon-Seong; Lee, Kihyun; Park, Sang-Cheol; Kim, Bong-Soo; Cho, Yong-Joon; Ha, Sung-Min; Chun, Jongsik

    2014-02-01

    EzEditor is a Java-based molecular sequence editor allowing manipulation of both DNA and protein sequence alignments for phylogenetic analysis. It has multiple features optimized to connect initial computer-generated multiple alignment and subsequent phylogenetic analysis by providing manual editing with reference to biological information specific to the genes under consideration. It provides various functionalities for editing rRNA alignments using secondary structure information. In addition, it supports simultaneous editing of both DNA sequences and their translated protein sequences for protein-coding genes. EzEditor is, to our knowledge, the first sequence editing software designed for both rRNA- and protein-coding genes with the visualization of biologically relevant information and should be useful in molecular phylogenetic studies. EzEditor is based on Java, can be run on all major computer operating systems and is freely available from http://sw.ezbiocloud.net/ezeditor/. PMID:24425826

  8. Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks

    PubMed Central

    Ma, Qicheng; Chirn, Gung-Wei; Cai, Richard; Szustakowski, Joseph D; Nirmala, NR

    2005-01-01

    Background The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes. Results Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same Interpro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric. Conclusion Benchmarking studies of our algorithm versus those reported in the literature shows that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozens of prokaryotic genomes. PMID:16202129

  9. Unifying bacteria from decaying wood with various ubiquitous Gibbsiella species as G. acetica sp. nov. based on nucleotide sequence similarities and their acetic acid secretion.

    PubMed

    Geider, Klaus; Gernold, Marina; Jock, Susanne; Wensing, Annette; Völksch, Beate; Gross, Jürgen; Spiteller, Dieter

    2015-12-01

    Bacteria were isolated from necrotic apple and pear tree tissue and from dead wood in Germany and Austria as well as from pear tree exudate in China. They were selected for growth at 37 °C, screened for levan production and then characterized as Gram-negative, facultatively anaerobic rods. Nucleotide sequences from 16S rRNA genes, the housekeeping genes dnaJ, gyrB, recA and rpoB alignments, BLAST searches and phenotypic data confirmed by MALDI-TOF analysis showed that these bacteria belong to the genus Gibbsiella and resembled strains isolated from diseased oaks in Britain and Spain. Gibbsiella-specific PCR primers were designed from the proline isomerase and the levansucrase genes. Acid secretion was investigated by screening for halo formation on calcium carbonate agar and the compound identified by NMR as acetic acid. Its production by Gibbsiella spp. strains was also determined in culture supernatants by GC/MS analysis after derivatization with pentafluorobenzyl bromide. Some strains were differentiated by the PFGE patterns of SpeI digests and by sequence analyses of the lsc and the ppiD genes, and the Chinese Gibbsiella strain was most divergent. The newly investigated bacteria as well as Gibbsiella querinecans, Gibbsiella dentisursi and Gibbsiella papilionis, isolated in Britain, Spain, Korea and Japan, are taxonomically related Enterobacteriaceae, tolerate and secrete acetic acid. We therefore propose to unify them in the species Gibbsiella acetica sp. nov. PMID:26071988

  10. Utilizing mapping targets of sequences underrepresented in the reference assembly to reduce false positive alignments

    PubMed Central

    Miga, Karen H.; Eisenhart, Christopher; Kent, W. James

    2015-01-01

    The human reference assembly remains incomplete due to the underrepresentation of repeat-rich sequences that are found within centromeric regions and acrocentric short arms. Although these sequences are marginally represented in the assembly, they are often fully represented in whole-genome short-read datasets and contribute to inappropriate alignments and high read-depth signals that localize to a small number of assembled homologous regions. As a consequence, these regions often provide artifactual peak calls that confound hypothesis testing and large-scale genomic studies. To address this problem, we have constructed mapping targets that represent roughly 8% of the human genome generally omitted from the human reference assembly. By integrating these data into standard mapping and peak-calling pipelines we demonstrate a 10-fold reduction in signals in regions common to the blacklisted region and identify a comprehensive set of regions that exhibit mapping sensitivity with the presence of the repeat-rich targets. PMID:26163063

  11. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences

    PubMed Central

    Pratas, Diogo; Silva, Raquel M.; Pinho, Armando J.; Ferreira, Paulo J.S.G.

    2015-01-01

    Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail. PMID:25984837

  12. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences.

    PubMed

    Pratas, Diogo; Silva, Raquel M; Pinho, Armando J; Ferreira, Paulo J S G

    2015-01-01

    Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail. PMID:25984837

  13. Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments

    PubMed Central

    2014-01-01

    Background Several methods are available for the detection of covarying positions from a multiple sequence alignment (MSA). If the MSA contains a large number of sequences, information about the proximities between residues derived from covariation maps can be sufficient to predict a protein fold. However, in many cases the structure is already known, and information on the covarying positions can be valuable to understand the protein mechanism and dynamic properties. Results In this study we have sought to determine whether a multivariate (multidimensional) extension of traditional mutual information (MI) can be an additional tool to study covariation. The performance of two multidimensional MI (mdMI) methods, designed to remove the effect of ternary/quaternary interdependencies, was tested with a set of 9 MSAs each containing <400 sequences, and was shown to be comparable to that of the newest methods based on maximum entropy/pseudolikelyhood statistical models of protein sequences. However, while all the methods tested detected a similar number of covarying pairs among the residues separated by < 8 Å in the reference X-ray structures, there was on average less than 65% overlap between the top scoring pairs detected by methods that are based on different principles. Conclusions Given the large variety of structure and evolutionary history of different proteins it is possible that a single best method to detect covariation in all proteins does not exist, and that for each protein family the best information can be derived by merging/comparing results obtained with different methods. This approach may be particularly valuable in those cases in which the size of the MSA is small or the quality of the alignment is low, leading to significant differences in the pairs detected by different methods. PMID:24886131

  14. Nucleotide sequences and mutations of the 5'-nontranslated region (5'NTR) of natural isolates of an epidemic echovirus 11' (prime).

    PubMed

    Szendrõi, A; El-Sageyer, M; Takács, M; Mezey, I; Berencsi, G

    2000-01-01

    An echovirus 11' (prime) virus caused an epidemic in Hungary in 1989. The leading clinical form of the diseases was myocarditis. Hemorrhagic hepatitis syndroms were also caused, however, with lethal outcome in 13 newborn babies. Altogether 386 children suffered from registered clinical disease. No accumulation of serous meningitis cases and intrauterine death were observed during the epidemic, and the monovalent oral poliovirus vaccination campaign has prevented the further circulation of the virus. The 5'-nontranslated region (5'-NTR) of 12 natural isolates were sequenced (nucleotides: 260-577). The 5'-NTR was found to be different from that of the prototype Gregory strain (X80059) of EV11 (less than 90% identity), but related to the swine vesicular disease virus (D16364) SVDV and EV9 (X92886) as indicated by the best fitting dendogram. The examination of the variable nucleotides in the internal ribosomal entry site (IRES) revealed, that the nucleotide sequence of a region of the epidemic 5'-NTR was identical to that of coxsackievirus B2. Five of the epidemic isolates were found to carry mutations. Seven EV11' IRES elements possessed identical sequences indicating, that the virus has evolved before its arrival to Hungary. The comparative examination of the suboptimal secondary structures revealed, that no one of the mutations affected the secondary structure of stem-loop structures IV and V in the IRES elements. Although it has been shown previously, that the echovirus group is genetically coherent and related to coxsackie B viruses the sequence differences in the epidemic isolates resulted in profound modification of the central stem (residues 477-529) of stem-loop structure No.V known to be affecting neurovirulence of polioviruses. Two alternate cloverleaf (stem-loop) structures were also recognised (nucleotides 376 to 460 and 540 to 565) which seem to mask both regions of the IRES element complementary to the 3'-end of the 18 S rRNA (460 to 466 and 561 to 570

  15. Amino acid sequence alignment of bacterial and mammalian pancreatic serine proteases based on topological equivalences.

    PubMed

    James, M N; Delbaere, L T; Brayer, G D

    1978-06-01

    The three-dimensional structures of the bacterial serine proteases SGPA, SGPB, and alpha-lytic protease have been compared with those of the pancreatic enzymes alpha-chymotrypsin and elastase. This comparison shows that approximately 60% (55-64%) of the alpha-carbon atom positions of the bacterial serine proteases are topologically equivalent to the alpha-carbon atom positions of the pancreatic enzymes. The corresponding value for a comparison of the bacterial enzymes among themselves is approximately 84%. The results of these topological comparisons have been used to deduce an experimentally sound sequence alignment for these several enzymes. This alignment shows that there is extensive tertiary structural homology among the bacteria and pancreatic enzymes without significant primary sequence identity (less than 21%). The acquisition of a zymogen function by the pancreatic enzymes is accompanied by two major changes to the bacterial enzymes' architecture: an insertion of 9 residues to increase the length of the N-terminal loop, and one of 12 residues to a loop near the activation salt bridge. In addition, in these two enzyme families, the methionine loop (residues 164-182) adopts very different comformations which are associated with their altered substrate specificities. PMID:96920

  16. Glucitol-specific enzymes of the phosphotransferase system in Escherichia coli. Nucleotide sequence of the gut operon.

    PubMed

    Yamada, M; Saier, M H

    1987-04-25

    The complete nucleotide sequence of the glucitol (gut) operon in Escherichia coli has been determined. The glucitol-specific Enzyme II and Enzyme III of the phosphoenolpyruvate:sugar phosphotransferase system as well as glucitol-6-phosphate dehydrogenase which are encoded by the gutA, gutB, and gutD genes of the gut operon, respectively, are predicted to consist of 506 (Mr = 54,018), 123 (Mr = 13,306), and 259 (Mr = 27,866) amino acyl residues, respectively. The hydropathic profile of the Enzyme IIgut revealed 7 or 8 long hydrophobic segments which may traverse the cell membrane as alpha-helices as well as 2 or 4 short strongly hydrophobic stretches which may traverse the membrane as beta-structure. The number of amino acyl residues in the sum of the molecular weights of the glucitol Enzyme II-III pair are nearly the same as those of the mannitol Enzyme II. The ratio of hydrophobic to hydrophilic amino acyl residues and the numbers of the hydrophobic segments are also nearly the same for both transport systems. However, no significant homology was found in the nucleotide or amino acyl sequences of the two systems. Glucitol-6-phosphate dehydrogenase was found to exhibit sequence homology with ribitol dehydrogenase. A repetitive extragenic palindromic sequence was found in the 3'-flanking region of the gutD gene, suggesting the presence of a gene downstream from the gutD gene. PMID:3553176

  17. An Interpretation of the Ancestral Codon from Miller’s Amino Acids and Nucleotide Correlations in Modern Coding Sequences

    PubMed Central

    Carels, Nicolas; de Leon, Miguel Ponce

    2015-01-01

    Purine bias, which is usually referred to as an “ancestral codon”, is known to result in short-range correlations between nucleotides in coding sequences, and it is common in all species. We demonstrate that RWY is a more appropriate pattern than the classical RNY, and purine bias (Rrr) is the product of a network of nucleotide compensations induced by functional constraints on the physicochemical properties of proteins. Through deductions from universal correlation properties, we also demonstrate that amino acids from Miller’s spark discharge experiment are compatible with functional primeval proteins at the dawn of living cell radiation on earth. These amino acids match the hydropathy and secondary structures of modern proteins. PMID:25922573

  18. Development of Single Nucleotide Polymorphism Markers via Sequence-based Genotyping in Cotton (Gossypium spp)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    High-throughput single nucleotide polymorphism (SNP) genotyping has become the dominant approach to genomic analysis and genetic manipulation in many crop plants. In cotton (Gossypium spp), however, only a very limited number of loci and a dearth of information have been generated from SNP genotypi...

  19. Complete nucleotide sequence of Rose yellow leaf virus, a new member of the family Tombusviridae

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The genome of the Rose yellow leaf virus (RYLV) has been determined to be 3918 nucleotides containing seven open reading frames (ORFs). ORF1 encodes a 27 kDa peptide (p27). ORF2 shares a common start codon with ORF1 and continues through the amber stop codon of p27 to encode a 87 kDa (p87) protein t...

  20. MapNext: a software tool for spliced and unspliced alignments and SNP detection of short sequence reads

    PubMed Central

    2009-01-01

    Background Next-generation sequencing technologies provide exciting avenues for studies of transcriptomics and population genomics. There is an increasing need to conduct spliced and unspliced alignments of short transcript reads onto a reference genome and estimate minor allele frequency from sequences of population samples. Results We have designed and implemented MapNext, a software tool for both spliced and unspliced alignments of short sequence reads onto reference sequences, and automated SNP detection using neighbourhood quality standards. MapNext provides four main analyses: (i) unspliced alignment and clustering of reads, (ii) spliced alignment of transcript reads over intron boundaries, (iii) SNP detection and estimation of minor allele frequency from population sequences, and (iv) storage of result data in a database to make it available for more flexible queries and for further analyses. The software tool has been tested using both simulated and real data. Conclusion MapNext is a comprehensive and powerful tool for both spliced and unspliced alignments of short reads and automated SNP detection from population sequences. The simplicity, flexibility and efficiency of MapNext makes it a valuable tool for transcriptomic and population genomic research. PMID:19958476

  1. Extraction of high quality k-words for alignment-free sequence comparison.

    PubMed

    Gunasinghe, Upuli; Alahakoon, Damminda; Bedingfield, Susan

    2014-10-01

    The weighted Euclidean distance (D(2)) is one of the earliest dissimilarity measures used for alignment free comparison of biological sequences. This distance measure and its variants have been used in numerous applications due to its fast computation, and many variants of it have been subsequently introduced. The D(2) distance measure is based on the count of k-words in the two sequences that are compared. Traditionally, all k-words are compared when computing the distance. In this paper we show that similar accuracy in sequence comparison can be achieved by using a selected subset of k-words. We introduce a term variance based quality measure for identifying the important k-words. We demonstrate the application of the proposed technique in phylogeny reconstruction and show that up to 99% of the k-words can be filtered out for certain datasets, resulting in faster sequence comparison. The paper also presents an exploratory analysis based evaluation of optimal k-word values and discusses the impact of using subsets of k-words in such optimal instances. PMID:24846728

  2. Nucleotide sequences of fic and fic-1 genes involved in cell filamentation induced by cyclic AMP in Escherichia coli.

    PubMed Central

    Kawamukai, M; Matsuda, H; Fujii, W; Utsumi, R; Komano, T

    1989-01-01

    The nucleotide sequences of fic-1 involved in the cell filamentation induced by cyclic AMP in Escherichia coli and its normal counterpart fic were analyzed. The open reading frame of both fic-1 and fic coded for 200 amino acids. The Gly at position 55 in the Fic protein was changed to Arg in the Fic-1 protein. The promoter activity of fic was confirmed by fusing fic and lacZ. The gene downstream from fic was found to be pabA (p-aminobenzoate). There is an open reading frame (ORF190) coding for 190 amino acids upstream from the fic gene. Computer-assisted analysis showed that Fic has sequence similarity with part of CDC28 of Saccharomyces cerevisiae, CDC2 of Schizosaccharomyces pombe, and FtsA of E. coli. In addition, ORF190 has sequence similarity with the cyclosporin A-binding protein cyclophilin. PMID:2546924

  3. The complete nucleotide sequence of a 16S ribosomal RNA gene from a blue-green alga, Anacystis nidulans.

    PubMed

    Tomioka, N; Sugiura, M

    1983-01-01

    The complete nucleotide sequence of a 16S ribosomal RNA gene from a blue-green alga, Anacystis nidulans, has been determined. Its coding region is estimated to be 1,487 base pairs long, which is nearly identical to those reported for chloroplast 16S rRNA genes and is about 4% shorter than that of the Escherichia coli gene. The 16S rRNA sequence of A. nidulans has 83% homology with that of tobacco chloroplast and 74% homology with that of E. coli. Possible stem and loop structures of A. nidulans 16S rRNA sequences resemble more closely those of chloroplast 16S rRNAs than those of E. coli 16S rRNA. These observations support the endosymbiotic theory of chloroplast origin. PMID:6412038

  4. The nucleotide sequences of several tRNA genes from rat mitochondria: common features and relatedness to homologous species.

    PubMed Central

    Cantatore, P; De Benedetto, C; Gadaleta, G; Gallerani, R; Kroon, A M; Holtrop, M; Lanave, C; Pepe, G; Quagliariello, C; Saccone, C; Sbisa, E

    1982-01-01

    We have determined the nucleotide sequences of thirteen rat mt tRNA genes. The features of the primary and secondary structures of these tRNAs show that those for Gln, Ser, and f-Met resemble, while those for Lys, Cys, and Trp depart strikingly from the universal type. The remainder are slightly abnormal. Among many mammalian mt DNA sequences, those of mt tRNA genes are highly conserved, thus suggesting for those genes an additional, perhaps regulatory, function. A simple evolutionary relationship between the tRNAs of animal mitochondria and those of eukaryotic cytoplasm, of lower eukaryotic mitochondria or of prokaryotes, is not evident owing to the extreme divergence of the tRNA sequences in the two groups. However, a slightly higher homology does exist between a few animal mt tRNAs and those from prokaryotes or from lower eukaryotic mitochondria. PMID:7099963

  5. AREM: Aligning Short Reads from ChIP-Sequencing by Expectation Maximization

    NASA Astrophysics Data System (ADS)

    Newkirk, Daniel; Biesinger, Jacob; Chon, Alvin; Yokomori, Kyoko; Xie, Xiaohui

    High-throughput sequencing coupled to chromatin immunoprecipitation (ChIP-Seq) is widely used in characterizing genome-wide binding patterns of transcription factors, cofactors, chromatin modifiers, and other DNA binding proteins. A key step in ChIP-Seq data analysis is to map short reads from high-throughput sequencing to a reference genome and identify peak regions enriched with short reads. Although several methods have been proposed for ChIP-Seq analysis, most existing methods only consider reads that can be uniquely placed in the reference genome, and therefore have low power for detecting peaks located within repeat sequences. Here we introduce a probabilistic approach for ChIP-Seq data analysis which utilizes all reads, providing a truly genome-wide view of binding patterns. Reads are modeled using a mixture model corresponding to K enriched regions and a null genomic background. We use maximum likelihood to estimate the locations of the enriched regions, and implement an expectation-maximization (E-M) algorithm, called AREM (aligning reads by expectation maximization), to update the alignment probabilities of each read to different genomic locations. We apply the algorithm to identify genome-wide binding events of two proteins: Rad21, a component of cohesin and a key factor involved in chromatid cohesion, and Srebp-1, a transcription factor important for lipid/cholesterol homeostasis. Using AREM, we were able to identify 19,935 Rad21 peaks and 1,748 Srebp-1 peaks in the mouse genome with high confidence, including 1,517 (7.6%) Rad21 peaks and 227 (13%) Srebp-1 peaks that were missed using only uniquely mapped reads. The open source implementation of our algorithm is available at http://sourceforge.net/projects/arem

  6. Nucleotide sequence of Zygosaccharomyces bailii virus Z: Evidence for +1 programmed ribosomal frameshifting and for assignment to family Amalgaviridae.

    PubMed

    Depierreux, Delphine; Vong, Minh; Nibert, Max L

    2016-06-01

    Zygosaccharomyces bailii virus Z (ZbV-Z) is a monosegmented dsRNA virus that infects the yeast Zygosaccharomyces bailii and remains unclassified to date despite its discovery >20years ago. The previously reported nucleotide sequence of ZbV-Z (GenBank AF224490) encompasses two nonoverlapping long ORFs: upstream ORF1 encoding the putative coat protein and downstream ORF2 encoding the RNA-dependent RNA polymerase (RdRp). The lack of overlap between these ORFs raises the question of how the downstream ORF is translated. After examining the previous sequence of ZbV-Z, we predicted that it contains at least one sequencing error to explain the nonoverlapping ORFs, and hence we redetermined the nucleotide sequence of ZbV-Z, derived from the same isolate of Z. bailii as previously studied, to address this prediction. The key finding from our new sequence, which includes several insertions, deletions, and substitutions relative to the previous one, is that ORF2 in fact overlaps ORF1 in the +1 frame. Moreover, a proposed sequence motif for +1 programmed ribosomal frameshifting, previously noted in influenza A viruses, plant amalgaviruses, and others, is also present in the newly identified ORF1-ORF2 overlap region of ZbV-Z. Phylogenetic analyses provided evidence that ZbV-Z represents a distinct taxon most closely related to plant amalgaviruses (genus Amalgavirus, family Amalgaviridae). We conclude that ZbV-Z is the prototype of a new species, which we propose to assign as type species of a new genus of monosegmented dsRNA mycoviruses in family Amalgaviridae. Comparisons involving other unclassified mycoviruses with RdRps apparently related to those of plant amalgaviruses, and having either mono- or bisegmented dsRNA genomes, are also discussed. PMID:26951859

  7. Partition enrichment of nucleotide sequences (PINS)--a generally applicable, sequence based method for enrichment of complex DNA samples.

    PubMed

    Kvist, Thomas; Sondt-Marcussen, Line; Mikkelsen, Marie Just

    2014-01-01

    The dwindling cost of DNA sequencing is driving transformative changes in various biological disciplines including medicine, thus resulting in an increased need for routine sequencing. Preparation of samples suitable for sequencing is the starting point of any practical application, but enrichment of the target sequence over background DNA is often laborious and of limited sensitivity thereby limiting the usefulness of sequencing. The present paper describes a new method, Probability directed Isolation of Nucleic acid Sequences (PINS), for enrichment of DNA, enabling the sequencing of a large DNA region surrounding a small known sequence. A 275,000 fold enrichment of a target DNA sample containing integrated human papilloma virus is demonstrated. Specifically, a sample containing 0.0028 copies of target sequence per ng of total DNA was enriched to 786 copies per ng. The starting concentration of 0.0028 target copies per ng corresponds to one copy of target in a background of 100,000 complete human genomes. The enriched sample was subsequently amplified using rapid genome walking and the resulting DNA sequence revealed not only the sequence of a the truncated virus, but also 1026 base pairs 5' and 50 base pairs 3' to the integration site in chromosome 8. The demonstrated enrichment method is extremely sensitive and selective and requires only minimal knowledge of the sequence to be enriched and will therefore enable sequencing where the target concentration relative to background is too low to allow the use of other sample preparation methods or where significant parts of the target sequence is unknown. PMID:25203653

  8. Partition Enrichment of Nucleotide Sequences (PINS) - A Generally Applicable, Sequence Based Method for Enrichment of Complex DNA Samples

    PubMed Central

    Kvist, Thomas; Sondt-Marcussen, Line; Mikkelsen, Marie Just

    2014-01-01

    The dwindling cost of DNA sequencing is driving transformative changes in various biological disciplines including medicine, thus resulting in an increased need for routine sequencing. Preparation of samples suitable for sequencing is the starting point of any practical application, but enrichment of the target sequence over background DNA is often laborious and of limited sensitivity thereby limiting the usefulness of sequencing. The present paper describes a new method, Probability directed Isolation of Nucleic acid Sequences (PINS), for enrichment of DNA, enabling the sequencing of a large DNA region surrounding a small known sequence. A 275,000 fold enrichment of a target DNA sample containing integrated human papilloma virus is demonstrated. Specifically, a sample containing 0.0028 copies of target sequence per ng of total DNA was enriched to 786 copies per ng. The starting concentration of 0.0028 target copies per ng corresponds to one copy of target in a background of 100,000 complete human genomes. The enriched sample was subsequently amplified using rapid genome walking and the resulting DNA sequence revealed not only the sequence of a the truncated virus, but also 1026 base pairs 5′ and 50 base pairs 3′ to the integration site in chromosome 8. The demonstrated enrichment method is extremely sensitive and selective and requires only minimal knowledge of the sequence to be enriched and will therefore enable sequencing where the target concentration relative to background is too low to allow the use of other sample preparation methods or where significant parts of the target sequence is unknown. PMID:25203653

  9. Nucleotide sequence and genome organization of Dweet mottle virus and its relationship to members of the family Betaflexiviridae.

    PubMed

    Hajeri, Subhas; Ramadugu, Chandrika; Keremane, Manjunath; Vidalakis, Georgios; Lee, Richard

    2010-09-01

    The nucleotide sequence of Dweet mottle virus (DMV) was determined and compared to sequences of members of the families Alphaflexiviridae and Betaflexiviridae. The DMV genome has 8,747 nucleotides (nt) excluding the 3' poly-(A) tail. DMV genomic RNA contains three putative open reading frames (ORFs) and untranslated regions of 73 nt at the 5' and 541 nt at 3' termini. ORF1 potentially encoding a 227.48-kDa polyprotein, which has methyltransferase, oxygenase, endopeptidase, helicase, and RNA-dependent RNA polymerase (RdRP) domains. ORF2 encodes a movement protein of 40.25 kDa, while ORF3 encodes a coat protein of 40.69 kDa. Protein database searches showed 98-99% matches of DMV ORFs with citrus leaf blotch virus (CLBV) sequences. Phylogenetic analysis based on the RdRP core domain revealed that DMV is closely related to CLBV as a member of the genus Citrivirus. DMV did not satisfy the molecular criteria for demarcation of an independent species within the genus Citrivirus, family Betaflexiviridae, and hence, DMV can be considered a CLBV isolate. PMID:20644968

  10. Simultaneous alignment and folding of 28S rRNA sequences uncovers phylogenetic signal in structure variation.

    PubMed

    Letsch, Harald O; Greve, Carola; Kück, Patrick; Fleck, Günther; Stocsits, Roman R; Misof, Bernhard

    2009-12-01

    Secondary structure models of mitochondrial and nuclear (r)RNA sequences are frequently applied to aid the alignment of these molecules in phylogenetic analyses. Additionally, it is often speculated that structure variation of (r)RNA sequences might profitably be used as phylogenetic markers. The benefit of these approaches depends on the reliability of structure models. We used a recently developed approach to show that reliable inference of large (r)RNA secondary structures as a prerequisite of simultaneous sequence and structure alignment is feasible. The approach iteratively establishes local structure constraints of each sequence and infers fully folded individual structures by constrained MFE optimization. A comparison of structure edit distances of individual constraints and fully folded structures showed pronounced phylogenetic signal in fully folded structures. As model sequences we characterized secondary structures of 28S rRNA sequences of selected insects and examined their phylogenetic signal according to established phylogenetic hypotheses. PMID:19654047

  11. Nucleotide sequence, transcription and phylogeny of the gene encoding the superoxide dismutase of Sulfolobus acidocaldarius.

    PubMed

    Klenk, H P; Schleper, C; Schwass, V; Brudler, R

    1993-07-18

    The gene encoding the superoxide dismutase (SOD) of the thermophilic archaeon Sulfolobus acidocaldarius has been isolated and sequenced. Both the start site and the termination sites of the corresponding transcript were mapped. The deduced amino acid sequence of the protein is very similar to the sequence of manganese- or iron-containing SODs. Phylogenetic sequence analysis corroborated the monophyletic nature of the archaeal domain. PMID:8334170

  12. Complete Nucleotide Sequence of cfr-Carrying IncX4 Plasmid pSD11 from Escherichia coli

    PubMed Central

    Sun, Jian; Deng, Hui; Li, Liang; Chen, Mu-Ya; Fang, Liang-Xing; Yang, Qiu-E

    2014-01-01

    We report the complete nucleotide sequence of a plasmid carrying the multiresistance gene cfr. This plasmid was isolated from an Escherichia coli strain of swine origin in 2011. This 37,672-bp plasmid, pSD11, had an IncX4 backbone similar to those of the IncX4 plasmids obtained from the United States and Australia, in which the cfr gene was flanked by two copies of IS26 and a truncated Tn1331 was inserted. PMID:25403661

  13. DNA sequencing by a single molecule detection of labeled nucleotides sequentially cleaved from a single strand of DNA

    SciTech Connect

    Goodwin, P.M.; Schecker, J.A.; Wilkerson, C.W.; Hammond, M.L.; Ambrose, W.P.; Jett, J.H.; Martin, J.C.; Marrone, B.L.; Keller, R.A. ); Haces, A.; Shih, P.J.; Harding, J.D. )

    1993-01-01

    We are developing a laser-based technique for the rapid sequencing of large DNA fragments (several kb in size) at a rate of 100 to 1000 bases per second. Our approach relies on fluorescent labeling of the bases in a single fragment of DNA, attachment of this labeled DNA fragment to a support, movement of the supported DNA into a flowing sample stream, sequential cleavage of the end nucleotide from the DNA fragment with an exonuclease, and detection of the individual fluorescently labeled bases by laser-induced fluorescence.

  14. DNA sequencing by a single molecule detection of labeled nucleotides sequentially cleaved from a single strand of DNA

    SciTech Connect

    Goodwin, P.M.; Schecker, J.A.; Wilkerson, C.W.; Hammond, M.L.; Ambrose, W.P.; Jett, J.H.; Martin, J.C.; Marrone, B.L.; Keller, R.A.; Haces, A.; Shih, P.J.; Harding, J.D.

    1993-02-01

    We are developing a laser-based technique for the rapid sequencing of large DNA fragments (several kb in size) at a rate of 100 to 1000 bases per second. Our approach relies on fluorescent labeling of the bases in a single fragment of DNA, attachment of this labeled DNA fragment to a support, movement of the supported DNA into a flowing sample stream, sequential cleavage of the end nucleotide from the DNA fragment with an exonuclease, and detection of the individual fluorescently labeled bases by laser-induced fluorescence.

  15. Analysis of a nucleotide-binding site of 5-lipoxygenase by affinity labelling: binding characteristics and amino acid sequences.

    PubMed Central

    Zhang, Y Y; Hammarberg, T; Radmark, O; Samuelsson, B; Ng, C F; Funk, C D; Loscalzo, J

    2000-01-01

    5-Lipoxygenase (5LO) catalyses the first two steps in the biosynthesis of leukotrienes, which are inflammatory mediators derived from arachidonic acid. 5LO activity is stimulated by ATP; however, a consensus ATP-binding site or nucleotide-binding site has not been found in its protein sequence. In the present study, affinity and photoaffinity labelling of 5LO with 5'-p-fluorosulphonylbenzoyladenosine (FSBA) and 2-azido-ATP showed that 5LO bound to the ATP analogues quantitatively and specifically and that the incorporation of either analogue inhibited ATP stimulation of 5LO activity. The stoichiometry of the labelling was 1.4 mol of FSBA/mol of 5LO (of which ATP competed with 1 mol/mol) or 0.94 mol of 2-azido-ATP/mol of 5LO (of which ATP competed with 0.77 mol/mol). Labelling with FSBA prevented further labelling with 2-azido-ATP, indicating that the same binding site was occupied by both analogues. Other nucleotides (ADP, AMP, GTP, CTP and UTP) also competed with 2-azido-ATP labelling, suggesting that the site was a general nucleotide-binding site rather than a strict ATP-binding site. Ca(2+), which also stimulates 5LO activity, had no effect on the labelling of the nucleotide-binding site. Digestion with trypsin and peptide sequencing showed that two fragments of 5LO were labelled by 2-azido-ATP. These fragments correspond to residues 73-83 (KYWLNDDWYLK, in single-letter amino acid code) and 193-209 (FMHMFQSSWNDFADFEK) in the 5LO sequence. Trp-75 and Trp-201 in these peptides were modified by the labelling, suggesting that they were immediately adjacent to the C-2 position of the adenine ring of ATP. Given the stoichiometry of the labelling, the two peptide sequences of 5LO were probably near each other in the enzyme's tertiary structure, composing or surrounding the ATP-binding site of 5LO. PMID:11042125

  16. Nucleotide sequence neighbouring a late modified guanylic residue within the 28S ribosomal RNA of several eukaryotic cells.

    PubMed Central

    Eladari, M E; Hampe, A; Galibert, F

    1977-01-01

    The nucleotide sequence of a particular T1 oligonucleotide found in 41S and 28S RNAs of several cellular cell lines (human, mouse, rat and chicken fibroblast) but absent in 45S ribosomal RNA has been deduced. Its primary structure : A-U-U*-G*-psi-U-C-A-C-C-C-A-C-U-A-A-U-A-Gp shows the presence of a modified G residue which explains the existence of this oligonucleotide in the T1 fingerprint of 41S RNA and 28S. Its absence on the 45S RNA T1 fingerprint is accounted for by a late modification. Images PMID:561392

  17. Nucleotide sequence of the 3'-terminal region of the genome confirms that pea mosaic virus is a strain of bean yellow mosaic potyvirus.

    PubMed

    Xiao, X W; Frenkel, M J; Ward, C W; Shukla, D D

    1994-01-01

    The 1,035 nucleotides at the 3'end of the I strain of pea mosaic potyvirus (PMV-I) genomic RNA, encoding the coat protein, have been cloned and sequenced. A comparison of the derived coat protein sequence with those of the bean yellow mosaic virus (BYMV) strains, CS, S, D and GDD, indicates that PMV-I is a strain of BYMV. Sequence comparisons and hybridisation studies using the 3'-noncoding region support this classification. The nucleotide and protein sequence data also suggest that PMV-I and BYMV-CS form one subset of BYMV strains while the other three strains form another. PMID:8031241

  18. PerPlot & PerScan: tools for analysis of DNA curvature-related periodicity in genomic nucleotide sequences

    PubMed Central

    2011-01-01

    Background Periodic spacing of short adenine or thymine runs phased with DNA helical period of ~10.5 bp is associated with intrinsic DNA curvature and deformability, which play important roles in DNA-protein interactions and in the organization of chromosomes in both eukaryotes and prokaryotes. Local differences in DNA sequence periodicity have been linked to differences in gene expression in some organisms. Despite the significance of these periodic patterns, there are virtually no publicly accessible tools for their analysis. Results We present novel tools suitable for assessments of DNA curvature-related sequence periodicity in nucleotide sequences at the genome scale. Utility of the present software is demonstrated on a comparison of sequence periodicities in the genomes of Haemophilus influenzae, Methanocaldococcus jannaschii, Saccharomyces cerevisiae, and Arabidopsis thaliana. The software can be accessed through a web interface and the programs are also available for download. Conclusions The present software is suitable for comparing DNA curvature-related sequence periodicity among different genomes as well as for analysis of intrachromosomal heterogeneity of the sequence periodicity. It provides a quick and convenient way to detect anomalous regions of chromosomes that could have unusual structural and functional properties and/or distinct evolutionary history. PMID:22587738

  19. The nucleotide composition of the spacer sequence influences the expression yield of heterologously expressed genes in Bacillus subtilis.

    PubMed

    Liebeton, Klaus; Lengefeld, Jette; Eck, Jürgen

    2014-12-10

    Bacillus subtilis is a commonly used host for the heterologous expression of genes in academia and industry. Many factors are known to influence the expression yield in this organism e.g. the complementarity between the Shine-Dalgarno sequence (SD) and the 16S-rRNA or secondary structures in the translation initiation region of the transcript. In this study, we analysed the impact of the nucleotide composition between the SD sequence and the start codon (the spacer sequence) on the expression yield. We demonstrated that a polyadenylate-moiety spacer sequence moderately increases the expression level of laccase CotA from B. subtilis. By screening a library of artificially generated spacer variants, we identified clones with greatly increased expression levels of two model enzymes, the laccase CotA from B. subtilis (11 fold) and the metagenome derived protease H149 (30 fold). Furthermore, we demonstrated that the effect of the spacer sequence is specific to the gene of interest. These results prove the high impact of the spacer sequence on the expression yield in B. subtilis. PMID:24997355

  20. The nucleotide sequence and genomic organization of Citrus leaf blotch virus: candidate type species for a new virus genus.

    PubMed

    Vives, M C; Galipienso, L; Navarro, L; Moreno, P; Guerri, J

    2001-08-15

    The complete nucleotide sequence of Citrus leaf blotch virus (CLBV) was determined. CLBV genomic RNA (gRNA) has 8747 nt, excluding the 3'-terminal poly(A) tail, and contains three open reading frames (ORFs) and untranslated regions (UTR) of 73 and 541 nucleotides at the 5' and 3' termini, respectively. ORF1 potentially encodes a 227.4-kDa polypeptide, which has methyltransferase, papain-like protease, helicase, and RNA-dependent RNA polymerase motifs. ORF2 encodes a 40.2-kDa polypeptide containing a motif characteristic of cell-to-cell movement proteins. The 40.7-kDa polypeptide encoded by ORF3 was identified as the coat protein. The genome organization of CLBV resembles that of viruses in the genus Trichovirus, but they differ in various aspects: (i) in trichoviruses ORF2 overlaps ORFs 1 and 3, whereas in CLBV, ORFs 2 and 3 are separated and ORFs 1 and 2 overlap in one nucleotide; (ii) CLBV gRNA and CP are larger than those of trichoviruses; and (iii) the CLBV 3' UTR is larger than that of trichoviruses. Phylogenetic comparisons based on CP amino acid signatures clearly separates CLBV from trichoviruses. Also contrasting with trichoviruses, CLBV could not be transmitted to Chenopodium quinoa Willd. Considering these singularities, we propose that CLBV should be included in a new virus genus. PMID:11504557