Statistical properties of DNA sequences
NASA Technical Reports Server (NTRS)
Peng, C. K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Mantegna, R. N.; Simons, M.; Stanley, H. E.
1995-01-01
We review evidence supporting the idea that the DNA sequence in genes containing non-coding regions is correlated, and that the correlation is remarkably long range--indeed, nucleotides thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene. We resolve the problem of the "non-stationarity" feature of the sequence of base pairs by applying a new algorithm called detrended fluctuation analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and non-coding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to every DNA sequence (33301 coding and 29453 non-coding) in the entire GenBank database. Finally, we describe briefly some recent work showing that the non-coding sequences have certain statistical features in common with natural and artificial languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts. These statistical properties of non-coding sequences support the possibility that non-coding regions of DNA may carry biological information.
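The DFA procedure mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a purine/pyrimidine mapping for the DNA walk, non-overlapping boxes, and a linear detrend in each box. The function names `dna_walk`, `dfa_fluctuation`, and `dfa_exponent` are hypothetical.

```python
import math
import random

def dna_walk(seq):
    """Cumulative DNA walk: +1 for a pyrimidine (C/T), -1 for a purine (A/G).
    Other symbols (e.g. N) are skipped. The mapping is an assumption."""
    walk, total = [], 0
    for base in seq.upper():
        if base in "CT":
            total += 1
        elif base in "AG":
            total -= 1
        else:
            continue
        walk.append(total)
    return walk

def dfa_fluctuation(walk, n):
    """RMS deviation of the walk from a least-squares line fitted in each box of size n."""
    boxes = len(walk) // n
    x_mean = (n - 1) / 2
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sq_sum = 0.0
    for b in range(boxes):
        seg = walk[b * n:(b + 1) * n]
        y_mean = sum(seg) / n
        slope = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(seg)) / sxx
        sq_sum += sum((v - (y_mean + slope * (x - x_mean))) ** 2
                      for x, v in enumerate(seg))
    return math.sqrt(sq_sum / (boxes * n))

def dfa_exponent(walk, box_sizes):
    """Least-squares slope of log F(n) against log n."""
    pts = [(math.log(n), math.log(dfa_fluctuation(walk, n))) for n in box_sizes]
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    return (sum((px - mx) * (py - my) for px, py in pts)
            / sum((px - mx) ** 2 for px, _ in pts))

# Demo on an uncorrelated random sequence; the exponent should come out near 0.5.
random.seed(0)
random_seq = "".join(random.choice("ACGT") for _ in range(20000))
alpha = dfa_exponent(dna_walk(random_seq), [8, 16, 32, 64, 128])
print(round(alpha, 2))
```

An exponent noticeably above 0.5 over a wide range of box sizes would be the signature of the long-range correlation described in the abstract.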
Functional interrogation of non-coding DNA through CRISPR genome editing
Canver, Matthew C.; Bauer, Daniel E.; Orkin, Stuart H.
2017-01-01
Methodologies to interrogate non-coding regions have lagged behind coding regions despite comprising the vast majority of the genome. However, the rapid evolution of clustered regularly interspaced short palindromic repeats (CRISPR)-based genome editing has provided a multitude of novel techniques for laboratory investigation including significant contributions to the toolbox for studying non-coding DNA. CRISPR-mediated loss-of-function strategies rely on direct disruption of the underlying sequence or repression of transcription without modifying the targeted DNA sequence. CRISPR-mediated gain-of-function approaches similarly benefit from methods to alter the targeted sequence through integration of customized sequence into the genome as well as methods to activate transcription. Here we review CRISPR-based loss- and gain-of-function techniques for the interrogation of non-coding DNA. PMID:28288828
Functional interrogation of non-coding DNA through CRISPR genome editing.
Canver, Matthew C; Bauer, Daniel E; Orkin, Stuart H
2017-05-15
Methodologies to interrogate non-coding regions have lagged behind coding regions despite comprising the vast majority of the genome. However, the rapid evolution of clustered regularly interspaced short palindromic repeats (CRISPR)-based genome editing has provided a multitude of novel techniques for laboratory investigation including significant contributions to the toolbox for studying non-coding DNA. CRISPR-mediated loss-of-function strategies rely on direct disruption of the underlying sequence or repression of transcription without modifying the targeted DNA sequence. CRISPR-mediated gain-of-function approaches similarly benefit from methods to alter the targeted sequence through integration of customized sequence into the genome as well as methods to activate transcription. Here we review CRISPR-based loss- and gain-of-function techniques for the interrogation of non-coding DNA. Copyright © 2017 Elsevier Inc. All rights reserved.
2012-01-01
Background Detecting the borders between coding and non-coding regions is an essential step in genome annotation, and information entropy measures are useful for describing the signals in a genome sequence. However, the accuracy of previous entropy-based segmentation methods for finding borders still needs to be improved. Methods In this study, we first applied a new recursive entropic segmentation method to DNA sequences to obtain preliminary significant cuts. A 22-symbol alphabet is used to capture the differential composition of nucleotide doublets and stop codon patterns along three phases in both DNA strands. This process requires no prior training datasets. Results Compared with previous segmentation methods, the experimental results on three bacterial genomes, Rickettsia prowazekii, Borrelia burgdorferi and E. coli, show that our approach improves the accuracy of finding the borders between coding and non-coding regions in DNA sequences. Conclusions This paper presents a new segmentation method for prokaryotes based on Jensen-Rényi divergence with a 22-symbol alphabet. For three bacterial genomes, compared with the A12_JR method, our method raised the accuracy of finding the borders between protein-coding and non-coding regions in DNA sequences. PMID:23282225
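As a rough illustration of the divergence this abstract refers to, the sketch below scores a single cut point with a Jensen-Rényi divergence. It makes several simplifying assumptions: it works on the plain four-letter nucleotide alphabet rather than the paper's 22-symbol alphabet, uses Rényi order 0.5, weights each side by its length, and finds only one cut instead of recursing. The names `renyi_entropy`, `jensen_renyi`, and `best_cut` are hypothetical.

```python
import math
from collections import Counter

def renyi_entropy(counts, total, alpha=0.5):
    """Rényi entropy (in bits) of order alpha != 1 for a table of symbol counts."""
    return math.log2(sum((c / total) ** alpha for c in counts.values())) / (1 - alpha)

def jensen_renyi(left, right, alpha=0.5):
    """Entropy of the pooled composition minus the length-weighted entropies of the parts."""
    cl, cr = Counter(left), Counter(right)
    nl, nr = len(left), len(right)
    n = nl + nr
    pooled = renyi_entropy(cl + cr, n, alpha)
    return (pooled
            - (nl / n) * renyi_entropy(cl, nl, alpha)
            - (nr / n) * renyi_entropy(cr, nr, alpha))

def best_cut(seq, alpha=0.5, margin=1):
    """Position that maximises the divergence between seq[:i] and seq[i:].
    Naive O(n^2) scan, kept simple for clarity."""
    return max(range(margin, len(seq) - margin + 1),
               key=lambda i: jensen_renyi(seq[:i], seq[i:], alpha))
```

For example, `best_cut("A" * 30 + "C" * 30)` picks position 30, where the composition changes. A recursive segmentation would reapply the same search to each resulting segment while the divergence stays significant.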
Kress, W John; Erickson, David L
2007-06-06
A useful DNA barcode requires sufficient sequence variation to distinguish between species and ease of application across a broad range of taxa. Discovery of a DNA barcode for land plants has been limited by intrinsically lower rates of sequence evolution in plant genomes than that observed in animals. This low rate has complicated the trade-off in finding a locus that is universal and readily sequenced and has sufficiently high sequence divergence at the species-level. Here, a global plant DNA barcode system is evaluated by comparing universal application and degree of sequence divergence for nine putative barcode loci, including coding and non-coding regions, singly and in pairs across a phylogenetically diverse set of 48 genera (two species per genus). No single locus could discriminate among species in a pair in more than 79% of genera, whereas discrimination increased to nearly 88% when the non-coding trnH-psbA spacer was paired with one of three coding loci, including rbcL. In silico trials were conducted in which DNA sequences from GenBank were used to further evaluate the discriminatory power of a subset of these loci. These trials supported the earlier observation that trnH-psbA coupled with rbcL can correctly identify and discriminate among related species. A combination of the non-coding trnH-psbA spacer region and a portion of the coding rbcL gene is recommended as a two-locus global land plant barcode that provides the necessary universality and species discrimination.
Kress, W. John; Erickson, David L.
2007-01-01
Background A useful DNA barcode requires sufficient sequence variation to distinguish between species and ease of application across a broad range of taxa. Discovery of a DNA barcode for land plants has been limited by intrinsically lower rates of sequence evolution in plant genomes than that observed in animals. This low rate has complicated the trade-off in finding a locus that is universal and readily sequenced and has sufficiently high sequence divergence at the species-level. Methodology/Principal Findings Here, a global plant DNA barcode system is evaluated by comparing universal application and degree of sequence divergence for nine putative barcode loci, including coding and non-coding regions, singly and in pairs across a phylogenetically diverse set of 48 genera (two species per genus). No single locus could discriminate among species in a pair in more than 79% of genera, whereas discrimination increased to nearly 88% when the non-coding trnH-psbA spacer was paired with one of three coding loci, including rbcL. In silico trials were conducted in which DNA sequences from GenBank were used to further evaluate the discriminatory power of a subset of these loci. These trials supported the earlier observation that trnH-psbA coupled with rbcL can correctly identify and discriminate among related species. Conclusions/Significance A combination of the non-coding trnH-psbA spacer region and a portion of the coding rbcL gene is recommended as a two-locus global land plant barcode that provides the necessary universality and species discrimination. PMID:17551588
Qiu, Guo-Hua
2016-01-01
In this review, the protective function of the abundant non-coding DNA in the eukaryotic genome is discussed from the perspective of genome defense against exogenous nucleic acids. Peripheral non-coding DNA has been proposed to act as a bodyguard that protects the genome and the central protein-coding sequences from ionizing radiation-induced DNA damage. In the proposed mechanism of protection, the radicals generated by water radiolysis in the cytosol and IR energy are absorbed, blocked and/or reduced by peripheral heterochromatin; then, the DNA damage sites in the heterochromatin are removed and expelled from the nucleus to the cytoplasm through nuclear pore complexes, most likely through the formation of extrachromosomal circular DNA. To strengthen this hypothesis, this review summarizes the experimental evidence supporting the protective function of non-coding DNA against exogenous nucleic acids. Based on these data, I hypothesize herein about the presence of an additional line of defense formed by small RNAs in the cytosol in addition to their bodyguard protection mechanism in the nucleus. Therefore, exogenous nucleic acids may be initially inactivated in the cytosol by small RNAs generated from non-coding DNA via mechanisms similar to the prokaryotic CRISPR-Cas system. Exogenous nucleic acids may enter the nucleus, where some are absorbed and/or blocked by heterochromatin and others integrate into chromosomes. The integrated fragments and the sites of DNA damage are removed by repetitive non-coding DNA elements in the heterochromatin and excluded from the nucleus. Therefore, the normal eukaryotic genome and the central protein-coding sequences are triply protected by non-coding DNA against invasion by exogenous nucleic acids. This review provides evidence supporting the protective role of non-coding DNA in genome defense. Copyright © 2016 Elsevier B.V. All rights reserved.
Gene Identification Algorithms Using Exploratory Statistical Analysis of Periodicity
NASA Astrophysics Data System (ADS)
Mukherjee, Shashi Bajaj; Sen, Pradip Kumar
2010-10-01
Studying periodic patterns would be expected to be a standard line of attack for recognizing DNA sequences in gene identification and similar problems, yet surprisingly little significant work has been done in this direction. This paper studies the statistical properties of DNA sequences of complete genomes using a new technique. A DNA sequence is converted to a numeric sequence using various types of mappings, and a standard Fourier technique is applied to study the periodicity. Distinct statistical behaviour of the periodicity parameters is found in coding and non-coding sequences, which can be used to distinguish between these parts. Here DNA sequences of Drosophila melanogaster were analyzed with significant accuracy.
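One common way to turn this idea into numbers is sketched below. It assumes the binary indicator mapping (one 0/1 sequence per base), which is only one of the possible mappings the abstract alludes to. It compares the spectral power at period 3 with the average power over the other frequencies. The names `power_at` and `period3_ratio` are hypothetical, and the direct DFT is written out for clarity rather than speed.

```python
import cmath

def power_at(seq, k):
    """Spectral power at DFT index k, summed over the four base-indicator sequences."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * k * pos / n)
                    for pos, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total

def period3_ratio(seq):
    """Power at period 3 divided by the mean power over indices 1 .. n/2 - 1.
    Assumes a sequence of mixed composition, so the mean power is not zero."""
    n = len(seq) - len(seq) % 3          # trim so that n/3 is an exact DFT index
    seq = seq[:n].upper()
    peak = power_at(seq, n // 3)
    indices = range(1, n // 2)
    background = sum(power_at(seq, k) for k in indices) / len(indices)
    return peak / background
```

A ratio well above 1 suggests a codon-like three-base periodicity, which is typically stronger in protein-coding sequences than in non-coding ones.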
DNA rearrangements directed by non-coding RNAs in ciliates
Mochizuki, Kazufumi
2013-01-01
Extensive programmed rearrangement of DNA, including DNA elimination, chromosome fragmentation, and DNA descrambling, takes place in the newly developed macronucleus during the sexual reproduction of ciliated protozoa. Recent studies have revealed that two distant classes of ciliates use distinct types of non-coding RNAs to regulate such DNA rearrangement events. DNA elimination in Tetrahymena is regulated by small non-coding RNAs that are produced and utilized in an RNAi-related process. It has been proposed that the small RNAs produced from the micronuclear genome are used to identify eliminated DNA sequences by whole-genome comparison between the parental macronucleus and the micronucleus. In contrast, DNA descrambling in Oxytricha is guided by long non-coding RNAs that are produced from the parental macronuclear genome. These long RNAs are proposed to act as templates for the direct descrambling events that occur in the developing macronucleus. Both cases provide useful examples to study epigenetic chromatin regulation by non-coding RNAs. PMID:21956937
GATA: A graphic alignment tool for comparative sequence analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nix, David A.; Eisen, Michael B.
2005-01-01
Several problems exist with current methods used to align DNA sequences for comparative sequence analysis. Most dynamic programming algorithms assume that conserved sequence elements are collinear. This assumption appears valid when comparing orthologous protein coding sequences. Functional constraints on proteins provide strong selective pressure against sequence inversions, and minimize sequence duplications and feature shuffling. For non-coding sequences this collinearity assumption is often invalid. For example, enhancers contain clusters of transcription factor binding sites that change in number, orientation, and spacing during evolution yet the enhancer retains its activity. Dotplot analysis is often used to estimate non-coding sequence relatedness. Yet dot plots do not actually align sequences and thus cannot account well for base insertions or deletions. Moreover, they lack an adequate statistical framework for comparing sequence relatedness and are limited to pairwise comparisons. Lastly, dot plots and dynamic programming text outputs fail to provide an intuitive means for visualizing DNA alignments.
USDA-ARS's Scientific Manuscript database
Single-nucleotide polymorphism (SNP) markers are by far the most common form of DNA polymorphism in a genome. The objectives of this study were to discover SNPs in common bean comparing sequences from coding and non-coding regions obtained from GenBank and genomic DNA and to compare sequencing resu...
Hall, L; Laird, J E; Craig, R K
1984-01-01
Nucleotide sequence analysis of cloned guinea-pig casein B cDNA sequences has identified two casein B variants related to the bovine and rat alpha s1 caseins. Amino acid homology was largely confined to the known bovine or predicted rat phosphorylation sites and within the 'signal' precursor sequence. Comparison of the deduced nucleotide sequence of the guinea-pig and rat alpha s1 casein mRNA species showed greater sequence conservation in the non-coding than in the coding regions, suggesting a functional and possibly regulatory role for the non-coding regions of casein mRNA. The results provide insight into the evolution of the casein genes, and raise questions as to the role of conserved nucleotide sequences within the non-coding regions of mRNA species. PMID:6548375
An algebraic hypothesis about the primeval genetic code architecture.
Sánchez, Robersy; Grau, Ricardo
2009-09-01
A plausible architecture of an ancient genetic code is derived from an extended base triplet vector space over the Galois field of the extended base alphabet {D,A,C,G,U}, where symbol D represents one or more hypothetical bases with unspecific pairings. We hypothesized that the high degeneration of a primeval genetic code with five bases and the gradual origin and improvement of a primeval DNA repair system could make possible the transition from ancient to modern genetic codes. Our results suggest that the Watson-Crick base pairings G≡C and A=U and the non-specific base pairing of the hypothetical ancestral base D used to define the sum and product operations are enough features to determine the coding constraints of the primeval and the modern genetic code, as well as the transition from the former to the latter. Geometrical and algebraic properties of this vector space reveal that the present codon assignment of the standard genetic code could be induced from a primeval codon assignment. Besides, the Fourier spectrum of the extended DNA genome sequences derived from the multiple sequence alignment suggests that the so-called period-3 property of the present coding DNA sequences could also exist in the ancient coding DNA sequences. The phylogenetic analyses achieved with metrics defined in the N-dimensional vector space (B(3))(N) of DNA sequences and with the new evolutionary model presented here also suggest that an ancient DNA coding sequence with five or more bases does not contradict the expected evolutionary history.
Pietan, Lucas L.; Spradling, Theresa A.
2016-01-01
In animals, mitochondrial DNA (mtDNA) typically occurs as a single circular chromosome with 13 protein-coding genes and 22 tRNA genes. The various species of lice examined previously, however, have shown mitochondrial genome rearrangements with a range of chromosome sizes and numbers. Our research demonstrates that the mitochondrial genomes of two species of chewing lice found on pocket gophers, Geomydoecus aurei and Thomomydoecus minor, are fragmented with the 1,536 base-pair (bp) cytochrome-oxidase subunit I (cox1) gene occurring as the only protein-coding gene on a 1,916–1,964 bp minicircular chromosome in the two species, respectively. The cox1 gene of T. minor begins with an atypical start codon, while that of G. aurei does not. Components of the non-protein coding sequence of G. aurei and T. minor include a tRNA (isoleucine) gene, inverted repeat sequences consistent with origins of replication, and an additional non-coding region that is smaller than the non-coding sequence of other lice with such fragmented mitochondrial genomes. Sequences of cox1 minichromosome clones for each species reveal extensive length and sequence heteroplasmy in both coding and noncoding regions. The highly variable non-gene regions of G. aurei and T. minor have little sequence similarity with one another except for a 19-bp region of phylogenetically conserved sequence with unknown function. PMID:27589589
Liu, Huitao; Cui, Peng; Zhan, Kehui; Lin, Qiang; Zhuo, Guoyin; Guo, Xiaoli; Ding, Feng; Yang, Wenlong; Liu, Dongcheng; Hu, Songnian; Yu, Jun; Zhang, Aimin
2011-03-29
Plant mitochondria, semiautonomous organelles that function as manufacturers of cellular ATP, have their own genome that has a slow rate of evolution and rapid rearrangement. Cytoplasmic male sterility (CMS), a common phenotype in higher plants, is closely associated with rearrangements in mitochondrial DNA (mtDNA), and is widely used to produce F1 hybrid seeds in a variety of valuable crop species. Novel chimeric genes deduced from mtDNA rearrangements causing CMS have been identified in several plants, such as rice, sunflower, pepper, and rapeseed, but there are very few reports about mtDNA rearrangements in wheat. In the present work, we describe the mitochondrial genome of a wheat K-type CMS line and compare it with its maintainer line. The complete mtDNA sequence of a wheat K-type (with cytoplasm of Aegilops kotschyi) CMS line, Ks3, was assembled into a master circle (MC) molecule of 647,559 bp and found to harbor 34 known protein-coding genes, three rRNAs (18 S, 26 S, and 5 S rRNAs), and 16 different tRNAs. Compared to our previously published sequence of a K-type maintainer line, Km3, we detected Ks3-specific mtDNA (> 100 bp, 11.38%) and repeats (> 100 bp, 29 units) as well as genes that are unique to each line: rpl5 was missing in Ks3 and trnH was absent from Km3. We also defined 32 single nucleotide polymorphisms (SNPs) in 13 protein-coding, albeit functionally irrelevant, genes, and predicted 22 unique ORFs in Ks3, representing potential candidates for K-type CMS. All these sequence variations are candidates for involvement in CMS. A comparative analysis of the mtDNA of several angiosperms, including those from Ks3, Km3, rice, maize, Arabidopsis thaliana, and rapeseed, showed that non-coding sequences of higher plants had mostly divergent multiple reorganizations during the mtDNA evolution of higher plants. 
The complete mitochondrial genome of the wheat K-type CMS line Ks3 is very different from that of its maintainer line Km3, especially in non-coding sequences. Sequence rearrangement has produced novel chimeric ORFs, which may be candidate genes for CMS. Comparative analysis of several angiosperm mtDNAs indicated that non-coding sequences are the most frequently reorganized during mtDNA evolution in higher plants.
Tramontano, A; Macchiato, M F
1986-01-01
An algorithm to determine the probability that a reading frame codifies for a protein is presented. It is based on the results of our previous studies on the thermodynamic characteristics of a translated reading frame. We also develop a prediction procedure to distinguish between coding and non-coding reading frames. The procedure is based on the characteristics of the putative product of the DNA sequence and not on periodicity characteristics of the sequence, so the prediction is not biased by the presence of overlapping translated reading frames or by the presence of translated reading frames on the complementary DNA strand. PMID:3753761
VaDiR: an integrated approach to Variant Detection in RNA.
Neums, Lisa; Suenaga, Seiji; Beyerlein, Peter; Anders, Sara; Koestler, Devin; Mariani, Andrea; Chien, Jeremy
2018-02-01
Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered. But the cost associated with it is currently limiting its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA (VaDiR) that integrates 3 variant callers, namely: SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.
Cost-effective sequencing of full-length cDNA clones powered by a de novo-reference hybrid assembly.
Kuroshu, Reginaldo M; Watanabe, Junichi; Sugano, Sumio; Morishita, Shinichi; Suzuki, Yutaka; Kasahara, Masahiro
2010-05-07
Sequencing full-length cDNA clones is important to determine gene structures including alternative splice forms, and provides valuable resources for experimental analyses to reveal the biological functions of coded proteins. However, previous approaches for sequencing cDNA clones were expensive or time-consuming, and therefore, a fast and efficient sequencing approach was demanded. We developed a program, MuSICA 2, that assembles millions of short (36-nucleotide) reads collected from a single flow cell lane of Illumina Genome Analyzer to shotgun-sequence approximately 800 human full-length cDNA clones. MuSICA 2 performs a hybrid assembly in which an external de novo assembler is run first and the result is then improved by reference alignment of shotgun reads. We compared the MuSICA 2 assembly with 200 pooled full-length cDNA clones finished independently by the conventional primer-walking using Sanger sequencers. The exon-intron structure of the coding sequence was correct for more than 95% of the clones with coding sequence annotation when we excluded cDNA clones insufficiently represented in the shotgun library due to PCR failure (42 out of 200 clones excluded), and the nucleotide-level accuracy of coding sequences of those correct clones was over 99.99%. We also applied MuSICA 2 to full-length cDNA clones from Toxoplasma gondii, to confirm that its ability was competent even for non-human species. The entire sequencing and shotgun assembly takes less than 1 week and the consumables cost only approximately US$3 per clone, demonstrating a significant advantage over previous approaches.
Cost-Effective Sequencing of Full-Length cDNA Clones Powered by a De Novo-Reference Hybrid Assembly
Sugano, Sumio; Morishita, Shinichi; Suzuki, Yutaka
2010-01-01
Background Sequencing full-length cDNA clones is important to determine gene structures including alternative splice forms, and provides valuable resources for experimental analyses to reveal the biological functions of coded proteins. However, previous approaches for sequencing cDNA clones were expensive or time-consuming, and therefore, a fast and efficient sequencing approach was demanded. Methodology We developed a program, MuSICA 2, that assembles millions of short (36-nucleotide) reads collected from a single flow cell lane of Illumina Genome Analyzer to shotgun-sequence ∼800 human full-length cDNA clones. MuSICA 2 performs a hybrid assembly in which an external de novo assembler is run first and the result is then improved by reference alignment of shotgun reads. We compared the MuSICA 2 assembly with 200 pooled full-length cDNA clones finished independently by the conventional primer-walking using Sanger sequencers. The exon-intron structure of the coding sequence was correct for more than 95% of the clones with coding sequence annotation when we excluded cDNA clones insufficiently represented in the shotgun library due to PCR failure (42 out of 200 clones excluded), and the nucleotide-level accuracy of coding sequences of those correct clones was over 99.99%. We also applied MuSICA 2 to full-length cDNA clones from Toxoplasma gondii, to confirm that its ability was competent even for non-human species. Conclusions The entire sequencing and shotgun assembly takes less than 1 week and the consumables cost only ∼US$3 per clone, demonstrating a significant advantage over previous approaches. PMID:20479877
Statistical and linguistic features of DNA sequences
NASA Technical Reports Server (NTRS)
Havlin, S.; Buldyrev, S. V.; Goldberger, A. L.; Mantegna, R. N.; Peng, C. K.; Simons, M.; Stanley, H. E.
1995-01-01
We present evidence supporting the idea that the DNA sequence in genes containing noncoding regions is correlated, and that the correlation is remarkably long range--indeed, base pairs thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene. We resolve the problem of the "non-stationary" feature of the sequence of base pairs by applying a new algorithm called Detrended Fluctuation Analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and noncoding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to all eukaryotic DNA sequences (33 301 coding and 29 453 noncoding) in the entire GenBank database. We describe a simple model to account for the presence of long-range power-law correlations which is based upon a generalization of the classic Levy walk. Finally, we describe briefly some recent work showing that the noncoding sequences have certain statistical features in common with natural languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts, and the Shannon approach to quantifying the "redundancy" of a linguistic text in terms of a measurable entropy function. We suggest that noncoding regions in plants and invertebrates may display a smaller entropy and larger redundancy than coding regions, further supporting the possibility that noncoding regions of DNA may carry biological information.
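The Zipf and Shannon analyses mentioned in this abstract can be illustrated with a small sketch. It assumes that "words" are overlapping n-tuples of bases and that redundancy is measured as one minus the ratio of the observed n-tuple entropy to its maximum of 2n bits. These are illustrative choices, and the names `word_counts`, `zipf_table`, and `redundancy` are hypothetical.

```python
import math
from collections import Counter

def word_counts(seq, n):
    """Counts of overlapping n-tuples ("words") in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def zipf_table(seq, n):
    """(rank, word, count) rows, most frequent word first, as in a Zipf plot."""
    ranked = word_counts(seq, n).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]

def redundancy(seq, n):
    """1 - H / H_max, where H is the Shannon entropy of n-tuples in bits
    and H_max = 2n bits for a four-letter alphabet."""
    counts = word_counts(seq, n)
    total = sum(counts.values())
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return 1 - entropy / (2 * n)
```

Plotting log count against log rank from `zipf_table` gives the Zipf curve. A redundancy of 0 corresponds to a uniform word distribution, and values closer to 1 indicate a more repetitive text.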
Recurrence time statistics: versatile tools for genomic DNA sequence analysis.
Cao, Yinhe; Tung, Wen-Wen; Gao, J B
2004-01-01
With the completion of the human and a few model organisms' genomes, and the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cerevisiae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient. It allows us to carry out analysis of genomes on the whole genomic scale by a PC.
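A recurrence-time histogram of the kind this abstract describes might be computed along the following lines. This sketch assumes the simplest definition, the distance between successive occurrences of the same word of length w, which may differ from the paper's exact statistic. The name `recurrence_times` is hypothetical.

```python
from collections import Counter

def recurrence_times(seq, w):
    """Histogram of distances between successive occurrences of each length-w word."""
    last_seen = {}            # word -> index of its most recent occurrence
    histogram = Counter()     # recurrence time -> number of times observed
    for i in range(len(seq) - w + 1):
        word = seq[i:i + w]
        if word in last_seen:
            histogram[i - last_seen[word]] += 1
        last_seen[word] = i
    return histogram
```

Tandem repeats show up as sharp peaks at the repeat length, and an excess of recurrence times that are multiples of 3 is one possible basis for a codon index.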
Ancient DNA sequence revealed by error-correcting codes.
Brandão, Marcelo M; Spoladore, Larissa; Faria, Luzinete C B; Rocha, Andréa S L; Silva-Filho, Marcio C; Palazzo, Reginaldo
2015-07-10
A previously described DNA sequence generator algorithm (DNA-SGA) using error-correcting codes has been employed as a computational tool to address the evolutionary pathway of the genetic code. The code-generated sequence alignment demonstrated that a residue mutation revealed by the code can be found in the same position in sequences of distantly related taxa. Furthermore, the code-generated sequences do not promote amino acid changes in the deviant genomes through codon reassignment. A Bayesian evolutionary analysis of both code-generated and homologous sequences of the Arabidopsis thaliana malate dehydrogenase gene indicates an approximately 1 MYA divergence time from the MDH code-generated sequence node to its paralogous sequences. The DNA-SGA helps to determine the plesiomorphic state of DNA sequences because a single nucleotide alteration often occurs in distantly related taxa and can be found in the alternative codon patterns of noncanonical genetic codes. As a consequence, the algorithm may reveal an earlier stage of the evolution of the standard code.
Ancient DNA sequence revealed by error-correcting codes
Brandão, Marcelo M.; Spoladore, Larissa; Faria, Luzinete C. B.; Rocha, Andréa S. L.; Silva-Filho, Marcio C.; Palazzo, Reginaldo
2015-01-01
A previously described DNA sequence generator algorithm (DNA-SGA) using error-correcting codes has been employed as a computational tool to address the evolutionary pathway of the genetic code. The code-generated sequence alignment demonstrated that a residue mutation revealed by the code can be found in the same position in sequences of distantly related taxa. Furthermore, the code-generated sequences do not promote amino acid changes in the deviant genomes through codon reassignment. A Bayesian evolutionary analysis of both code-generated and homologous sequences of the Arabidopsis thaliana malate dehydrogenase gene indicates an approximately 1 MYA divergence time from the MDH code-generated sequence node to its paralogous sequences. The DNA-SGA helps to determine the plesiomorphic state of DNA sequences because a single nucleotide alteration often occurs in distantly related taxa and can be found in the alternative codon patterns of noncanonical genetic codes. As a consequence, the algorithm may reveal an earlier stage of the evolution of the standard code. PMID:26159228
NASA Technical Reports Server (NTRS)
Chang, Dong Kyung; Metzgar, David; Wills, Christopher; Boland, C. Richard
2003-01-01
All "minor" components of the human DNA mismatch repair (MMR) system-MSH3, MSH6, PMS2, and the recently discovered MLH3-contain mononucleotide microsatellites in their coding sequences. This intriguing finding contrasts with the situation found in the major components of the DNA MMR system-MSH2 and MLH1-and, in fact, most human genes. Although eukaryotic genomes are rich in microsatellites, non-triplet microsatellites are rare in coding regions. The recurring presence of exonal mononucleotide repeat sequences within a single family of human genes would therefore be considered exceptional.
The full mitochondrial genome sequence of Raillietina tetragona from chicken (Cestoda: Davaineidae).
Liang, Jian-Ying; Lin, Rui-Qing
2016-11-01
In the present study, the complete mitochondrial DNA (mtDNA) of Raillietina tetragona was sequenced, and its gene content and genome organization were compared with those of other tapeworms. The complete mt genome sequence of R. tetragona is 14,444 bp in length. It contains 12 protein-coding genes, two ribosomal RNA genes, 22 transfer RNA genes, and two non-coding regions. All genes are transcribed in the same direction and have a nucleotide composition high in A and T. The A + T content of the complete mt genome is 71.4% for R. tetragona. The R. tetragona mt genome sequence provides a novel mtDNA marker for studying the molecular epidemiology and population genetics of Raillietina and has implications for the molecular diagnosis of chicken cestodosis caused by Raillietina.
DNA barcode goes two-dimensions: DNA QR code web server.
Liu, Chang; Shi, Linchun; Xu, Xiaolan; Li, Huan; Xing, Hang; Liang, Dong; Jiang, Kun; Pang, Xiaohui; Song, Jingyuan; Chen, Shilin
2012-01-01
The DNA barcoding technology uses a standard region of DNA sequence for species identification and discovery. At present, "DNA barcode" actually refers to DNA sequences, which are not amenable to information storage, recognition, and retrieval. Our aim is to identify the best symbology that can represent DNA barcode sequences in practical applications. A comprehensive set of sequences for five DNA barcode markers (ITS2, rbcL, matK, psbA-trnH, and CO1) was used as the test data. Fifty-three different types of one-dimensional and ten two-dimensional barcode symbologies were compared based on different criteria, such as coding capacity, compression efficiency, and error detection ability. The quick response (QR) code was found to have the largest coding capacity and a relatively high compression ratio. To facilitate the further usage of QR code-based DNA barcodes, a web server was developed and is accessible at http://qrfordna.dnsalias.org. The web server allows users to retrieve the QR code for a species of interest, convert a DNA sequence to and from a QR code, and perform species identification based on local and global sequence similarities. In summary, the first comprehensive evaluation of various barcode symbologies has been carried out. The QR code has been found to be the most appropriate symbology for DNA barcode sequences. A web server has also been constructed to allow biologists to utilize QR codes in practical DNA barcoding applications.
Turco, Gina; Schnable, James C.; Pedersen, Brent; Freeling, Michael
2013-01-01
Conserved non-coding sequences (CNS) are islands of non-coding sequence that, like protein coding exons, show less divergence in sequence between related species than functionless DNA. Several CNSs have been demonstrated experimentally to function as cis-regulatory regions. However, the specific functions of most CNSs remain unknown. Previous searches for CNS in plants have either anchored on exons and only identified nearby sequences or required years of painstaking manual annotation. Here we present an open source tool that can accurately identify CNSs between any two related species with sequenced genomes, including both those immediately adjacent to exons and distal sequences separated by >12 kb of non-coding sequence. We have used this tool to characterize new motifs, associate CNSs with additional functions, and identify previously undetected genes encoding RNA and protein in the genomes of five grass species. We provide a list of 15,363 orthologous CNSs conserved across all grasses tested. We were also able to identify regulatory sequences present in the common ancestor of grasses that have been lost in one or more extant grass lineages. Lists of orthologous gene pairs and associated CNSs are provided for reference inbred lines of arabidopsis, Japonica rice, foxtail millet, sorghum, brachypodium, and maize. PMID:23874343
Reicher, S; Seroussi, E; Weller, J I; Rosov, A; Gootwine, E
2012-07-01
Polymorphisms in mitochondrial DNA (mtDNA) protein- and tRNA-coding genes were shown to be associated with various diseases in humans as well as with production and reproduction traits in livestock. Alignment of full-length mitochondrial sequences from the 5 known ovine haplogroups: HA (n = 3), HB (n = 5), HC (n = 3), HD (n = 2), and HE (n = 2; GenBank accession nos. HE577847-50 and 11 published complete ovine mitochondria sequences) revealed sequence variation in 10 out of the 13 protein coding mtDNA sequences. Twenty-six of the 245 variable sites found in the protein coding sequences represent non-synonymous mutations. Sequence variation was observed also in 8 out of the 22 tRNA mtDNA sequences. On the basis of the mtDNA control region and cytochrome b partial sequences along with information on maternal lineages within an Afec-Assaf flock, 1,126 Afec-Assaf ewes were assigned to mitochondrial haplogroups HA, HB, and HC, with frequencies of 0.43, 0.43, and 0.14, respectively. Analysis of birth weight and growth rate records of lambs (n = 1,286) and productivity from 4,993 lambing records revealed no association between mitochondrial haplogroup affiliation and female longevity, lamb perinatal survival rate, birth weight, and daily growth rate of lambs up to 150 d that averaged 1,664 d, 88.3%, 4.5 kg, and 320 g/d, respectively. However, significant (P < 0.0001) differences among the haplogroups were found for prolificacy of ewes, with prolificacies (mean ± SE) of 2.14 ± 0.04, 2.25 ± 0.04, and 2.30 ± 0.06 lambs born/ewe lambing for the HA, HB, and the HC haplogroups, respectively. Our results highlight the ovine mitogenome genetic variation in protein- and tRNA-coding genes and suggest that sequence variation in ovine mtDNA is associated with variation in ewe prolificacy.
DNA Barcode Goes Two-Dimensions: DNA QR Code Web Server
Li, Huan; Xing, Hang; Liang, Dong; Jiang, Kun; Pang, Xiaohui; Song, Jingyuan; Chen, Shilin
2012-01-01
The DNA barcoding technology uses a standard region of DNA sequence for species identification and discovery. At present, “DNA barcode” actually refers to DNA sequences, which are not amenable to information storage, recognition, and retrieval. Our aim is to identify the best symbology that can represent DNA barcode sequences in practical applications. A comprehensive set of sequences for five DNA barcode markers (ITS2, rbcL, matK, psbA-trnH, and CO1) was used as the test data. Fifty-three different types of one-dimensional and ten two-dimensional barcode symbologies were compared based on different criteria, such as coding capacity, compression efficiency, and error detection ability. The quick response (QR) code was found to have the largest coding capacity and a relatively high compression ratio. To facilitate the further usage of QR code-based DNA barcodes, a web server was developed and is accessible at http://qrfordna.dnsalias.org. The web server allows users to retrieve the QR code for a species of interest, convert a DNA sequence to and from a QR code, and perform species identification based on local and global sequence similarities. In summary, the first comprehensive evaluation of various barcode symbologies has been carried out. The QR code has been found to be the most appropriate symbology for DNA barcode sequences. A web server has also been constructed to allow biologists to utilize QR codes in practical DNA barcoding applications. PMID:22574113
The primary structure of the Saccharomyces cerevisiae gene for 3-phosphoglycerate kinase.
Hitzeman, R A; Hagie, F E; Hayflick, J S; Chen, C Y; Seeburg, P H; Derynck, R
1982-01-01
The DNA sequence of the gene for the yeast glycolytic enzyme, 3-phosphoglycerate kinase (PGK), has been obtained by sequencing part of a 3.1 kbp HindIII fragment obtained from the yeast genome. The structural gene sequence corresponds to a reading frame of 1251 bp coding for 416 amino acids with no intervening DNA sequences. The amino acid sequence is approximately 65 percent homologous with human and horse PGK protein sequences and is in general agreement with the published protein sequence for yeast PGK. As for other highly expressed structural genes in yeast, the coding sequence is highly codon biased with 95 percent of the amino acids coded for by a select 25 codons (out of 61 possible). Besides structural DNA sequence, 291 bp of 5'-flanking sequence and 286 bp of 3'-flanking sequence were determined. Transcription starts 36 nucleotides upstream from the translational start and stops 86-93 nucleotides downstream from the translational stop. These results suggest a non-polyadenylated mRNA length of 1373 to 1380 nucleotides, which is consistent with the observed length of 1500 nucleotides for polyadenylated PGK mRNA. A sequence TATATATAAA is found at 145 nucleotides upstream from the translational start. This sequence resembles the TATAAA box that is possibly associated with RNA polymerase II binding. PMID:6296791
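The length figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below is only an illustration: it assumes that the 1251 bp reading frame includes the stop codon (416 codons plus one stop) and that the non-polyadenylated transcript is the sum of the 5' untranslated stretch, the reading frame, and the 3' untranslated stretch. The function and constant names are hypothetical, not taken from the paper.

```python
# Figures quoted in the abstract.
AMINO_ACIDS = 416          # residues encoded by the reading frame
ORF_BP = 1251              # reported reading-frame length in bp
UTR5_NT = 36               # transcription starts 36 nt upstream of the start codon
UTR3_NT_RANGE = (86, 93)   # transcription stops 86-93 nt downstream of the stop codon


def orf_length(amino_acids, include_stop=True):
    """Length in nt of a reading frame (assumption: one stop codon is counted)."""
    codons = amino_acids + (1 if include_stop else 0)
    return 3 * codons


def mrna_length_range(utr5, orf, utr3_range):
    """Shortest and longest transcript length for a range of 3' end positions."""
    return tuple(utr5 + orf + utr3 for utr3 in utr3_range)


if __name__ == "__main__":
    print(orf_length(AMINO_ACIDS))                               # 1251
    print(mrna_length_range(UTR5_NT, ORF_BP, UTR3_NT_RANGE))     # (1373, 1380)
```

Under these assumptions, the gap between the computed 1373 to 1380 nucleotides and the observed 1500 nucleotides would be about 120 to 127 nucleotides, which is the part attributable to the poly(A) tail.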
Tsuchiya, Mariko; Amano, Kojiro; Abe, Masaya; Seki, Misato; Hase, Sumitaka; Sato, Kengo; Sakakibara, Yasubumi
2016-06-15
Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After obtaining sequence reads, the short reads are mapped to a reference genome, and specific mapping patterns can be detected called read mapping profiles, which are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs. We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5'-end processing and 3'-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain. The source code of our program SHARAKU is available at http://www.dna.bio.keio.ac.jp/sharaku/, and the simulated dataset used in this work is available at the same link. Accession code: The sequence data from the whole RNA transcripts in the hippocampus of the left brain used in this work is available from the DNA DataBank of Japan (DDBJ) Sequence Read Archive (DRA) under the accession number DRA004502. 
Contact: yasu@bio.keio.ac.jp. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
2018-01-01
FAM230C, a long intergenic non-coding RNA (lincRNA) gene in human chromosome 13 (chr13), is a member of a group of lincRNA genes termed family with sequence similarity 230. An analysis using bioinformatics search tools and alignment programs was undertaken to determine properties of FAM230C and its related genes. Results reveal that the DNA translocation element, the Translocation Breakpoint Type A (TBTA) sequence, which consists of satellite DNA, Alu elements, and AT-rich sequences, is embedded in the FAM230C gene. Eight lincRNA genes related to FAM230C also carry the TBTA sequences. These genes were formed from a large segment of the 3’ half of the FAM230C sequence duplicated in chr22, and are specifically in regions of low copy repeats (LCR22s), in or close to the 22q11.2 region. 22q11.2 is a chromosomal segment that undergoes a high rate of DNA translocation and is prone to genetic deletions. FAM230C-related genes present in other chromosomes do not carry the TBTA motif and were formed from the 5’ half region of the FAM230C sequence. These findings identify a high specificity in lincRNA gene formation by gene sequence duplication in different chromosomes. PMID:29668722
Kowalski, Madzia P.; Baylis, Howard A.; Krude, Torsten
2015-01-01
Stem bulge RNAs (sbRNAs) are a family of small non-coding stem-loop RNAs present in Caenorhabditis elegans and other nematodes, the function of which is unknown. Here, we report the first functional characterisation of nematode sbRNAs. We demonstrate that sbRNAs from a range of nematode species are able to reconstitute the initiation of chromosomal DNA replication in the presence of replication proteins in vitro, and that conserved nucleotide sequence motifs are essential for this function. By functionally inactivating sbRNAs with antisense morpholino oligonucleotides, we show that sbRNAs are required for S phase progression, early embryonic development and the viability of C. elegans in vivo. Thus, we demonstrate a new and essential role for sbRNAs during the early development of C. elegans. sbRNAs show limited nucleotide sequence similarity to vertebrate Y RNAs, which are also essential for the initiation of DNA replication. Our results therefore establish that the essential function of small non-coding stem-loop RNAs during DNA replication extends beyond vertebrates. PMID:25908866
Position specific variation in the rate of evolution in transcription factor binding sites
Moses, Alan M; Chiang, Derek Y; Kellis, Manolis; Lander, Eric S; Eisen, Michael B
2003-01-01
Background: The binding sites of sequence specific transcription factors are an important and relatively well-understood class of functional non-coding DNAs. Although a wide variety of experimental and computational methods have been developed to characterize transcription factor binding sites, they remain difficult to identify. Comparison of non-coding DNA from related species has shown considerable promise in identifying these functional non-coding sequences, even though relatively little is known about their evolution. Results: Here we analyse the genome sequences of the budding yeasts Saccharomyces cerevisiae, S. bayanus, S. paradoxus and S. mikatae to study the evolution of transcription factor binding sites. As expected, we find that both experimentally characterized and computationally predicted binding sites evolve slower than surrounding sequence, consistent with the hypothesis that they are under purifying selection. We also observe position-specific variation in the rate of evolution within binding sites. We find that the position-specific rate of evolution is positively correlated with degeneracy among binding sites within S. cerevisiae. We test theoretical predictions for the rate of evolution at positions where the base frequencies deviate from background due to purifying selection and find reasonable agreement with the observed rates of evolution. Finally, we show how the evolutionary characteristics of real binding motifs can be used to distinguish them from artefacts of computational motif finding algorithms. Conclusion: As has been observed for protein sequences, the rate of evolution in transcription factor binding sites varies with position, suggesting that some regions are under stronger functional constraint than others. This variation likely reflects the varying importance of different positions in the formation of the protein-DNA complex. The characterization of the pattern of evolution in known binding sites will likely contribute to the effective use of comparative sequence data in the identification of transcription factor binding sites and is an important step toward understanding the evolution of functional non-coding DNA. PMID:12946282
NASA Astrophysics Data System (ADS)
Walker, David Lee
1999-12-01
This study uses dynamical analysis to examine in a quantitative fashion the information coding mechanism in DNA sequences. This goes beyond simply modeling the mechanism by comparing DNA sequence walks to fractional Brownian motion (fBm) processes. The 2-D mappings of the DNA sequences for this research are Iterated Function System (IFS) mappings, also known as the "Chaos Game Representation" (CGR). This technique converts a 1-D sequence into a 2-D representation that preserves subsequence structure and provides a visual representation. The second step of this analysis involves the application of Wavelet Packet Transforms, a recently developed technique from the field of signal processing. A multi-fractal model is built by using wavelet transforms to estimate the Hurst exponent, H. The Hurst exponent is a non-parametric measurement of the dynamism of a system. This procedure is used to evaluate gene-coding events in the DNA sequence of cystic fibrosis mutations. The H exponent is calculated for various mutation sites in this gene. The results of this study indicate the presence of anti-persistent, random-walk, and persistent "sub-periods" in the sequence. This indicates that the hypothesis of a multi-fractal model of DNA information encoding warrants further consideration. This work examines the model's behavior in both pathological (mutated) and non-pathological (healthy) base pair sequences of the cystic fibrosis gene. These mutations, both natural and synthetic, were introduced by computer manipulation of the original base pair text files. The results show that disease severity and system "information dynamics" correlate. These results have implications for genetic engineering as well as for mathematical biology. They suggest that there is scope for more multi-fractal models to be developed.
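The Chaos Game Representation mentioned in this abstract can be sketched in a few lines. The version below is a minimal illustration, not the author's implementation: it assumes a commonly used corner assignment (A, C, G, T at the corners of the unit square), starts from the centre of the square, and moves halfway toward the corner of each successive nucleotide. The names CORNERS and chaos_game_points are hypothetical.

```python
# Assumed corner assignment on the unit square (a common convention,
# not stated in the abstract).
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def chaos_game_points(sequence):
    """Map a 1-D DNA sequence to a list of 2-D points.

    Each point is the midpoint between the previous point and the corner
    belonging to the current nucleotide, so sequences that share a suffix
    end up in the same sub-square.
    """
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in chaos_game_points("AG"):
        print(point)
```

Because every step halves the distance to a corner, the last k nucleotides determine which of the 4^k sub-squares a point falls into, which is why the mapping preserves subsequence structure.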
DOE Office of Scientific and Technical Information (OSTI.GOV)
von Nickisch-Rosenegk, Markus; Brown, Wesley M.; Boore, Jeffrey L.
2001-01-01
Using "long-PCR" we have amplified in overlapping fragments the complete mitochondrial genome of the tapeworm Hymenolepis diminuta (Platyhelminthes: Cestoda) and determined its 13,900 nucleotide sequence. The gene content is the same as that typically found for animal mitochondrial DNA (mtDNA) except that atp8 appears to be lacking, a condition found previously for several other animals. Despite the small size of this mtDNA, there are two large non-coding regions, one of which contains 13 repeats of a 31 nucleotide sequence and a potential stem-loop structure of 25 base pairs with an 11-member loop. Large potential secondary structures are identified also for the non-coding regions of two other cestode mtDNAs. Comparison of the mitochondrial gene arrangement of H. diminuta with those previously published supports a phylogenetic position of flatworms as members of the Eutrochozoa, rather than being basal to either a clade of protostomes or a clade of coelomates.
Association of Amine-Receptor DNA Sequence Variants with Associative Learning in the Honeybee.
Lagisz, Malgorzata; Mercer, Alison R; de Mouzon, Charlotte; Santos, Luana L S; Nakagawa, Shinichi
2016-03-01
Octopamine- and dopamine-based neuromodulatory systems play a critical role in learning and learning-related behaviour in insects. To further our understanding of these systems and resulting phenotypes, we quantified DNA sequence variations at six loci coding for octopamine and dopamine receptors and their association with aversive and appetitive learning traits in a population of honeybees. We identified 79 polymorphic sequence markers (mostly SNPs and a few insertions/deletions) located within or close to six candidate genes. Intriguingly, we found that levels of sequence variation in the protein-coding regions studied were low, indicating that sequence variation in the coding regions of receptor genes critical to learning and memory is strongly selected against. Non-coding and upstream regions of the same genes, however, were less conserved, and sequence variations in these regions were weakly associated with between-individual differences in learning-related traits. While these associations do not directly imply a specific molecular mechanism, they suggest that the cross-talk between dopamine and octopamine signalling pathways may influence olfactory learning and memory in the honeybee.
NASA Astrophysics Data System (ADS)
Lestari, D.; Bustamam, A.; Novianti, T.; Ardaneswari, G.
2017-07-01
A DNA sequence can be defined as a succession of letters representing the order of nucleotides within DNA, drawn from the four DNA base codes adenine (A), guanine (G), cytosine (C), and thymine (T). The precise code of a sequence is determined using DNA sequencing methods and technologies, which have been developed since the 1970s and have now become highly advanced, high-throughput technologies. DNA sequencing has greatly accelerated biological and medical research and discovery. However, in some cases DNA sequencing produces ambiguous results, making it difficult to determine whether a given code is A, T, G, or C. To address this problem, in this study we introduce another representation of DNA codes, namely the Quaternion Q = (PA, PT, PG, PC), where PA, PT, PG, PC are the probabilities that the bases A, T, G, C appear in Q, and PA + PT + PG + PC = 1. Using Quaternion representations, we are able to construct an improved scoring matrix for global sequence alignment by applying a dot product method. This scoring matrix produces better, higher-quality match and mismatch scores between two DNA base codes. In the implementation, we applied the Needleman-Wunsch global sequence alignment algorithm using Octave to analyze our target sequence, which contains some ambiguous sequence data. The subject sequences are the DNA sequences of Streptococcus pneumoniae families obtained from GenBank, while the target DNA sequence was received from our collaborator's database. We found that the Quaternion representations improve the quality of the sequence alignment score, and we conclude that the target DNA sequence has maximum similarity with Streptococcus pneumoniae.
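The idea of scoring probabilistic base codes with a dot product inside a Needleman-Wunsch alignment can be illustrated as follows. This is a rough sketch in Python rather than the Octave code the authors describe, and several details are assumptions made for illustration only: each position is a tuple (PA, PT, PG, PC), the pair score is rescaled as 2 * dot - 1 so that identical unambiguous bases score +1 and different ones score -1, and the gap penalty is -1. None of these values come from the abstract.

```python
BASE_ORDER = "ATGC"  # component order of Q = (PA, PT, PG, PC)


def unambiguous(base):
    """Probability vector for a base that was called with certainty."""
    return tuple(1.0 if b == base else 0.0 for b in BASE_ORDER)


def dot(p, q):
    """Dot product of two probability vectors."""
    return sum(a * b for a, b in zip(p, q))


def pair_score(p, q):
    """Assumed rescaling: +1 for identical bases, -1 for different ones."""
    return 2.0 * dot(p, q) - 1.0


def needleman_wunsch_score(seq_a, seq_b, gap=-1.0):
    """Optimal global alignment score for two lists of probability vectors."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1])
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap
            table[i][j] = max(diagonal, up, left)
    return table[-1][-1]


if __name__ == "__main__":
    subject = [unambiguous(b) for b in "GAT"]
    # Middle position is ambiguous: A or G with equal probability.
    target = [unambiguous("G"), (0.5, 0.0, 0.5, 0.0), unambiguous("T")]
    print(needleman_wunsch_score(subject, target))
```

With this scoring, an ambiguous position contributes a score between a full match and a full mismatch instead of being forced into one of the two, which is the practical benefit the abstract attributes to the probabilistic representation.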
RPS8—a New Informative DNA Marker for Phylogeny of Babesia and Theileria Parasites in China
Tian, Zhan-Cheng; Liu, Guang-Yuan; Yin, Hong; Luo, Jian-Xun; Guan, Gui-Quan; Luo, Jin; Xie, Jun-Ren; Shen, Hui; Tian, Mei-Yuan; Zheng, Jin-feng; Yuan, Xiao-song; Wang, Fang-fang
2013-01-01
Piroplasmosis is a serious debilitating and sometimes fatal disease. Phylogenetic relationships within piroplasmida are complex and remain unclear. We compared the intron–exon structure and DNA sequences of the RPS8 gene from Babesia and Theileria spp. isolates in China. Similar to 18S rDNA, the 40S ribosomal protein S8 gene, RPS8, including both coding and non-coding regions is a useful and novel genetic marker for defining species boundaries and for inferring phylogenies because it tends to have little intra-specific variation but considerable inter-specific difference. However, more samples are needed to verify the usefulness of the RPS8 (coding and non-coding regions) gene as a marker for the phylogenetic position and detection of most Babesia and Theileria species, particularly for some closely related species. PMID:24244571
The DNA Methylome of Human Peripheral Blood Mononuclear Cells
Ye, Mingzhi; Zheng, Hancheng; Yu, Jian; Wu, Honglong; Sun, Jihua; Zhang, Hongyu; Chen, Quan; Luo, Ruibang; Chen, Minfeng; He, Yinghua; Jin, Xin; Zhang, Qinghui; Yu, Chang; Zhou, Guangyu; Sun, Jinfeng; Huang, Yebo; Zheng, Huisong; Cao, Hongzhi; Zhou, Xiaoyu; Guo, Shicheng; Hu, Xueda; Li, Xin; Kristiansen, Karsten; Bolund, Lars; Xu, Jiujin; Wang, Wen; Yang, Huanming; Wang, Jian; Li, Ruiqiang; Beck, Stephan; Wang, Jun; Zhang, Xiuqing
2010-01-01
DNA methylation plays an important role in biological processes in human health and disease. Recent technological advances allow unbiased whole-genome DNA methylation (methylome) analysis to be carried out on human cells. Using whole-genome bisulfite sequencing at 24.7-fold coverage (12.3-fold per strand), we report a comprehensive (92.62%) methylome and analysis of the unique sequences in human peripheral blood mononuclear cells (PBMC) from the same Asian individual whose genome was deciphered in the YH project. PBMC constitute an important source for clinical blood tests world-wide. We found that 68.4% of CpG sites and <0.2% of non-CpG sites were methylated, demonstrating that non-CpG cytosine methylation is minor in human PBMC. Analysis of the PBMC methylome revealed a rich epigenomic landscape for 20 distinct genomic features, including regulatory, protein-coding, non-coding, RNA-coding, and repeat sequences. Integration of our methylome data with the YH genome sequence enabled a first comprehensive assessment of allele-specific methylation (ASM) between the two haploid methylomes of any individual and allowed the identification of 599 haploid differentially methylated regions (hDMRs) covering 287 genes. Of these, 76 genes had hDMRs within 2 kb of their transcriptional start sites of which >80% displayed allele-specific expression (ASE). These data demonstrate that ASM is a recurrent phenomenon and is highly correlated with ASE in human PBMCs. Together with recently reported similar studies, our study provides a comprehensive resource for future epigenomic research and confirms new sequencing technology as a paradigm for large-scale epigenomics studies. PMID:21085693
Informational structure of genetic sequences and nature of gene splicing
NASA Astrophysics Data System (ADS)
Trifonov, E. N.
1991-10-01
Only about 1/20 of the DNA of higher organisms codes for proteins by means of the classical triplet code. The rest of the DNA sequences are largely silent, with unclear functions, if any. The triplet code is not the only code (message) carried by the sequences. There are three levels of molecular communication, where the same sequence "talks" to various biomolecules while having, respectively, three different appearances: DNA, RNA and protein. Since the molecular structures and, hence, the sequence-specific preferences of these are substantially different, the original DNA sequence has to carry simultaneously three types of sequence patterns (codes, messages), thus being a composite structure in which one and the same letter (nucleotide) is frequently involved in several overlapping codes of different nature. This multiplicity and overlapping of the codes is a unique feature of Gnomic, the language of genetic sequences. The coexisting codes have to be degenerate in various degrees to allow an optimal and concerted performance of all the encoded functions. There is an obvious conflict between the best possible performance of a given function and the necessity to compromise the quality of a given sequence pattern in favor of other patterns. It appears that the major role of various changes in the sequences on their "ontogenetic" way from DNA to RNA to protein, like RNA editing and splicing, or protein post-translational modifications, is to resolve such conflicts. New data are presented strongly indicating that gene splicing is such a device to resolve the conflict between the code of DNA folding in chromatin and the triplet code for protein synthesis.
Goremykin, Vadim V; Lockhart, Peter J; Viola, Roberto; Velasco, Riccardo
2012-08-01
Mitochondrial genomes of spermatophytes are the largest of all organellar genomes. Their large size has been attributed to various factors; however, the relative contribution of these factors to mitochondrial DNA (mtDNA) expansion remains undetermined. We estimated their relative contribution in Malus domestica (apple). The mitochondrial genome of apple has a size of 396 947 bp and a one to nine ratio of coding to non-coding DNA, close to the corresponding average values for angiosperms. We determined that 71.5% of the apple mtDNA sequence was highly similar to sequences of its nuclear DNA. Using nuclear gene exons, nuclear transposable elements and chloroplast DNA as markers of promiscuous DNA content in mtDNA, we estimated that approximately 20% of the apple mtDNA consisted of DNA sequences imported from other cell compartments, mostly from the nucleus. Similar marker-based estimates of promiscuous DNA content in the mitochondrial genomes of other species ranged between 21.2 and 25.3% of the total mtDNA length for grape, between 23.1 and 38.6% for rice, and between 47.1 and 78.4% for maize. All these estimates are conservative, because they underestimate the import of non-functional DNA. We propose that the import of promiscuous DNA is a core mechanism for mtDNA size expansion in seed plants. In apple, maize and grape this mechanism contributed far more to genome expansion than did homologous recombination. In rice the estimated contribution of both mechanisms was found to be similar. © 2012 The Authors. The Plant Journal © 2012 Blackwell Publishing Ltd.
A Tandemly Arranged Pattern of Two 5S rDNA Arrays in Amolops mantzorum (Anura, Ranidae).
Liu, Ting; Song, Menghuan; Xia, Yun; Zeng, Xiaomao
2017-01-01
In an attempt to extend the knowledge of the 5S rDNA organization in anurans, the 5S rDNA sequences of Amolops mantzorum were isolated, characterized, and mapped by FISH. Two forms of 5S rDNA, type I (209 bp) and type II (about 870 bp), were found in specimens investigated from various populations. Both of them contained a 118-bp coding sequence and were readily differentiated by their non-transcribed spacer (NTS) sizes and compositions. Four probes (the 5S rDNA coding sequences, the type I NTS, the type II NTS, and the entire type II 5S rDNA sequences) were respectively labeled with TAMRA or digoxigenin to hybridize with mitotic chromosomes for samples of all localities. All probes showed the same signals, which appeared in every centromeric region and in the telomeric regions of chromosome 5, without differences within or between populations. Both type I and type II 5S rDNA arrays are evidently arranged in tandem, in contrast with other frogs or fishes recorded to date. More interestingly, all the probes detected centromeric regions in all karyotypes, suggesting the presence of a satellite DNA family derived from 5S rDNA. © 2017 S. Karger AG, Basel.
Pilotte, Nils; Papaiakovou, Marina; Grant, Jessica R; Bierwert, Lou Ann; Llewellyn, Stacey; McCarthy, James S; Williams, Steven A
2016-03-01
The soil transmitted helminths are a group of parasitic worms responsible for extensive morbidity in many of the world's most economically depressed locations. With growing emphasis on disease mapping and eradication, the availability of accurate and cost-effective diagnostic measures is of paramount importance to global control and elimination efforts. While real-time PCR-based molecular detection assays have shown great promise, to date, these assays have utilized sub-optimal targets. By performing next-generation sequencing-based repeat analyses, we have identified high copy-number, non-coding DNA sequences from a series of soil transmitted pathogens. We have used these repetitive DNA elements as targets in the development of novel, multi-parallel, PCR-based diagnostic assays. Utilizing next-generation sequencing and the Galaxy-based RepeatExplorer web server, we performed repeat DNA analysis on five species of soil transmitted helminths (Necator americanus, Ancylostoma duodenale, Trichuris trichiura, Ascaris lumbricoides, and Strongyloides stercoralis). Employing high copy-number, non-coding repeat DNA sequences as targets, novel real-time PCR assays were designed, and assays were tested against established molecular detection methods. Each assay provided consistent detection of genomic DNA at quantities of 2 fg or less, demonstrated species-specificity, and showed an improved limit of detection over the existing, proven PCR-based assay. The utilization of next-generation sequencing-based repeat DNA analysis methodologies for the identification of molecular diagnostic targets has the ability to improve assay species-specificity and limits of detection. By exploiting such high copy-number repeat sequences, the assays described here will facilitate soil transmitted helminth diagnostic efforts. We recommend similar analyses when designing PCR-based diagnostic tests for the detection of other eukaryotic pathogens.
Lozano, Gloria; Trenado, Helena P.; Fiallo-Olivé, Elvira; Chirinos, Dorys; Geraud-Pouey, Francis; Briddon, Rob W.; Navas-Castillo, Jesús
2016-01-01
Begomoviruses (family Geminiviridae) are whitefly-transmitted, plant-infecting single-stranded DNA viruses that cause crop losses throughout the warmer parts of the World. Sweepoviruses are a phylogenetically distinct group of begomoviruses that infect plants of the family Convolvulaceae, including sweet potato (Ipomoea batatas). Two classes of subviral molecules are often associated with begomoviruses, particularly in the Old World; the betasatellites and the alphasatellites. An analysis of sweet potato and Ipomoea indica samples from Spain and Merremia dissecta samples from Venezuela identified small non-coding subviral molecules in association with several distinct sweepoviruses. The sequences of 18 clones were obtained and found to be structurally similar to tomato leaf curl virus-satellite (ToLCV-sat, the first DNA satellite identified in association with a begomovirus), with a region with significant sequence identity to the conserved region of betasatellites, an A-rich sequence, a predicted stem–loop structure containing the nonanucleotide TAATATTAC, and a second predicted stem–loop. These sweepovirus-associated satellites join an increasing number of ToLCV-sat-like non-coding satellites identified recently. Although sharing some features with betasatellites, evidence is provided to suggest that the ToLCV-sat-like satellites are distinct from betasatellites and should be considered a separate class of satellites, for which the collective name deltasatellites is proposed. PMID:26925037
CRITICA: coding region identification tool invoking comparative analysis
NASA Technical Reports Server (NTRS)
Badger, J. H.; Olsen, G. J.; Woese, C. R. (Principal Investigator)
1999-01-01
Gene recognition is essential to understanding existing and future DNA sequence data. CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) is a suite of programs for identifying likely protein-coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis, regions of DNA are aligned with related sequences from the DNA databases; if the translation of the aligned sequences has greater amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from the relative frequencies of hexanucleotides in coding frames versus other contexts (i.e., dicodon bias). The dicodon usage information is derived by iterative analysis of the data, such that CRITICA is not dependent on the existence or accuracy of coding sequence annotations in the databases. This independence makes the method particularly well suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium DNA sequences. Its predictions were compared with the DNA sequence annotations and with the predictions of GenMark. CRITICA proved to be more accurate than GenMark, and moreover, many of its predictions that would seem to be errors instead reflect problems in the sequence databases. The source code of CRITICA is freely available by anonymous FTP (rdp.life.uiuc.edu in /pub/critica) and on the World Wide Web (http://rdpwww.life.uiuc.edu).
Reading of the non-template DNA by transcription elongation factors.
Svetlov, Vladimir; Nudler, Evgeny
2018-05-14
Unlike transcription initiation and termination, which have easily discernible signals such as promoters and terminators, elongation is regulated through a dynamic network involving RNA/DNA pause signals and states rather than sequence-specific protein interactions. A report by Nedialkov et al. (in press) provides experimental evidence for sequence-specific recruitment of elongation factor RfaH to transcribing RNA polymerase (RNAP) and outlines the mechanism of gene expression regulation by restraint ("locking") of the DNA non-template strand. According to this model, the elongation complex pauses at the so-called "operon polarity sequence" (found in some long bacterial operons coding for virulence genes), when the usually flexible non-template DNA strand adopts a distinct hairpin-loop conformation on the surface of transcribing RNAP. Sequence-specific binding of RfaH to this DNA segment facilitates conversion of RfaH from its inactive closed to its active open conformation. The interaction network formed between RfaH, non-template DNA, and RNAP locks DNA in a conformation that renders the elongation complex resistant to pausing and termination. The effects of such locking on transcript elongation can be mimicked by restraint of the non-template strand due to its shortening. This work advances our understanding of regulation of transcript elongation and has important implications for the action of general transcription factors, such as NusG, which lack apparent sequence-specificity, as well as for the mechanisms of other processes linked to transcription such as transcription-coupled DNA repair. © 2018 John Wiley & Sons Ltd.
Zhang, Ai-bing; Feng, Jie; Ward, Robert D; Wan, Ping; Gao, Qiang; Wu, Jun; Zhao, Wei-zhong
2012-01-01
Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95%CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95%CI: 96.60-99.37%) for 1094 brown algae queries, both using ITS barcodes.
Cartwright, Joseph F; Anderson, Karin; Longworth, Joseph; Lobb, Philip; James, David C
2018-06-01
High-fidelity replication of biologic-encoding recombinant DNA sequences by engineered mammalian cell cultures is an essential pre-requisite for the development of stable cell lines for the production of biotherapeutics. However, immortalized mammalian cells characteristically exhibit an increased point mutation frequency compared to mammalian cells in vivo, both across their genomes and at specific loci (hotspots). Thus unforeseen mutations in recombinant DNA sequences can arise and be maintained within producer cell populations. These may both affect the stability of recombinant gene expression and give rise to protein sequence variants with variable bioactivity and immunogenicity. Rigorous quantitative assessment of recombinant DNA integrity should therefore form part of the cell line development process and be an essential quality assurance metric for instances where synthetic/multi-component assemblies are utilized to engineer mammalian cells, such as the assessment of recombinant DNA fidelity or the mutability of single-site integration target loci. Based on Pacific Biosciences (Menlo Park, CA) single molecule real-time (SMRT™) circular consensus sequencing (CCS) technology we developed a rDNA sequence analysis tool to process the multi-parallel sequencing of ∼40,000 single recombinant DNA molecules. After statistical filtering of raw sequencing data, we show that this analytical method is capable of detecting single point mutations in rDNA to a minimum single mutation frequency of 0.0042% (<1/24,000 bases). Using a stable CHO transfectant pool harboring a randomly integrated 5 kb plasmid construct encoding GFP we found that 28% of recombinant plasmid copies contained at least one low frequency (<0.3%) point mutation. These mutations were predominantly found in GC base pairs (85%), and there was no positional bias in mutation across the plasmid sequence. There was no discernible difference between the mutation frequencies of coding and non-coding DNA. The putative ratio of non-synonymous and synonymous changes within the open reading frames (ORFs) in the plasmid sequence indicates that natural selection does not impact upon the prevalence of these mutations. Here we have demonstrated the abundance of mutations that fall outside of the reported range of detection of next generation sequencing (NGS) and second generation sequencing (SGS) platforms, providing a methodology capable of being utilized in cell line development platforms to identify the fidelity of recombinant genes throughout the production process. © 2018 Wiley Periodicals, Inc.
Phylogenetic Network for European mtDNA
Finnilä, Saara; Lehtonen, Mervi S.; Majamaa, Kari
2001-01-01
The sequence in the first hypervariable segment (HVS-I) of the control region has been used as a source of evolutionary information in most phylogenetic analyses of mtDNA. Population genetic inference would benefit from a better understanding of the variation in the mtDNA coding region, but, thus far, complete mtDNA sequences have been rare. We determined the nucleotide sequence in the coding region of mtDNA from 121 Finns, by conformation-sensitive gel electrophoresis and subsequent sequencing and by direct sequencing of the D loop. Furthermore, 71 sequences from our previous reports were included, so that the samples represented all the mtDNA haplogroups present in the Finnish population. We found a total of 297 variable sites in the coding region, which allowed the compilation of unambiguous phylogenetic networks. The D loop harbored 104 variable sites, and, in most cases, these could be localized within the coding-region networks, without discrepancies. Interestingly, many homoplasies were detected in the coding region. Nucleotide variation in the rRNA and tRNA genes was 6%, and that in the third nucleotide positions of structural genes amounted to 22% of that in the HVS-I. The complete networks enabled the relationships between the mtDNA haplogroups to be analyzed. Phylogenetic networks based on the entire coding-region sequence in mtDNA provide a rich source for further population genetic studies, and complete sequences make it easier to differentiate between disease-causing mutations and rare polymorphisms. PMID:11349229
Kazakoff, Stephen H.; Imelfort, Michael; Edwards, David; Koehorst, Jasper; Biswas, Bandana; Batley, Jacqueline; Scott, Paul T.; Gresshoff, Peter M.
2012-01-01
Pongamia pinnata (syn. Millettia pinnata) is a novel, fast-growing arboreal legume that bears prolific quantities of oil-rich seeds suitable for the production of biodiesel and aviation biofuel. Here, we have used Illumina® ‘Second Generation DNA Sequencing (2GS)’ and a new short-read de novo assembler, SaSSY, to assemble and annotate the Pongamia chloroplast (152,968 bp; cpDNA) and mitochondrial (425,718 bp; mtDNA) genomes. We also show that SaSSY can be used to accurately assemble 2GS data, by re-assembling the Lotus japonicus cpDNA and in the process assemble its mtDNA (380,861 bp). The Pongamia cpDNA contains 77 unique protein-coding genes and is almost 60% gene-dense. It contains a 50 kb inversion common to other legumes, as well as a novel 6.5 kb inversion that is responsible for the non-disruptive, re-orientation of five protein-coding genes. Additionally, two copies of an inverted repeat firmly place the species outside the subclade of the Fabaceae lacking the inverted repeat. The Pongamia and L. japonicus mtDNA contain just 33 and 31 unique protein-coding genes, respectively, and like other angiosperm mtDNA, have expanded intergenic and multiple repeat regions. Through comparative analysis with Vigna radiata we measured the average synonymous and non-synonymous divergence of all three legume mitochondrial (1.59% and 2.40%, respectively) and chloroplast (8.37% and 8.99%, respectively) protein-coding genes. Finally, we explored the relatedness of Pongamia within the Fabaceae and showed the utility of the organellar genome sequences by mapping transcriptomic data to identify up- and down-regulated stress-responsive gene candidates and confirm in silico predicted RNA editing sites. PMID:23272141
Analysis of 16S-23S rRNA intergenic spacer regions of Vibrio cholerae and Vibrio mimicus.
Chun, J; Huq, A; Colwell, R R
1999-05-01
Vibrio cholerae identification based on molecular sequence data has been hampered by a lack of sequence variation from the closely related Vibrio mimicus. The two species share many genes coding for proteins, such as ctxAB, and show almost identical 16S DNA coding for rRNA (rDNA) sequences. Primers targeting conserved sequences flanking the 3' end of the 16S and the 5' end of the 23S rDNAs were used to amplify the 16S-23S rRNA intergenic spacer regions of V. cholerae and V. mimicus. Two major (ca. 580 and 500 bp) and one minor (ca. 750 bp) amplicons were consistently generated for both species, and their sequences were determined. The largest fragment contains three tRNA genes (tDNAs) coding for tRNAGlu, tRNALys, and tRNAVal, a combination that has not previously been found in bacteria examined to date. The 580-bp amplicon contained tDNAIle and tDNAAla, whereas the 500-bp fragment had a single tDNA coding for either tRNAGlu or tRNAAla. Little variation, i.e., 0 to 0.4%, was found among V. cholerae O1 classical, O1 El Tor, and O139 epidemic strains. Slightly more variation was found against the non-O1/non-O139 serotypes (ca. 1% difference) and V. mimicus (2 to 3% difference). A pair of oligonucleotide primers was designed, based on the region differentiating all V. cholerae strains from V. mimicus. The PCR system developed was subsequently evaluated by using representatives of V. cholerae from environmental and clinical sources, and of other taxa, including V. mimicus. This study provides the first molecular tool for identifying the species V. cholerae.
Free Energy Gap and Statistical Thermodynamic Fidelity of DNA Codes
2007-10-01
reverse-complement unless otherwise stated. For strand x, let Nx denote its complement. A (perfect) Watson-Crick duplex is the joining of complement... is possible for complementary sequences to form a non-perfectly aligned duplex, we will call any x W Nx duplex a Watson-Crick (WC) duplex. Two...
Free Energy Gap and Statistical Thermodynamic Fidelity of DNA Codes (Postprint)
2007-01-01
reverse-complement unless otherwise stated. For strand x, let Nx denote its complement. A (perfect) Watson-Crick duplex is the joining of complement... is possible for complementary sequences to form a non-perfectly aligned duplex, we will call any x W Nx duplex a Watson-Crick (WC) duplex. Two...
de Lange, Orlando; Wolf, Christina; Dietze, Jörn; Elsaesser, Janett; Morbitzer, Robert; Lahaye, Thomas
2014-01-01
The tandem repeats of transcription activator like effectors (TALEs) mediate sequence-specific DNA binding using a simple code. Naturally, TALEs are injected by Xanthomonas bacteria into plant cells to manipulate the host transcriptome. In the laboratory TALE DNA binding domains are reprogrammed and used to target a fused functional domain to a genomic locus of choice. Research into the natural diversity of TALE-like proteins may provide resources for the further improvement of current TALE technology. Here we describe TALE-like proteins from the endosymbiotic bacterium Burkholderia rhizoxinica, termed Bat proteins. Bat repeat domains mediate sequence-specific DNA binding with the same code as TALEs, despite less than 40% sequence identity. We show that Bat proteins can be adapted for use as transcription factors and nucleases and that sequence preferences can be reprogrammed. Unlike TALEs, the core repeats of each Bat protein are highly polymorphic. This feature allowed us to explore alternative strategies for the design of custom Bat repeat arrays, providing novel insights into the functional relevance of non-RVD residues. The Bat proteins offer fertile grounds for research into the creation of improved programmable DNA-binding proteins and comparative insights into TALE-like evolution. PMID:24792163
Kikhno, Irina
2014-01-01
Highly homologous sequences 154–157 bp in length grouped under the name of “conserved non-protein-coding element” (CNE) were revealed in all of the sequenced genomes of baculoviruses belonging to the genus Alphabaculovirus. A CNE alignment led to the detection of a set of highly conserved nucleotide clusters that occupy strictly conserved positions in the CNE sequence. The significant length of the CNE and conservation of both its length and cluster architecture were identified as a combination of characteristics that make this CNE different from known viral non-coding functional sequences. The essential role of the CNE in the Alphabaculovirus life cycle was demonstrated through the use of a CNE-knockout Autographa californica multiple nucleopolyhedrovirus (AcMNPV) bacmid. It was shown that the essential function of the CNE was not mediated by the presumed expression activities of the protein- and non-protein-coding genes that overlap the AcMNPV CNE. On the basis of the presented data, the AcMNPV CNE was categorized as a complex-structured, polyfunctional genomic element involved in an essential DNA transaction that is associated with an undefined function of the baculovirus genome. PMID:24740153
A deep learning method for lincRNA detection using auto-encoder algorithm.
Yu, Ning; Yu, Zeng; Pan, Yi
2017-12-06
The RNA sequencing technique (RNA-seq) enables scientists to develop novel data-driven methods for discovering more unidentified lincRNAs. Meanwhile, knowledge-based technologies are experiencing a potential revolution ignited by the new deep learning methods. By scanning the newly found data set from RNA-seq, scientists have found that: (1) the expression of lincRNAs appears to be regulated, that is, the relevance exists along the DNA sequences; (2) lincRNAs contain some conserved patterns/motifs tethered together by non-conserved regions. These two pieces of evidence give the reasoning for adopting knowledge-based deep learning methods in lincRNA detection. Similar to coding region transcription, non-coding regions are split at transcriptional sites. However, regulatory RNAs rather than messenger RNAs are generated. That is, the transcribed RNAs participate in biological processes as regulatory units instead of generating proteins. Identifying these transcriptional regions from non-coding regions is the first step towards lincRNA recognition. The auto-encoder method achieves 100% and 92.4% prediction accuracy on transcription sites over the putative data sets. The experimental results also show the excellent performance of the predictive deep neural network on the lincRNA data sets compared with a support vector machine and a traditional neural network. In addition, the method is validated on the newly discovered lincRNA data set, and one unreported transcription site is found by feeding the whole annotated sequences through the deep learning machine, which indicates that the deep learning method has extensive ability for lincRNA prediction. The transcriptional sequences of lincRNAs are collected from the annotated human DNA genome data. Subsequently, a two-layer deep neural network is developed for lincRNA detection, which adopts the auto-encoder algorithm and utilizes different encoding schemes to obtain the best performance over intergenic DNA sequence data. Driven by those newly annotated lincRNA data, deep learning methods based on the auto-encoder algorithm can exert their capability in knowledge learning in order to capture the useful features and the information correlation along DNA genome sequences for lincRNA detection. To our knowledge, this is the first application of deep learning techniques to identifying lincRNA transcription sequences.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Boore, Jeffrey L.; Medina, Monica; Rosenberg, Lewis A.
2004-01-31
We have determined the complete sequence of the mitochondrial genome of the scaphopod mollusk Graptacme eborea (Conrad, 1846) (14,492 nts) and completed the sequence of the mitochondrial genome of the bivalve mollusk Mytilus edulis Linnaeus, 1758 (16,740 nts). (The name Graptacme eborea is a revision of the species formerly known as Dentalium eboreum.) G. eborea mtDNA contains the 37 genes that are typically found and has the genes divided about evenly between the two strands, but M. edulis contains an extra trnM and is missing atp8, and has all genes on the same strand. Each has a highly rearranged gene order relative to each other and to all other studied mtDNAs. G. eborea mtDNA has almost no strand skew, but the coding strand of M. edulis mtDNA is very rich in G and T. This is reflected in differential codon usage patterns and even in amino acid compositions. G. eborea mtDNA has fewer non-coding nucleotides than any other mtDNA studied to date, with the largest non-coding region being only 24 nt long. Phylogenetic analysis using 2,420 aligned amino acid positions of concatenated proteins weakly supports an association of the scaphopod with gastropods to the exclusion of Bivalvia, Cephalopoda, and Polyplacophora, but is generally unable to convincingly resolve the relationships among major groups of the Lophotrochozoa, in contrast to the good resolution seen for several other major metazoan groups.
Many human accelerated regions are developmental enhancers
Capra, John A.; Erwin, Genevieve D.; McKinsey, Gabriel; Rubenstein, John L. R.; Pollard, Katherine S.
2013-01-01
The genetic changes underlying the dramatic differences in form and function between humans and other primates are largely unknown, although it is clear that gene regulatory changes play an important role. To identify regulatory sequences with potentially human-specific functions, we and others used comparative genomics to find non-coding regions conserved across mammals that have acquired many sequence changes in humans since divergence from chimpanzees. These regions are good candidates for performing human-specific regulatory functions. Here, we analysed the DNA sequence, evolutionary history, histone modifications, chromatin state and transcription factor (TF) binding sites of a combined set of 2649 non-coding human accelerated regions (ncHARs) and predicted that at least 30% of them function as developmental enhancers. We prioritized the predicted ncHAR enhancers using analysis of TF binding site gain and loss, along with the functional annotations and expression patterns of nearby genes. We then tested both the human and chimpanzee sequence for 29 ncHARs in transgenic mice, and found 24 novel developmental enhancers active in both species, 17 of which had very consistent patterns of activity in specific embryonic tissues. Of these ncHAR enhancers, five drove expression patterns suggestive of different activity for the human and chimpanzee sequence at embryonic day 11.5. The changes to human non-coding DNA in these ncHAR enhancers may modify the complex patterns of gene expression necessary for proper development in a human-specific manner and are thus promising candidates for understanding the genetic basis of human-specific biology. PMID:24218637
Converting Panax ginseng DNA and chemical fingerprints into two-dimensional barcode.
Cai, Yong; Li, Peng; Li, Xi-Wen; Zhao, Jing; Chen, Hai; Yang, Qing; Hu, Hao
2017-07-01
In this study, we investigated how to convert the Panax ginseng DNA sequence code and chemical fingerprints into a two-dimensional code. In order to improve the compression efficiency, GATC2Bytes and digital merger compression algorithms are proposed. HPLC chemical fingerprint data of 10 groups of P. ginseng from Northeast China and the internal transcribed spacer 2 (ITS2) sequence code as the DNA sequence code were prepared for conversion. In order to convert such data into a two-dimensional code, the following six steps were performed: First, the chemical fingerprint characteristic data sets were obtained through the inflection filtering algorithm. Second, these data sets were precompressed. Third, the P. ginseng DNA (ITS2) sequence codes were precompressed. Fourth, the precompressed chemical fingerprint data and the DNA (ITS2) sequence code were combined in accordance with the set data format. Fifth, the combined data were compressed with Zlib, an open-source data compression library. Finally, the compressed data generated a two-dimensional code called a quick response code (QR code). Through this conversion process, the number of bytes needed for storing P. ginseng chemical fingerprints and its DNA (ITS2) sequence code can be greatly reduced. After GTCA2Bytes algorithm processing, the ITS2 compression rate reaches 75% and the chemical fingerprint compression rate exceeds 99.65% via filtration and digital merger compression algorithm processing. Therefore, the overall compression ratio even exceeds 99.36%. The capacity of the formed QR code is around 0.5k, which can easily and successfully be read and identified by any smartphone. P. ginseng chemical fingerprints and its DNA (ITS2) sequence code can form a QR code after data processing, and therefore the QR code can be a perfect carrier of the authenticity and quality of P. ginseng information. This study provides a theoretical basis for the development of a quality traceability system of traditional Chinese medicine based on a two-dimensional code.
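The precompression and Zlib steps described above can be illustrated with a minimal sketch. The abstract does not give the actual GATC2Bytes mapping, so the 2-bit code below (`BASE_TO_BITS`) and the function names are hypothetical. Packing four bases into one byte is, however, consistent with the reported 75% reduction for the ITS2 sequence, since each base would otherwise occupy one byte as text.

```python
import zlib

# Hypothetical 2-bit code per base; the paper's real GATC2Bytes table is not
# given in the abstract, so this mapping is an assumption for illustration.
BASE_TO_BITS = {"G": 0, "A": 1, "T": 2, "C": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(seq):
    """Pack an unambiguous DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking can read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data, length):
    """Invert pack_bases; `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 3])
    return "".join(bases[:length])


def compress_record(seq):
    """Precompress by 2-bit packing, then apply zlib at the highest level."""
    return zlib.compress(pack_bases(seq), 9)
```

The original length has to be stored alongside the packed bytes, because the last byte may carry padding bits. For very short inputs the zlib header can outweigh the savings, so the zlib stage pays off mainly on the combined fingerprint-plus-sequence record.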
Genomics dataset of unidentified disclosed isolates.
Rekadwad, Bhagwan N
2016-09-01
Analysis of DNA sequences is necessary for higher hierarchical classification of organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset was chosen to find complexities in the unidentified DNA in disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. Quick response (QR) codes were generated, and the AT/GC content of the DNA sequences was analyzed. The QR codes are helpful for quick identification of isolates. AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset on cleavage codes and enzyme codes from a restriction digestion study is reported, which is helpful for performing studies using short DNA sequences. The dataset disclosed here provides new data for the exploration of unique DNA sequences for evaluation, identification, comparison and analysis.
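The AT/GC content mentioned above is a simple base-composition ratio. The following is a minimal sketch, not the dataset author's procedure; the function name `gc_content` and the choice to ignore ambiguous symbols such as N are assumptions.

```python
def gc_content(seq):
    """Return the fraction of G and C among the unambiguous bases of seq.

    Ambiguous symbols (for example N) are skipped rather than counted,
    which is an assumption of this sketch.
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)
```

The AT fraction is then `1 - gc_content(seq)`. A higher GC fraction generally corresponds to a higher duplex melting temperature, which is the link to thermal stability that the abstract alludes to.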
DNABIT Compress - Genome compression algorithm.
Rajarajeswari, Pothuraju; Apparao, Allam
2011-01-22
Data compression is concerned with how information is organized in data. Efficient storage means removal of redundancy from the data being stored in the DNA molecule. Data compression algorithms remove redundancy and are used to understand biologically important molecules. We present a compression algorithm, "DNABIT Compress", for DNA sequences, based on a novel scheme of assigning binary bits to smaller segments of DNA bases in order to compress both repetitive and non-repetitive DNA sequences. Our proposed algorithm achieves the best compression ratio for DNA sequences of larger genomes. Significantly better compression results show that the "DNABIT Compress" algorithm is the best among the remaining compression algorithms. While achieving the best compression ratios for DNA sequences (genomes), our new DNABIT Compress algorithm significantly improves on the running time of all previous DNA compression programs. Assigning binary bits (a unique bit code) to exact-repeat and reverse-repeat fragments of a DNA sequence is also a concept introduced in this algorithm for the first time in DNA compression. The proposed algorithm achieves a compression ratio as low as 1.58 bits/base, whereas the best existing methods could not achieve a ratio below 1.72 bits/base.
Correlation approach to identify coding regions in DNA sequences
NASA Technical Reports Server (NTRS)
Ossadnik, S. M.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Mantegna, R. N.; Peng, C. K.; Simons, M.; Stanley, H. E.
1994-01-01
Recently, it was observed that noncoding regions of DNA sequences possess long-range power-law correlations, whereas coding regions typically display only short-range correlations. We develop an algorithm based on this finding that enables investigators to perform a statistical analysis on long DNA sequences to locate possible coding regions. The algorithm is particularly successful in predicting the location of lengthy coding regions. For example, for the complete genome of yeast chromosome III (315,344 nucleotides), at least 82% of the predictions correspond to putative coding regions; the algorithm correctly identified all coding regions larger than 3000 nucleotides, 92% of coding regions between 2000 and 3000 nucleotides long, and 79% of coding regions between 1000 and 2000 nucleotides. The predictive ability of this new algorithm supports the claim that there is a fundamental difference in the correlation property between coding and noncoding sequences. This algorithm, which is not species-dependent, can be implemented with other techniques for rapidly and accurately locating relatively long coding regions in genomic sequences.
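The abstract does not spell out the algorithm, so the following is only a sketch of the kind of correlation measure involved: the root-mean-square fluctuation of a "DNA walk", as commonly used in the long-range correlation literature. The step convention (pyrimidine up, purine down) and the function names are assumptions. For an uncorrelated sequence this fluctuation grows roughly as the square root of the window length, while long-range correlations give a faster growth.

```python
import math


def dna_walk(seq):
    """Cumulative walk: +1 for a pyrimidine (C, T), -1 for a purine (A, G).

    The up/down convention is an assumption of this sketch.
    """
    positions = [0]
    current = 0
    for base in seq.upper():
        if base in "CT":
            current += 1
        elif base in "AG":
            current -= 1
        else:
            raise ValueError("unsupported symbol: " + base)
        positions.append(current)
    return positions


def fluctuation(seq, window):
    """Root-mean-square fluctuation of walk displacements over `window` steps."""
    walk = dna_walk(seq)
    deltas = [walk[i + window] - walk[i] for i in range(len(walk) - window)]
    if not deltas:
        raise ValueError("window is longer than the sequence")
    mean = sum(deltas) / len(deltas)
    variance = sum((d - mean) ** 2 for d in deltas) / len(deltas)
    return math.sqrt(variance)
```

Evaluating `fluctuation` over a range of window lengths inside a sliding segment, and fitting the slope on a log-log plot, gives a local scaling exponent. Segments whose exponent stays near 0.5 would be candidate coding regions under the premise stated in the abstract.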
Brain cDNA clone for human cholinesterase
DOE Office of Scientific and Technical Information (OSTI.GOV)
McTiernan, C.; Adkins, S.; Chatonnet, A.
1987-10-01
A cDNA library from human basal ganglia was screened with oligonucleotide probes corresponding to portions of the amino acid sequence of human serum cholinesterase. Five overlapping clones, representing 2.4 kilobases, were isolated. The sequenced cDNA contained 207 base pairs of coding sequence 5' to the amino terminus of the mature protein in which there were four ATG translation start sites in the same reading frame as the protein. Only the ATG coding for Met-(-28) lay within a favorable consensus sequence for functional initiators. There were 1722 base pairs of coding sequence corresponding to the protein found circulating in human serum. The amino acid sequence deduced from the cDNA exactly matched the 574 amino acid sequence of human serum cholinesterase, as previously determined by Edman degradation. Therefore, our clones represented cholinesterase rather than acetylcholinesterase. It was concluded that the amino acid sequences of cholinesterase from two different tissues, human brain and human serum, were identical. Hybridization of genomic DNA blots suggested that a single gene, or very few genes, coded for cholinesterase.
DNA Translator and Aligner: HyperCard utilities to aid phylogenetic analysis of molecules.
Eernisse, D J
1992-04-01
DNA Translator and Aligner are molecular phylogenetics HyperCard stacks for Macintosh computers. They manipulate sequence data to provide graphical gene mapping, conversions, translations and manual multiple-sequence alignment editing. DNA Translator is able to convert documented GenBank or EMBL sequences into linearized, rescalable gene maps whose gene sequences are extractable by clicking on the corresponding map button or by selection from a scrolling list. The gene maps provided, complete with extractable sequences, consist of nine metazoan, one yeast, and one ciliate mitochondrial DNAs and three green plant chloroplast DNAs. Single or multiple sequences can be manipulated to aid in phylogenetic analysis. Sequences can be translated between nucleic acids and proteins in either direction with flexible support of alternate genetic codes and ambiguous nucleotide symbols. Multiple aligned sequence output from diverse sources can be converted to Nexus, Hennig86 or PHYLIP format for subsequent phylogenetic analysis. Input or output alignments can be examined with Aligner, a convenient accessory stack included in the DNA Translator package. Aligner is an editor for the manual alignment of up to 100 sequences that toggles between display of matched characters and normal unmatched sequences. DNA Translator also generates graphic displays of amino acid coding and codon usage frequency relative to all other, or only synonymous, codons for approximately 70 select organism-organelle combinations. Codon usage data are compatible with spreadsheet or UWGCG formats for incorporation of additional molecules of interest. The complete package is available via anonymous ftp and is free for non-commercial uses.
Evidence of birth-and-death evolution of 5S rRNA gene in Channa species (Teleostei, Perciformes).
Barman, Anindya Sundar; Singh, Mamta; Singh, Rajeev Kumar; Lal, Kuldeep Kumar
2016-12-01
In higher eukaryotes, the minor rDNA family codes for 5S rRNA; it is arranged in tandem arrays and comprises a highly conserved 120 bp long coding sequence with a variable non-transcribed spacer (NTS). Initially, the 5S rDNA repeats were considered to have evolved by the process of concerted evolution. However, some recent reports, including those on teleost fishes, suggest that evolution of the 5S rDNA repeat does not fit the concerted evolution model and that evolution of the 5S rDNA family may be explained by a birth-and-death evolution model. In order to study the mode of evolution of 5S rDNA repeats in Perciformes fish species, the nucleotide sequence and molecular organization of 5S rDNA in five species of the genus Channa were analyzed in the present study. Molecular analyses revealed several variants of 5S rDNA repeats (four types of NTS), and networks created by a neighbor-net algorithm for each type of sequence (I, II, III and IV) did not show clear clustering in a species-specific manner. The stable secondary structure was predicted, and upstream and downstream conserved regulatory elements were characterized. Sequence analyses also showed the presence of two putative pseudogenes in Channa marulius. The present study supports the view that 5S rDNA repeats in the genus Channa evolved under the process of birth-and-death.
Chromatin accessibility prediction via a hybrid deep convolutional neural network.
Liu, Qiao; Xia, Fei; Yin, Qijin; Jiang, Rui
2018-03-01
A majority of known genetic variants associated with human-inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole genome level and precisely decipher their implications in a comprehensive manner. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it still remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type via automatic learning of the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model to learn the DNA sequence signature and further enable the identification of causative genetic variants has become essential in both genomic and genetic studies. We propose Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparisons with existing methods, we show the superior performance of our model in not only the classification of accessible regions against background sequences sampled at random, but also the regression of DNase-seq signals. Besides, we further visualize the convolutional kernels and show the match of identified sequence signatures and known motifs. We finally demonstrate the sensitivity of our model in finding causative noncoding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases. Deopen is freely available at https://github.com/kimmo1019/Deopen. Contact: ruijiang@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved.
ANN modeling of DNA sequences: new strategies using DNA shape code.
Parbhane, R V; Tambe, S S; Kulkarni, B D
2000-09-01
Two new encoding strategies, namely, wedge and twist codes, which are based on the DNA helical parameters, are introduced to represent DNA sequences in artificial neural network (ANN)-based modeling of biological systems. The performance of the new coding strategies has been evaluated by conducting three case studies involving mapping (modeling) and classification applications of ANNs. The proposed coding schemes have been compared rigorously and shown to outperform the existing coding strategies especially in situations wherein limited data are available for building the ANN models.
Toren, Dmitri; Barzilay, Thomer; Tacutu, Robi; Lehmann, Gilad; Muradian, Khachik K; Fraifeld, Vadim E
2016-01-04
Mitochondria are the only organelles in the animal cells that have their own genome. Due to a key role in energy production, generation of damaging factors (ROS, heat), and apoptosis, mitochondria and mtDNA in particular have long been considered one of the major players in the mechanisms of aging, longevity and age-related diseases. The rapidly increasing number of species with fully sequenced mtDNA, together with accumulated data on longevity records, provides a new fascinating basis for comparative analysis of the links between mtDNA features and animal longevity. To facilitate such analyses and to support the scientific community in carrying these out, we developed the MitoAge database containing calculated mtDNA compositional features of the entire mitochondrial genome, mtDNA coding (tRNA, rRNA, protein-coding genes) and non-coding (D-loop) regions, and codon usage/amino acids frequency for each protein-coding gene. MitoAge includes 922 species with fully sequenced mtDNA and maximum lifespan records. The database is available through the MitoAge website (www.mitoage.org or www.mitoage.info), which provides the necessary tools for searching, browsing, comparing and downloading the data sets of interest for selected taxonomic groups across the Kingdom Animalia. The MitoAge website assists in statistical analysis of different features of the mtDNA and their correlative links to longevity. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
DNA methylation of miRNA coding sequences putatively associated with childhood obesity.
Mansego, M L; Garcia-Lacarte, M; Milagro, F I; Marti, A; Martinez, J A
2017-02-01
Epigenetic mechanisms may be involved in obesity onset and its consequences. The aim of the present study was to evaluate whether DNA methylation status in microRNA (miRNA) coding regions is associated with childhood obesity. DNA isolated from white blood cells of 24 children (identification sample: 12 obese and 12 non-obese) from the Grupo Navarro de Obesidad Infantil study was hybridized in a 450 K methylation microarray. Several CpGs whose DNA methylation levels were statistically different between obese and non-obese were validated by MassArray® in 95 children (validation sample) from the same study. Microarray analysis identified 16 differentially methylated CpGs between both groups (6 hypermethylated and 10 hypomethylated). DNA methylation levels in miR-1203, miR-412 and miR-216A coding regions significantly correlated with body mass index standard deviation score (BMI-SDS) and explained up to 40% of the variation of BMI-SDS. The network analysis identified 19 well-defined obesity-relevant biological pathways from the KEGG database. MassArray® validation identified three regions located in or near miR-1203, miR-412 and miR-216A coding regions differentially methylated between obese and non-obese children. The current work identified three CpG sites located in coding regions of three miRNAs (miR-1203, miR-412 and miR-216A) that were differentially methylated between obese and non-obese children, suggesting a role of miRNA epigenetic regulation in childhood obesity. © 2016 World Obesity Federation.
Genomics dataset on unclassified published organism (patent US 7547531).
Khan Shawan, Mohammad Mahfuz Ali; Hasan, Md Ashraful; Hossain, Md Mozammel; Hasan, Md Mahmudul; Parvin, Afroza; Akter, Salina; Uddin, Kazi Rasel; Banik, Subrata; Morshed, Mahbubul; Rahman, Md Nazibur; Rahman, S M Badier
2016-12-01
Nucleotide (DNA) sequence analysis provides important clues regarding the characteristics and taxonomic position of an organism. DNA sequence analysis is therefore crucial for learning about the hierarchical classification of that organism. This dataset (patent US 7547531) was chosen to simplify the complex raw data buried in undisclosed DNA sequences, which may help to open doors for new collaborations. A total of 48 unidentified DNA sequences from patent US 7547531 were selected and their complete sequences were retrieved from the NCBI BioSample database. A quick response (QR) code for each of those DNA sequences was constructed with the DNA BarID tool. The QR code is useful for the identification and comparison of isolates with other organisms. The AT/GC content of the DNA sequences, which indicates their stability at different temperatures, was determined using the ENDMEMO GC Content Calculator. The highest GC content was observed in GP445188 (62.5%), followed by GP445198 (61.8%) and GP445189 (59.44%), while the lowest was in GP445178 (24.39%). In addition, the New England BioLabs (NEB) database was used to identify the cleavage code, indicating the 5', 3' and blunt ends, and the enzyme code, indicating the methylation sites of the DNA sequences. These data will be helpful for the construction of the organisms' hierarchical classification, determination of their phylogenetic and taxonomic position and revelation of their molecular characteristics.
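The GC percentages quoted above come from an external calculator whose exact conventions the abstract does not state. As a minimal sketch only, assuming GC content is the share of G and C among unambiguous A/C/G/T bases (the function name `gc_content` is hypothetical and is not part of any tool named in the abstract), the quantity can be computed as:

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among unambiguous A/C/G/T bases.

    Assumption: ambiguous symbols such as N are ignored; the calculator
    used in the abstract may treat them differently.
    """
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


if __name__ == "__main__":
    print(gc_content("ATGCG"))  # 3 of 5 bases are G or C
```

Ignoring ambiguous symbols keeps the percentage comparable between sequences of differing quality, at the cost of hiding how much of a sequence was actually counted.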
de Lange, Orlando; Wolf, Christina; Dietze, Jörn; Elsaesser, Janett; Morbitzer, Robert; Lahaye, Thomas
2014-06-01
The tandem repeats of transcription activator like effectors (TALEs) mediate sequence-specific DNA binding using a simple code. Naturally, TALEs are injected by Xanthomonas bacteria into plant cells to manipulate the host transcriptome. In the laboratory TALE DNA binding domains are reprogrammed and used to target a fused functional domain to a genomic locus of choice. Research into the natural diversity of TALE-like proteins may provide resources for the further improvement of current TALE technology. Here we describe TALE-like proteins from the endosymbiotic bacterium Burkholderia rhizoxinica, termed Bat proteins. Bat repeat domains mediate sequence-specific DNA binding with the same code as TALEs, despite less than 40% sequence identity. We show that Bat proteins can be adapted for use as transcription factors and nucleases and that sequence preferences can be reprogrammed. Unlike TALEs, the core repeats of each Bat protein are highly polymorphic. This feature allowed us to explore alternative strategies for the design of custom Bat repeat arrays, providing novel insights into the functional relevance of non-RVD residues. The Bat proteins offer fertile grounds for research into the creation of improved programmable DNA-binding proteins and comparative insights into TALE-like evolution. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Transposable elements and G-quadruplexes.
Kejnovsky, Eduard; Tokan, Viktor; Lexa, Matej
2015-09-01
A significant part of eukaryotic genomes is formed by transposable elements (TEs) containing not only genes but also regulatory sequences. Some of the regulatory sequences located within TEs can form secondary structures like hairpins or three-stranded (triplex DNA) and four-stranded (quadruplex DNA) conformations. This review focuses on recent evidence showing that G-quadruplex-forming sequences in particular are often present in specific parts of TEs in plants and humans. We discuss the potential role of these structures in the TE life cycle as well as the impact of G-quadruplexes on replication, transcription, translation, chromatin status, and recombination. The aim of this review is to emphasize that TEs may serve as vehicles for the genomic spread of G-quadruplexes. These non-canonical DNA structures and their conformational switches may constitute another regulatory system that, together with small and long non-coding RNA molecules and proteins, contributes to the complex cellular network resulting in the large diversity of eukaryotes.
Baxter, Laura L; Hsu, Benjamin J; Umayam, Lowell; Wolfsberg, Tyra G; Larson, Denise M; Frith, Martin C; Kawai, Jun; Hayashizaki, Yoshihide; Carninci, Piero; Pavan, William J
2007-06-01
As part of the RIKEN mouse encyclopedia project, two cDNA libraries were prepared from melanocyte-derived cell lines, using techniques of full-length clone selection and subtraction/normalization to enrich for rare transcripts. End sequencing showed that these libraries display over 83% complete coding sequence at the 5' end and 96-97% complete coding sequence at the 3' end. Evaluation of the libraries, derived from B16F10Y tumor cells and melan-c cells, revealed that they contain clones for a majority of the genes previously demonstrated to function in melanocyte biology. Analysis of genomic locations for transcripts revealed that the distribution of melanocyte genes is non-random throughout the genome. Three genomic regions identified that showed significant clustering of melanocyte-expressed genes contain one or more genes previously shown to regulate melanocyte development or function. A catalog of genes expressed in these libraries is presented, providing a valuable resource of cDNA clones and sequence information that can be used for identification of new genes important for melanocyte development, function, and disease.
NASA Astrophysics Data System (ADS)
Mackiewicz, P.; Gierlik, A.; Kowalczuk, M.; Szczepanik, D.; Dudek, M. R.; Cebrat, S.
1999-12-01
We have analysed protein coding and intergenic sequences in the Borrelia burgdorferi (the Lyme disease bacterium) genome using different kinds of DNA walks. Genes occupying the leading strand of DNA have significantly different nucleotide composition from genes occupying the lagging strand. Nucleotide compositional bias of the two DNA strands reflects the amino acid composition of proteins. 96% of genes coding for ribosomal proteins lie on the leading DNA strand, which suggests that the positions of these as well as other genes are non-random. In the B. burgdorferi genome, the asymmetry in intergenic DNA sequences is lower than the asymmetry in the third positions in codons. All these characteristics of the B. burgdorferi genome suggest that both replication-associated mutational pressure and recombination mechanisms have established the specific structure of the genome, and that any recombination leading to inversion of a gene with respect to the direction of replication is now forbidden. This property of the genome allows us to assume that it is in a steady state, which enables us to fix some parameters for simulations of DNA evolution.
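The abstract does not say which DNA walks were used, so the following is only an illustrative sketch of the general idea: each base contributes a step, and the running sum traces a path whose drift reflects compositional bias. The function `dna_walk` and the step table `GC_SKEW_STEPS` are hypothetical names chosen for this example, and the G versus C step rule is an assumption rather than the authors' definition.

```python
from itertools import accumulate

# Assumed step rule for illustration: G moves up, C moves down, A/T stay level.
GC_SKEW_STEPS = {"G": 1, "C": -1}


def dna_walk(seq, steps):
    """Return the cumulative walk of a sequence under a base-to-step mapping.

    Bases missing from `steps` contribute a step of 0.
    """
    return list(accumulate(steps.get(base, 0) for base in seq.upper()))


if __name__ == "__main__":
    print(dna_walk("GGCAT", GC_SKEW_STEPS))  # [1, 2, 1, 1, 1]
```

A walk that trends steadily upward or downward over a region indicates a persistent excess of one base class there, which is the kind of strand asymmetry the abstract discusses.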
Identification of common, unique and polymorphic microsatellites among 73 cyanobacterial genomes.
Kabra, Ritika; Kapil, Aditi; Attarwala, Kherunnisa; Rai, Piyush Kant; Shanker, Asheesh
2016-04-01
Microsatellites, also known as Simple Sequence Repeats, are short tandem repeats of 1-6 nucleotides. These repeats are found in coding as well as non-coding regions of both prokaryotic and eukaryotic genomes and play a significant role in the study of gene regulation, genetic mapping, DNA fingerprinting and evolutionary studies. The availability of 73 complete genome sequences of cyanobacteria enabled us to mine and statistically analyze microsatellites in these genomes. The cyanobacterial microsatellites identified through bioinformatics analysis were stored in a user-friendly database named CyanoSat, which is an efficient data representation and query system designed using ASP.net. The information in CyanoSat comprises perfect, imperfect and compound microsatellites found in coding, non-coding and coding-non-coding regions. Moreover, it contains PCR primers with 200-nucleotide-long flanking regions. The mined cyanobacterial microsatellites can be freely accessed at www.compubio.in/CyanoSat/home.aspx. In addition, 82 polymorphic, 13,866 unique and 2390 common microsatellites were detected. These microsatellites will be useful in strain identification and genetic diversity studies of cyanobacteria.
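The abstract does not describe the mining pipeline, so the sketch below is not the CyanoSat method. It only illustrates the definition given above (tandem repeats of a 1-6 nucleotide unit) for perfect repeats, using a regular expression with a backreference. The function name `find_perfect_ssrs` and the default threshold of three repeats are assumptions made for this example.

```python
import re


def find_perfect_ssrs(seq, min_repeats=3, max_unit=6):
    """Find perfect tandem repeats of a 1..max_unit nucleotide unit.

    Returns (start, unit, repeat_count) tuples with 0-based start positions.
    The lazy quantifier makes the regex try the shortest unit first, so
    'ACACACAC' is reported as 4 x 'AC' rather than 2 x 'ACAC'.
    """
    seq = seq.upper()
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    print(find_perfect_ssrs("TTACACACACGG"))  # [(2, 'AC', 4)]
```

Imperfect and compound microsatellites, which the database also stores, need mismatch-tolerant matching and merging of adjacent hits and are outside the scope of this sketch.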
DNABIT Compress – Genome compression algorithm
Rajarajeswari, Pothuraju; Apparao, Allam
2011-01-01
Data compression is concerned with how information is organized in data. Efficient storage means removal of redundancy from the data being stored in the DNA molecule. Data compression algorithms remove redundancy and are used to understand biologically important molecules. We present a compression algorithm, “DNABIT Compress”, for DNA sequences, based on a novel scheme of assigning binary bits to smaller segments of DNA bases in order to compress both repetitive and non-repetitive DNA sequences. Our proposed algorithm achieves the best compression ratio for DNA sequences of larger genomes. Significantly better compression results show that the “DNABIT Compress” algorithm is the best among the compared compression algorithms. While achieving the best compression ratios for DNA sequences (genomes), our new DNABIT Compress algorithm significantly improves on the running time of all previous DNA compression programs. Assigning binary bits (a unique bit code) to exact-repeat and reverse-repeat fragments of a DNA sequence is also a concept introduced in this algorithm for the first time in DNA compression. The proposed algorithm achieves a compression ratio as low as 1.58 bits/base, whereas the best existing methods could not achieve a ratio below 1.72 bits/base. PMID:21383923
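The abstract does not give the actual bit assignments of DNABIT Compress, so the following is not that algorithm. It is a sketch of the naive fixed-width baseline of 2 bits per base, which is the reference point for reading figures such as 1.58 or 1.72 bits/base: any method below 2.0 is exploiting redundancy that this packing ignores. The names `pack`, `unpack` and the particular code table are assumptions made for illustration.

```python
# Assumed fixed 2-bit code; any one-to-one assignment of the four bases works.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into bytes at 2 bits per base.

    Returns (length, data); the length is needed to undo the padding.
    """
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    nbytes = (2 * len(seq) + 7) // 8
    return len(seq), value.to_bytes(nbytes, "big")


def unpack(length, data):
    """Inverse of pack: recover the base string from (length, data)."""
    value = int.from_bytes(data, "big")
    bases = []
    for i in range(length):
        shift = 2 * (length - 1 - i)
        bases.append(BASES[(value >> shift) & 0b11])
    return "".join(bases)


if __name__ == "__main__":
    n, blob = pack("GATTACA")
    print(n, blob.hex(), unpack(n, blob))
```

Building one large integer keeps the sketch short but is slow for genome-sized inputs; a practical implementation would write into a bytearray in fixed-size chunks.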
Jiang, Jiming
2015-04-01
Sequencing of complete plant genomes has become increasingly routine since the advent of next-generation sequencing technology. Identification and annotation of large amounts of noncoding but functional DNA sequences, including cis-regulatory DNA elements (CREs), have become a new frontier in plant genome research. Genomic regions containing active CREs bound to regulatory proteins are hypersensitive to DNase I digestion and are called DNase I hypersensitive sites (DHSs). Several recent DHS studies in plants illustrate that DHS datasets produced by DNase I digestion followed by next-generation sequencing (DNase-seq) are highly valuable for the identification and characterization of CREs associated with plant development and responses to environmental cues. DHS-based genomic profiling has opened a door to identify and annotate the 'dark matter' in sequenced plant genomes. Copyright © 2015 Elsevier Ltd. All rights reserved.
DNA-based watermarks using the DNA-Crypt algorithm.
Heider, Dominik; Barnekow, Angelika
2007-05-29
The aim of this paper is to demonstrate the application of watermarks based on DNA sequences to identify the unauthorized use of genetically modified organisms (GMOs) protected by patents. Predicted mutations in the genome can be corrected by the DNA-Crypt program, leaving the encrypted information intact. Existing DNA cryptographic and steganographic algorithms use synthetic DNA sequences to store binary information; however, although these sequences can be used for authentication, they may change the target DNA sequence when introduced into living organisms. The DNA-Crypt algorithm and image steganography are based on the same watermark-hiding principle, namely using the least significant base in the case of DNA-Crypt and the least significant bit in the case of image steganography. It can be combined with binary encryption algorithms like AES, RSA or Blowfish. DNA-Crypt is able to correct mutations in the target DNA with several mutation correction codes such as the Hamming-code or the WDH-code. Mutations, which occur infrequently, may destroy the encrypted information; however, an integrated fuzzy controller decides on a set of heuristics based on three input dimensions, and recommends whether or not to use a correction code. These three input dimensions are the length of the sequence, the individual mutation rate and the stability over time, which is represented by the number of generations. In silico experiments using Ypt7 in Saccharomyces cerevisiae show that the DNA watermarks produced by DNA-Crypt do not alter the translation of mRNA into protein. The program is able to store watermarks in living organisms and can maintain the original information by correcting mutations itself. Pairwise or multiple sequence alignments show that DNA-Crypt produces few mismatches between the sequences, similar to all steganographic algorithms.
DNA-based watermarks using the DNA-Crypt algorithm
Heider, Dominik; Barnekow, Angelika
2007-01-01
Background: The aim of this paper is to demonstrate the application of watermarks based on DNA sequences to identify the unauthorized use of genetically modified organisms (GMOs) protected by patents. Predicted mutations in the genome can be corrected by the DNA-Crypt program, leaving the encrypted information intact. Existing DNA cryptographic and steganographic algorithms use synthetic DNA sequences to store binary information; however, although these sequences can be used for authentication, they may change the target DNA sequence when introduced into living organisms. Results: The DNA-Crypt algorithm and image steganography are based on the same watermark-hiding principle, namely using the least significant base in the case of DNA-Crypt and the least significant bit in the case of image steganography. It can be combined with binary encryption algorithms like AES, RSA or Blowfish. DNA-Crypt is able to correct mutations in the target DNA with several mutation correction codes such as the Hamming-code or the WDH-code. Mutations, which occur infrequently, may destroy the encrypted information; however, an integrated fuzzy controller decides on a set of heuristics based on three input dimensions, and recommends whether or not to use a correction code. These three input dimensions are the length of the sequence, the individual mutation rate and the stability over time, which is represented by the number of generations. In silico experiments using Ypt7 in Saccharomyces cerevisiae show that the DNA watermarks produced by DNA-Crypt do not alter the translation of mRNA into protein. Conclusion: The program is able to store watermarks in living organisms and can maintain the original information by correcting mutations itself. Pairwise or multiple sequence alignments show that DNA-Crypt produces few mismatches between the sequences, similar to all steganographic algorithms. PMID:17535434
Is a Genome a Codeword of an Error-Correcting Code?
Kleinschmidt, João H.; Silva-Filho, Márcio C.; Bim, Edson; Herai, Roberto H.; Yamagishi, Michel E. B.; Palazzo, Reginaldo
2012-01-01
Since a genome is a discrete sequence, the elements of which belong to a set of four letters, the question as to whether or not there is an error-correcting code underlying DNA sequences is unavoidable. The most common approach to answering this question is to propose a methodology to verify the existence of such a code. However, none of the methodologies proposed so far, although quite clever, has achieved that goal. In a recent work, we showed that DNA sequences can be identified as codewords in a class of cyclic error-correcting codes known as Hamming codes. In this paper, we show that a complete intron-exon gene, and even a plasmid genome, can be identified as a Hamming code codeword as well. Although this does not constitute a definitive proof that there is an error-correcting code underlying DNA sequences, it is the first evidence in this direction. PMID:22649495
Xie, Guosen; Mo, Zhongxi
2011-01-21
In this article, we introduce three 3D graphical representations of DNA primary sequences, which we call RY-curve, MK-curve and SW-curve, based on three classifications of the DNA bases. The advantages of our representations are that (i) these 3D curves are strictly non-degenerate and there is no loss of information when transferring a DNA sequence to its mathematical representation and (ii) the coordinates of every node on these 3D curves have clear biological implication. Two applications of these 3D curves are presented: (a) a simple formula is derived to calculate the content of the four bases (A, G, C and T) from the coordinates of nodes on the curves; and (b) a 12-component characteristic vector is constructed to compare similarity among DNA sequences from different species based on the geometrical centers of the 3D curves. As examples, we examine similarity among the coding sequences of the first exon of beta-globin gene from eleven species and validate similarity of cDNA sequences of beta-globin gene from eight species. Copyright © 2010 Elsevier Ltd. All rights reserved.
Living Organisms Author Their Read-Write Genomes in Evolution.
Shapiro, James A
2017-12-06
Evolutionary variations generating phenotypic adaptations and novel taxa resulted from complex cellular activities altering genome content and expression: (i) Symbiogenetic cell mergers producing the mitochondrion-bearing ancestor of eukaryotes and chloroplast-bearing ancestors of photosynthetic eukaryotes; (ii) interspecific hybridizations and genome doublings generating new species and adaptive radiations of higher plants and animals; and, (iii) interspecific horizontal DNA transfer encoding virtually all of the cellular functions between organisms and their viruses in all domains of life. Consequently, assuming that evolutionary processes occur in isolated genomes of individual species has become an unrealistic abstraction. Adaptive variations also involved natural genetic engineering of mobile DNA elements to rewire regulatory networks. In the most highly evolved organisms, biological complexity scales with "non-coding" DNA content more closely than with protein-coding capacity. Coincidentally, we have learned how so-called "non-coding" RNAs that are rich in repetitive mobile DNA sequences are key regulators of complex phenotypes. Both biotic and abiotic ecological challenges serve as triggers for episodes of elevated genome change. The intersections of cell activities, biosphere interactions, horizontal DNA transfers, and non-random Read-Write genome modifications by natural genetic engineering provide a rich molecular and biological foundation for understanding how ecological disruptions can stimulate productive, often abrupt, evolutionary transformations.
Sikorav, J L; Duval, N; Anselmet, A; Bon, S; Krejci, E; Legay, C; Osterlund, M; Reimund, B; Massoulié, J
1988-01-01
In this paper, we show the existence of alternative splicing in the 3' region of the coding sequence of Torpedo acetylcholinesterase (AChE). We describe two cDNA structures which both diverge from the previously described coding sequence of the catalytic subunit of asymmetric (A) forms (Schumacher et al., 1986; Sikorav et al., 1987). They both contain a coding sequence followed by a non-coding sequence and a poly(A) stretch. Both of these structures were shown to exist in poly(A)+ RNAs, by S1 mapping experiments. The divergent region encoded by the first sequence corresponds to the precursor of the globular dimeric form (G2a), since it contains the expected C-terminal amino acids, Ala-Cys. These amino acids are followed by a 29 amino acid extension which contains a hydrophobic segment and must be replaced by a glycolipid in the mature protein. Analyses of intact G2a AChE showed that the common domain of the protein contains intersubunit disulphide bonds. The divergent region of the second type of cDNA consists of an adjacent genomic sequence, which is removed as an intron in A and Ga mRNAs, but may encode a distinct, less abundant catalytic subunit. The structures of the cDNA clones indicate that they are derived from minor mRNAs, shorter than the three major transcripts which have been described previously (14.5, 10.5 and 5.5 kb). Oligonucleotide probes specific for the asymmetric and globular terminal regions hybridize with the three major transcripts, indicating that their size is determined by 3'-untranslated regions which are not related to the differential splicing leading to A and Ga forms. PMID:3181125
Krzeminska, Urszula; Wilson, Robyn; Rahman, Sadequr; Song, Beng Kah; Seneviratne, Sampath; Gan, Han Ming; Austin, Christopher M
2016-07-01
The complete mitochondrial genomes of two jungle crows (Corvus macrorhynchos) were sequenced. DNA was extracted from tissue samples obtained from shed feathers collected in the field in Sri Lanka and sequenced using the Illumina MiSeq Personal Sequencer. Jungle crow mitogenomes have a structural organization typical of the genus Corvus and are 16,927 bp and 17,066 bp in length, both comprising 13 protein-coding genes, 22 transfer RNA genes, 2 ribosomal subunit genes, and a non-coding control region. In addition, we complement already available house crow (Corvus splendens) mitogenome resources by sequencing an individual from Singapore. A phylogenetic tree constructed from Corvidae family mitogenome sequences available on GenBank is presented. We confirm the monophyly of the genus Corvus and propose to use complete mitogenome resources for further intra- and interspecies genetic studies.
Regulatory sequence analysis tools.
van Helden, Jacques
2003-07-01
The web resource Regulatory Sequence Analysis Tools (RSAT) (http://rsat.ulb.ac.be/rsat) offers a collection of software tools dedicated to the prediction of regulatory sites in non-coding DNA sequences. These tools include sequence retrieval, pattern discovery, pattern matching, genome-scale pattern matching, feature-map drawing, random sequence generation and other utilities. Alternative formats are supported for the representation of regulatory motifs (strings or position-specific scoring matrices) and several algorithms are proposed for pattern discovery. RSAT currently holds >100 fully sequenced genomes and these data are regularly updated from GenBank.
Yin, Changchuan
2015-04-01
To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences first must be specially mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal to noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction using a Short-Time Fourier Transform (STFT) approach. The results show the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.
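As a rough illustration of the 3-periodicity signal mentioned in the abstract above, the following sketch computes a period-3 signal-to-noise ratio using the four-indicator binary mapping, not the GCC mapping, whose optimized values are not given here. The function names and the SNR definition (power at frequency index N/3 divided by the mean power over all non-zero frequencies) are assumptions made for illustration, not the author's implementation.

```python
import cmath


def binary_indicators(seq):
    """Map a DNA string to four 0/1 indicator sequences, one per base."""
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in "ACGT"}


def dft_power(x, k):
    """Power |X[k]|^2 of the length-N discrete Fourier transform of x."""
    n = len(x)
    coeff = sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))
    return abs(coeff) ** 2


def period3_snr(seq):
    """Total power at k = N/3 divided by the mean total power over k = 1..N-1."""
    n = len(seq)
    if n == 0 or n % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    indicators = binary_indicators(seq)

    def total_power(k):
        # Sum the spectra of the four indicator sequences.
        return sum(dft_power(x, k) for x in indicators.values())

    peak = total_power(n // 3)
    mean = sum(total_power(k) for k in range(1, n)) / (n - 1)
    return peak / mean
```

A strictly codon-periodic toy sequence such as `"ATG" * 20` gives a large ratio, while a sequence with period 4 such as `"ATGC" * 15` gives a ratio near zero. The direct DFT used here is quadratic in sequence length, so an FFT would be preferable for real genomic windows.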
RNA Editing in Plant Mitochondria
NASA Astrophysics Data System (ADS)
Hiesel, Rudolf; Wissinger, Bernd; Schuster, Wolfgang; Brennicke, Axel
1989-12-01
Comparative sequence analysis of genomic and complementary DNA clones from several mitochondrial genes in the higher plant Oenothera revealed nucleotide sequence divergences between the genomic and the messenger RNA-derived sequences. These sequence alterations could be most easily explained by specific post-transcriptional nucleotide modifications. Most of the nucleotide exchanges in coding regions lead to altered codons in the mRNA that specify amino acids better conserved in evolution than those encoded by the genomic DNA. Several instances show that the genomic arginine codon CGG is edited in the mRNA to the tryptophan codon TGG in amino acid positions that are highly conserved as tryptophan in the homologous proteins of other species. This editing suggests that the standard genetic code is used in plant mitochondria and resolves the frequent coincidence of CGG codons and tryptophan in different plant species. The apparently frequent and non-species-specific equivalency of CGG and TGG codons in particular suggests that RNA editing is a common feature of all higher plant mitochondria.
Decoding DNA labels by melting curve analysis using real-time PCR.
Balog, József A; Fehér, Liliána Z; Puskás, László G
2017-12-01
Synthetic DNA has been used as an authentication code for a diverse range of applications. However, existing decoding approaches are based on either DNA sequencing or the determination of DNA length variations. Here, we present a simple alternative protocol for labeling different objects using a small number of short DNA sequences that differ in their melting points. Code amplification and decoding can be done in two steps using quantitative PCR (qPCR). To obtain a DNA barcode with high complexity, we defined 8 template groups, each having 4 different DNA templates, yielding 15^8 (>2.5 billion) combinations of different individual melting temperature (Tm) values and corresponding ID codes. The reproducibility and specificity of the decoding was confirmed by using the most complex template mixture, which had 32 different products in 8 groups with different Tm values. The industrial applicability of our protocol was also demonstrated by labeling a drone with an oil-based paint containing a predefined DNA code, which was then successfully decoded. The method presented here consists of a simple code system based on a small number of synthetic DNA sequences and a cost-effective, rapid decoding protocol using a few qPCR reactions, enabling a wide range of authentication applications.
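The size of the code space quoted above can be checked with a few lines of arithmetic. The sketch below assumes that each of the 8 groups contributes any non-empty subset of its 4 templates (2^4 - 1 = 15 states per group); this reading of the figure, and the helper name `encode_id`, are assumptions for illustration rather than details taken from the paper.

```python
GROUPS = 8
TEMPLATES_PER_GROUP = 4

# Assumed reading: a group is "on" in one of the 15 non-empty template subsets.
states_per_group = 2 ** TEMPLATES_PER_GROUP - 1
total_codes = states_per_group ** GROUPS  # 15 ** 8


def encode_id(code_number):
    """Split an integer ID into one template bitmask (1..15) per group."""
    if not 0 <= code_number < total_codes:
        raise ValueError("code_number out of range")
    masks = []
    for _ in range(GROUPS):
        code_number, digit = divmod(code_number, states_per_group)
        masks.append(digit + 1)  # shift 0..14 to 1..15 so no group is empty
    return masks
```

Under this assumption `total_codes` is 2,562,890,625, consistent with the ">2.5 billion" stated in the abstract, and each bit of a mask indicates whether the corresponding template (and hence its Tm peak) is present in that group.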
What Information is Stored in DNA: Does it Contain Digital Error Correcting Codes?
NASA Astrophysics Data System (ADS)
Liebovitch, Larry
1998-03-01
The longest term correlations in living systems are the information stored in DNA which reflects the evolutionary history of an organism. The 4 bases (A,T,G,C) encode sequences of amino acids as well as locations of binding sites for proteins that regulate DNA. The fidelity of this important information is maintained by ANALOG error check mechanisms. When a single strand of DNA is replicated, the complementary base is inserted in the new strand. Sometimes the wrong base is inserted, which sticks out and disrupts the phosphate backbone. The new base is not yet methylated, so repair enzymes that slide along the DNA can tear out the wrong base and replace it with the right one. The bases in DNA form a sequence of 4 different symbols and so the information is encoded in a DIGITAL form. All the digital codes in our society (ISBN book numbers, UPC product codes, bank account numbers, airline ticket numbers) use error-checking codes, where some digits are functions of other digits to maintain the fidelity of transmitted information. Does DNA also utilize a DIGITAL error-checking code to maintain the fidelity of its information and increase the accuracy of replication? That is, are some bases in DNA functions of other bases upstream or downstream? This raises the interesting mathematical problem: How does one determine whether some symbols in a sequence of symbols are a function of other symbols? It also bears on the issue of determining algorithmic complexity: What is the function that generates the shortest algorithm for reproducing the symbol sequence? The error-checking codes most used in our technology are linear block codes. We developed an efficient method to test for the presence of such codes in DNA. We coded the 4 bases as (0,1,2,3) and used Gaussian elimination, modified for modulus 4, to test if some bases are linear combinations of other bases. We used this method to analyze the base sequence in the genes from the lac operon and cytochrome C.
We did not find evidence for such error correcting codes in these genes. However, we analyzed only a small amount of DNA and if digital error correcting schemes are present in DNA, they may be more subtle than such simple linear block codes. The basic issue we raise here is how information is stored in DNA and an appreciation that digital symbol sequences, such as DNA, admit of interesting schemes to store and protect the fidelity of their information content. Liebovitch, Tao, Todorov, Levine. 1996. Biophys. J. 71:1539-1544. Supported by NIH grant EY6234.
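To make the question in the abstract above concrete, the sketch below tests whether the last base of every fixed-length block is a linear combination, modulo 4, of the preceding bases. It uses a brute-force search over coefficient vectors rather than the modified Gaussian elimination described by the authors, and the base-to-integer labelling, block layout, and function names are all assumptions chosen for illustration.

```python
from itertools import product

# Assumed labelling; the abstract only says the bases were coded as (0,1,2,3).
BASE_TO_INT = {"A": 0, "T": 1, "G": 2, "C": 3}


def to_blocks(seq, block_len):
    """Cut a DNA string into consecutive non-overlapping blocks of integers 0..3."""
    nums = [BASE_TO_INT[b] for b in seq.upper()]
    usable = len(nums) - len(nums) % block_len
    return [nums[i:i + block_len] for i in range(0, usable, block_len)]


def find_linear_check(blocks):
    """Return coefficients c with last == sum(c[i] * block[i]) mod 4 for every block, else None."""
    k = len(blocks[0]) - 1
    for coeffs in product(range(4), repeat=k):
        if all(
            sum(c * x for c, x in zip(coeffs, blk[:k])) % 4 == blk[k]
            for blk in blocks
        ):
            return coeffs
    return None
```

For example, `find_linear_check(to_blocks("TAATATAGAATCTTTG", 4))` recovers the coefficients `(1, 2, 3)` that were used to build that toy sequence, while a block such as `"AAAT"` admits no such relation. The search cost grows as 4^k, so this is only practical for short blocks.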
Cellulases and coding sequences
Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong
2001-02-20
The present invention provides three fungal cellulases, their coding sequences, recombinant DNA molecules comprising the cellulase coding sequences, recombinant host cells and methods for producing same. The present cellulases are from Orpinomyces PC-2.
Cellulases and coding sequences
Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong
2001-01-01
The present invention provides three fungal cellulases, their coding sequences, recombinant DNA molecules comprising the cellulase coding sequences, recombinant host cells and methods for producing same. The present cellulases are from Orpinomyces PC-2.
Applications of statistical physics and information theory to the analysis of DNA sequences
NASA Astrophysics Data System (ADS)
Grosse, Ivo
2000-10-01
DNA carries the genetic information of most living organisms, and the goal of genome projects is to uncover that genetic information. One basic task in the analysis of DNA sequences is the recognition of protein coding genes. Powerful computer programs for gene recognition have been developed, but most of them are based on statistical patterns that vary from species to species. In this thesis I address the question of whether there exist universal statistical patterns that are different in coding and noncoding DNA of all living species, regardless of their phylogenetic origin. In search for such species-independent patterns I study the mutual information function of genomic DNA sequences, and find that it shows persistent period-three oscillations. To understand the biological origin of the observed period-three oscillations, I compare the mutual information function of genomic DNA sequences to the mutual information function of stochastic model sequences. I find that the pseudo-exon model is able to reproduce the mutual information function of genomic DNA sequences. Moreover, I find that a generalization of the pseudo-exon model can connect the existence and the functional form of long-range correlations to the presence and the length distributions of coding and noncoding regions. Based on these theoretical studies I am able to find an information-theoretical quantity, the average mutual information (AMI), whose probability distributions are significantly different in coding and noncoding DNA, while they are almost identical in all studied species. These findings show that there exist universal statistical patterns that are different in coding and noncoding DNA of all studied species, and they suggest that the AMI may be used to identify genes in different living species, irrespective of their taxonomic origin.
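The mutual information function referred to above can be estimated directly from pair counts. The sketch below is a minimal plug-in estimator, assuming the standard definition I(k) = sum over base pairs (a, b) of p_ab(k) log2[p_ab(k) / (p_a p_b)], where p_ab(k) is the frequency of base a followed by base b at distance k; the function name is hypothetical and no finite-sample bias correction is applied.

```python
from collections import Counter
from math import log2


def mutual_information(seq, k):
    """Plug-in estimate (in bits) of the mutual information between bases k positions apart."""
    seq = seq.upper()
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    if n == 0:
        raise ValueError("k must be smaller than the sequence length")
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)   # marginal of the first base
    right = Counter(b for _, b in pairs)  # marginal of the second base
    return sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )
```

Evaluating this function for k = 1, 2, 3, ... on a coding sequence is one way to look for the period-three oscillations described in the abstract; on short sequences the estimate is biased upward, so comparisons should use windows of equal length.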
Hiding message into DNA sequence through DNA coding and chaotic maps.
Liu, Guoyan; Liu, Hongjun; Kadir, Abdurahman
2014-09-01
The paper proposes an improved reversible substitution method to hide data in a deoxyribonucleic acid (DNA) sequence. Four measures are taken to enhance robustness and enlarge the hiding capacity: encoding the secret message by DNA coding, encrypting it with a pseudo-random sequence, generating the relative hiding locations with a piecewise linear chaotic map, and embedding the encoded and encrypted message into a randomly selected DNA sequence using the complementary rule. The key space and the hiding capacity are analyzed. Experimental results indicate that the proposed method performs better than the competing methods with respect to robustness and capacity.
DNA as a Binary Code: How the Physical Structure of Nucleotide Bases Carries Information
ERIC Educational Resources Information Center
McCallister, Gary
2005-01-01
The DNA triplet code also functions as a binary code. Because double-ring compounds cannot bind to double-ring compounds in the DNA code, the sequence of bases classified simply as purines or pyrimidines can encode for smaller groups of possible amino acids. This is an intuitive approach to teaching the DNA code. (Contains 6 figures.)
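The purine/pyrimidine view described above can be sketched in a few lines. The assignment of 1 to purines and 0 to pyrimidines is arbitrary, and the names below are chosen only for this illustration.

```python
PURINES = {"A", "G"}      # double-ring bases
PYRIMIDINES = {"C", "T"}  # single-ring bases
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_binary(seq):
    """Encode purines as '1' and pyrimidines as '0'."""
    bits = []
    for base in seq.upper():
        if base in PURINES:
            bits.append("1")
        elif base in PYRIMIDINES:
            bits.append("0")
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return "".join(bits)
```

Because a purine always pairs with a pyrimidine, the binary string of the complementary strand is the bitwise inverse of the original: `to_binary("ATGC")` is `"1010"`, and the complement `"TACG"` encodes as `"0101"`.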
On fuzzy semantic similarity measure for DNA coding.
Ahmad, Muneer; Jung, Low Tang; Bhuiyan, Md Al-Amin
2016-02-01
A coding measure scheme numerically translates the DNA sequence to a time domain signal for protein coding regions identification. A number of coding measure schemes based on numerology, geometry, fixed mapping, statistical characteristics and chemical attributes of nucleotides have been proposed in recent decades. Such coding measure schemes lack the biologically meaningful aspects of nucleotide data and hence do not significantly discriminate coding regions from non-coding regions. This paper presents a novel fuzzy semantic similarity measure (FSSM) coding scheme centering on FSSM codons' clustering and genetic code context of nucleotides. Certain natural characteristics of nucleotides i.e. appearance as a unique combination of triplets, preserving special structure and occurrence, and ability to own and share density distributions in codons have been exploited in FSSM. The nucleotides' fuzzy behaviors, semantic similarities and defuzzification based on the center of gravity of nucleotides revealed a strong correlation between nucleotides in codons. The proposed FSSM coding scheme attains a significant enhancement in coding regions identification i.e. 36-133% as compared to other existing coding measure schemes tested over more than 250 benchmarked and randomly taken DNA datasets of different organisms.
Conserved Non-Coding Sequences are Associated with Rates of mRNA Decay in Arabidopsis.
Spangler, Jacob B; Feltus, Frank Alex
2013-01-01
Steady-state mRNA levels are tightly regulated through a combination of transcriptional and post-transcriptional control mechanisms. The discovery of cis-acting DNA elements that encode these control mechanisms is of high importance. We have investigated the influence of conserved non-coding sequences (CNSs), DNA patterns retained after an ancient whole genome duplication event, on the breadth of gene expression and the rates of mRNA decay in Arabidopsis thaliana. The absence of CNSs near α duplicate genes was associated with a decrease in breadth of gene expression and slower mRNA decay rates while the presence of CNSs near α duplicates was associated with an increase in breadth of gene expression and faster mRNA decay rates. The observed difference in mRNA decay rate was fastest in genes with CNSs in both non-transcribed and transcribed regions, albeit through an unknown mechanism. This study supports the notion that some Arabidopsis CNSs regulate the steady-state mRNA levels through post-transcriptional control mechanisms and that CNSs also play a role in controlling the breadth of gene expression.
Conserved Non-Coding Sequences are Associated with Rates of mRNA Decay in Arabidopsis
Spangler, Jacob B.; Feltus, Frank Alex
2013-01-01
Steady-state mRNA levels are tightly regulated through a combination of transcriptional and post-transcriptional control mechanisms. The discovery of cis-acting DNA elements that encode these control mechanisms is of high importance. We have investigated the influence of conserved non-coding sequences (CNSs), DNA patterns retained after an ancient whole genome duplication event, on the breadth of gene expression and the rates of mRNA decay in Arabidopsis thaliana. The absence of CNSs near α duplicate genes was associated with a decrease in breadth of gene expression and slower mRNA decay rates while the presence of CNSs near α duplicates was associated with an increase in breadth of gene expression and faster mRNA decay rates. The observed difference in mRNA decay rate was fastest in genes with CNSs in both non-transcribed and transcribed regions, albeit through an unknown mechanism. This study supports the notion that some Arabidopsis CNSs regulate the steady-state mRNA levels through post-transcriptional control mechanisms and that CNSs also play a role in controlling the breadth of gene expression. PMID:23675377
Noncoding sequence classification based on wavelet transform analysis: part I
NASA Astrophysics Data System (ADS)
Paredes, O.; Strojnik, M.; Romo-Vázquez, R.; Vélez Pérez, H.; Ranta, R.; Garcia-Torales, G.; Scholl, M. K.; Morales, J. A.
2017-09-01
DNA sequences in the human genome can be divided into coding and noncoding ones. Coding sequences are those that are read during transcription. The identification of coding sequences has been widely reported in the literature due to their much-studied periodicity. Noncoding sequences represent the majority of the human genome. They play an important role in gene regulation and differentiation among the cells. However, noncoding sequences do not exhibit periodicities that correlate to their functions. The ENCODE (Encyclopedia of DNA Elements) and Epigenomic Roadmap projects have cataloged the human noncoding sequences according to specific functions. We study characteristics of noncoding sequences with wavelet analysis of genomic signals.
Oh, Chang Seok; Lee, Soong Deok; Kim, Yi-Suk; Shin, Dong Hoon
2015-01-01
A previous study showed that East Asian mtDNA haplogroups, especially those of Koreans, could be successfully assigned by the coupled use of analyses on coding region SNP markers and control region mutation motifs. In this study, we tried to see if the same triple multiplex analysis for coding region SNPs could also be applicable to ancient samples from East Asia as a complement to sequence analysis of the mtDNA control region. From the study of Joseon skeleton samples, we know that the mtDNA haplogroup determined by coding region SNP markers successfully falls within the same haplogroup that sequence analysis of the control region can assign. Considering that ancient samples in previous studies show no small number of errors in control region mtDNA sequencing, coding region SNP analysis can be used as a good complement to conventional haplogroup determination, especially for archaeological human bone samples buried underground over long periods. PMID:26345190
Prevalence of transcription promoters within archaeal operons and coding sequences
Koide, Tie; Reiss, David J; Bare, J Christopher; Pang, Wyming Lee; Facciotti, Marc T; Schmid, Amy K; Pan, Min; Marzolf, Bruz; Van, Phu T; Lo, Fang-Yin; Pratap, Abhishek; Deutsch, Eric W; Peterson, Amelia; Martin, Dan; Baliga, Nitin S
2009-01-01
Despite the knowledge of complex prokaryotic-transcription mechanisms, generalized rules, such as the simplified organization of genes into operons with well-defined promoters and terminators, have had a significant role in systems analysis of regulatory logic in both bacteria and archaea. Here, we have investigated the prevalence of alternate regulatory mechanisms through genome-wide characterization of transcript structures of ∼64% of all genes, including putative non-coding RNAs in Halobacterium salinarum NRC-1. Our integrative analysis of transcriptome dynamics and protein–DNA interaction data sets showed widespread environment-dependent modulation of operon architectures, transcription initiation and termination inside coding sequences, and extensive overlap in 3′ ends of transcripts for many convergently transcribed genes. A significant fraction of these alternate transcriptional events correlate to binding locations of 11 transcription factors and regulators (TFs) inside operons and annotated genes—events usually considered spurious or non-functional. Using experimental validation, we illustrate the prevalence of overlapping genomic signals in archaeal transcription, casting doubt on the general perception of rigid boundaries between coding sequences and regulatory elements. PMID:19536208
Prevalence of transcription promoters within archaeal operons and coding sequences.
Koide, Tie; Reiss, David J; Bare, J Christopher; Pang, Wyming Lee; Facciotti, Marc T; Schmid, Amy K; Pan, Min; Marzolf, Bruz; Van, Phu T; Lo, Fang-Yin; Pratap, Abhishek; Deutsch, Eric W; Peterson, Amelia; Martin, Dan; Baliga, Nitin S
2009-01-01
Despite the knowledge of complex prokaryotic-transcription mechanisms, generalized rules, such as the simplified organization of genes into operons with well-defined promoters and terminators, have had a significant role in systems analysis of regulatory logic in both bacteria and archaea. Here, we have investigated the prevalence of alternate regulatory mechanisms through genome-wide characterization of transcript structures of approximately 64% of all genes, including putative non-coding RNAs in Halobacterium salinarum NRC-1. Our integrative analysis of transcriptome dynamics and protein-DNA interaction data sets showed widespread environment-dependent modulation of operon architectures, transcription initiation and termination inside coding sequences, and extensive overlap in 3' ends of transcripts for many convergently transcribed genes. A significant fraction of these alternate transcriptional events correlate to binding locations of 11 transcription factors and regulators (TFs) inside operons and annotated genes-events usually considered spurious or non-functional. Using experimental validation, we illustrate the prevalence of overlapping genomic signals in archaeal transcription, casting doubt on the general perception of rigid boundaries between coding sequences and regulatory elements.
Genomic Sequence around Butterfly Wing Development Genes: Annotation and Comparative Analysis
Conceição, Inês C.; Long, Anthony D.; Gruber, Jonathan D.; Beldade, Patrícia
2011-01-01
Background: Analysis of genomic sequence allows characterization of genome content and organization, and access beyond gene-coding regions for identification of functional elements. BAC libraries, where relatively large genomic regions are made readily available, are especially useful for species without a fully sequenced genome and can increase genomic coverage of phylogenetic and biological diversity. For example, no butterfly genome is yet available despite the unique genetic and biological properties of this group, such as diversified wing color patterns. The evolution and development of these patterns is being studied in a few target species, including Bicyclus anynana, where a whole-genome BAC library allows targeted access to large genomic regions. Methodology/Principal Findings: We characterize ∼1.3 Mb of genomic sequence around 11 selected genes expressed in B. anynana developing wings. Extensive manual curation of in silico predictions, also making use of a large dataset of expressed genes for this species, identified repetitive elements and protein coding sequence, and highlighted an expansion of Alcohol dehydrogenase genes. Comparative analysis with orthologous regions of the lepidopteran reference genome allowed assessment of conservation of fine-scale synteny (with detection of new inversions and translocations) and of DNA sequence (with detection of high levels of conservation of non-coding regions around some, but not all, developmental genes). Conclusions: The general properties and organization of the available B. anynana genomic sequence are similar to the lepidopteran reference, despite the more than 140 MY divergence. Our results lay the groundwork for further studies of new interesting findings in relation to both coding and non-coding sequence: 1) the Alcohol dehydrogenase expansion with higher similarity between the five tandemly-repeated B. anynana paralogs than with the corresponding B. mori orthologs, and 2) the high conservation of non-coding sequence around the genes wingless and Ecdysone receptor, both involved in multiple developmental processes including wing pattern formation. PMID:21909358
Novel numerical and graphical representation of DNA sequences and proteins.
Randić, M; Novic, M; Vikić-Topić, D; Plavsić, D
2006-12-01
We have introduced novel numerical and graphical representations of DNA, which offer a simple and unique characterization of DNA sequences. The numerical representation of a DNA sequence is given as a sequence of real numbers derived from a unique graphical representation of the standard genetic code. There is no loss of information on the primary structure of a DNA sequence associated with this numerical representation. The novel representations are illustrated with the coding sequences of the first exon of beta-globin gene of half a dozen species in addition to human. The method can be extended to proteins as is exemplified by humanin, a 24-aa peptide that has recently been identified as a specific inhibitor of neuronal cell death induced by familial Alzheimer's disease mutant genes.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brinson, E.C.; Adriano, T.; Bloch, W.
1994-09-01
We have developed a rapid, single-tube, non-isotopic assay that screens a patient sample for the presence of 31 cystic fibrosis (CF) mutations. This assay can identify these mutations in a single reaction tube and a single electrophoresis run. Sample preparation is a simple, boil-and-go procedure, completed in less than an hour. The assay is composed of a 15-plex PCR, followed by a 61-plex oligonucleotide ligation assay (OLA), and incorporates a novel detection scheme, Sequence Coded Separation. Initially, the multiplex PCR amplifies 15 relevant segments of the CFTR gene, simultaneously. These PCR amplicons serve as templates for the multiplex OLA, which detects the normal or mutant allele at all loci, simultaneously. Each polymorphic site is interrogated by three oligonucleotide probes, a common probe and two allele-specific probes. Each common probe is tagged with a fluorescent dye, and the competing normal and mutant allelic probes incorporate different, non-nucleotide, mobility modifiers. These modifiers are composed of hexaethylene oxide (HEO) units, incorporated as HEO phosphoramidite monomers during automated DNA synthesis. The OLA is based on both probe hybridization and the ability of DNA ligase to discriminate single base mismatches at the junction between paired probes. Each single tube assay is electrophoresed in a single gel lane of a 4-color fluorescent DNA sequencer (Applied Biosystems, Model 373A). Each of the ligation products is identified by its unique combination of electrophoretic mobility and one of three colors. The fourth color is reserved for the in-lane size standard, used by GENESCAN™ software (Applied Biosystems) to size the OLA electrophoresis products. The Genotyper™ software (Applied Biosystems) decodes these Sequence-Coded-Separation data to create a patient summary report for all loci tested.
Kawano, Tomonori
2013-03-01
There have been a wide variety of approaches for handling pieces of DNA as "unplugged" tools for digital information storage and processing, including a series of studies applied to the security-related area, such as DNA-based digital barcodes, watermarks and cryptography. In the present article, novel designs of artificial genes as media for storing digitally compressed image data are proposed for bio-computing purposes, whereas natural genes principally encode proteins. Furthermore, the proposed system allows cryptographical application of DNA through biochemically editable designs with capacity for steganographical numeric data embedment. As a model case of image-coding DNA technique application, numerically and biochemically combined protocols are employed for ciphering the given "passwords" and/or secret numbers using DNA sequences. The "passwords" of interest were decomposed into single letters and translated into font images coded on separate DNA chains with both the coding regions, in which the images are encoded based on the novel run-length encoding rule, and the non-coding regions, designed for biochemical editing and the remodeling processes revealing the hidden orientation of letters composing the original "passwords." The latter processes require molecular biological tools for digestion and ligation of the fragmented DNA molecules, targeting the polymerase chain reaction-engineered termini of the chains. Lastly, additional protocols for steganographical overwriting of the numeric data of interest over the image-coding DNA are also discussed.
Fan, SiGang; Hu, ChaoQun; Wen, Jing; Zhang, LvPing
2011-05-01
The complete mitochondrial DNA sequence contains useful information for phylogenetic analyses of metazoa. In this study, the complete mitochondrial DNA sequence of sea cucumber Stichopus horrens (Holothuroidea: Stichopodidae: Stichopus) is presented. The complete sequence was determined using normal and long PCRs. The mitochondrial genome of Stichopus horrens is a circular molecule 16,257 bp long, composed of 13 protein-coding genes, two ribosomal RNA genes and 22 transfer RNA genes. Most of these genes are coded on the heavy strand except for one protein-coding gene (nad6) and five tRNA genes (tRNA-Ser(UCN), tRNA-Gln, tRNA-Ala, tRNA-Val, tRNA-Asp) which are coded on the light strand. The composition of the heavy strand is 30.8% A, 23.7% C, 16.2% G, and 29.3% T bases (AT skew=0.025; GC skew=-0.188). A non-coding region of 675 bp was identified as a putative control region because of its location and AT richness. The intergenic spacers range from 1 to 50 bp in size, totaling 227 bp. A total of 25 overlapping nucleotides, ranging from 1 to 10 bp in size, exist among 11 genes. All 13 protein-coding genes are initiated with an ATG. The TAA codon is used as the stop codon in all the protein coding genes except nad3 and nad4 that use TAG as their termination codon. The most frequently used amino acids are Leu (16.29%), Ser (10.34%) and Phe (8.37%). All of the tRNA genes have the potential to fold into typical cloverleaf secondary structures. We also compared the order of the genes in the mitochondrial DNA from the five holothurians that are now available and found a novel gene arrangement in the mitochondrial DNA of Stichopus horrens.
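The skew values quoted in the abstract above follow from the stated base composition, assuming the usual definitions AT skew = (A - T)/(A + T) and GC skew = (G - C)/(G + C). The short check below uses only the percentages given in the abstract; the helper names are illustrative.

```python
def skew(x, y):
    """Normalized difference (x - y) / (x + y) between two base frequencies."""
    return (x - y) / (x + y)


def skews_from_sequence(seq):
    """Compute (AT skew, GC skew) from the base counts of a DNA string."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    return skew(counts["A"], counts["T"]), skew(counts["G"], counts["C"])


# Heavy-strand composition (percent) as reported in the abstract.
composition = {"A": 30.8, "C": 23.7, "G": 16.2, "T": 29.3}
at_skew = skew(composition["A"], composition["T"])
gc_skew = skew(composition["G"], composition["C"])
```

Rounded to three decimals these give 0.025 and -0.188, matching the reported values. `skews_from_sequence` applies the same formulas to a raw sequence and assumes the sequence contains both A/T and G/C bases.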
Long non-coding RNA produced by RNA polymerase V determines boundaries of heterochromatin
Böhmdorfer, Gudrun; Sethuraman, Shriya; Rowley, M Jordan; Krzyszton, Michal; Rothi, M Hafiz; Bouzit, Lilia; Wierzbicki, Andrzej T
2016-01-01
RNA-mediated transcriptional gene silencing is a conserved process where small RNAs target transposons and other sequences for repression by establishing chromatin modifications. Central to this process are long non-coding RNAs (lncRNAs), which in Arabidopsis thaliana are produced by a specialized RNA polymerase known as Pol V. Here we show that non-coding transcription by Pol V is controlled by preexisting chromatin modifications located within the transcribed regions. Most Pol V transcripts are associated with AGO4 but are not sliced by AGO4. Pol V-dependent DNA methylation is established on both strands of DNA and is tightly restricted to Pol V-transcribed regions. This indicates that chromatin modifications are established in close proximity to Pol V. Finally, Pol V transcription is preferentially enriched on edges of silenced transposable elements, where Pol V transcribes into TEs. We propose that Pol V may play an important role in the determination of heterochromatin boundaries. DOI: http://dx.doi.org/10.7554/eLife.19092.001 PMID:27779094
Genome Analysis of the Domestic Dog (Korean Jindo) by Massively Parallel Sequencing
Kim, Ryong Nam; Kim, Dae-Soo; Choi, Sang-Haeng; Yoon, Byoung-Ha; Kang, Aram; Nam, Seong-Hyeuk; Kim, Dong-Wook; Kim, Jong-Joo; Ha, Ji-Hong; Toyoda, Atsushi; Fujiyama, Asao; Kim, Aeri; Kim, Min-Young; Park, Kun-Hyang; Lee, Kang Seon; Park, Hong-Seog
2012-01-01
Although pioneering sequencing projects have shed light on the boxer and poodle genomes, a number of challenges need to be met before the sequencing and annotation of the dog genome can be considered complete. Here, we present the DNA sequence of the Jindo dog genome, sequenced to 45-fold average coverage using Illumina massively parallel sequencing technology. A comparison of the sequence to the reference boxer genome led to the identification of 4 675 437 single nucleotide polymorphisms (SNPs, including 3 346 058 novel SNPs), 71 642 indels and 8131 structural variations. Of these, 339 non-synonymous SNPs and 3 indels are located within coding sequences (CDS). In particular, 3 non-synonymous SNPs and a 26-bp deletion occur in the TCOF1 locus, implying that the difference observed in cranial facial morphology between Jindo and boxer dogs might be influenced by those variations. Through the annotation of the Jindo olfactory receptor gene family, we found 2 unique olfactory receptor genes and 236 olfactory receptor genes harbouring non-synonymous homozygous SNPs that are likely to affect smelling capability. In addition, we determined the DNA sequence of the Jindo dog mitochondrial genome and identified Jindo dog-specific mtDNA genotypes. These Jindo genome data upgrade our understanding of dog genomic architecture and will be a very valuable resource for investigating not only dog genetics and genomics but also human and dog disease genetics and comparative genomics. PMID:22474061
Identification of G-quadruplex forming sequences in three manatee papillomaviruses
Zahin, Maryam; Dean, William L.; Ghim, Shin-je; Joh, Joongho; Gray, Robert D.; Khanal, Sujita; Bossart, Gregory D.; Mignucci-Giannoni, Antonio A.; Rouchka, Eric C.; Jenson, Alfred B.; Trent, John O.; Chaires, Jonathan B.
2018-01-01
The Florida manatee (Trichechus manatus latirostris) is a threatened aquatic mammal in United States coastal waters. Over the past decade, the appearance of papillomavirus-induced lesions and viral papillomatosis in manatees has been a concern for those involved in the management and rehabilitation of this species. To date, three manatee papillomaviruses (TmPVs) have been identified in Florida manatees, one forming cutaneous lesions (TmPV1) and two forming genital lesions (TmPV3 and TmPV4). We identified DNA sequences with the potential to form G-quadruplex structures (G4) across the three genomes. G4 were located on both DNA strands and across coding and non-coding regions on all TmPVs, offering multiple targets for viral control. Although G4 have been identified in several viral genomes, including human PVs, most research has focused on canonical structures composed of three G-tetrads. In contrast, the vast majority of sequences we identified would allow the formation of non-canonical structures with only two G-tetrads. Our biophysical analysis confirmed the formation of G4 with parallel topology in three such sequences from the E2 region. Two of the structures appear to be composed of multiple stacked two-G-tetrad structures, perhaps serving to increase structural stability. Computational analysis demonstrated enrichment of G4 sequences on all TmPVs on the reverse strand in the E2/E4 region and on both strands in the L2 region. Several G4 sequences occurred at similar regional locations on all PVs, most notably on the reverse strand in the E2 region. In other cases, G4 were identified at similar regional locations only on PVs forming genital lesions. On all TmPVs, G4 sequences were located in the non-coding region near putative E2 binding sites. Together, these findings suggest that G4 are possible regulatory elements in TmPVs. PMID:29630682
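The abstract above does not state which search pattern was used, so the following is only a minimal sketch of how putative G4-forming sequences are often located: four runs of G separated by short loops, with the minimum run length set to 2 for two-tetrad candidates or 3 for three-tetrad candidates. The function names, the loop length limit of 7 and the example sequences are assumptions made for illustration, not details from the study. Reverse-strand candidates are found as the equivalent C-run pattern on the forward sequence.

```python
import re

def g4_pattern(base, min_run=2, max_loop=7):
    """Build a regex for four runs of `base` separated by loops of 1..max_loop nt.

    min_run=2 targets two-tetrad candidates, min_run=3 three-tetrad candidates.
    The loop limit is an assumed, commonly used default.
    """
    run = f"{base}{{{min_run},}}"
    loop = f"[ACGT]{{1,{max_loop}}}"
    return re.compile(f"(?:{run}{loop}){{3}}{run}")

def find_g4(seq, min_run=2, max_loop=7):
    """Return sorted (start, end, strand) tuples of non-overlapping matches.

    Coordinates are 0-based, end-exclusive, on the forward sequence.
    Strand '-' hits are C-run motifs, i.e. G-runs on the reverse strand.
    """
    seq = seq.upper()
    hits = []
    for strand, base in (("+", "G"), ("-", "C")):
        for match in g4_pattern(base, min_run, max_loop).finditer(seq):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)

if __name__ == "__main__":
    # Hypothetical toy sequence with one two-tetrad candidate on the forward strand.
    print(find_g4("TTGGAGGTGGAGGTT"))
```

A regex scan of this kind only flags sequence potential; whether a candidate actually folds has to be established experimentally, as the biophysical analysis in the abstract does for three E2-region sequences.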
Fortin, Connor H; Schulze, Katharina V; Babbitt, Gregory A
2015-01-01
It is now widely accepted that DNA sequences defining DNA-protein interactions functionally depend upon local biophysical features of the DNA backbone that are important in defining sites of binding interaction in the genome (e.g. DNA shape, charge and intrinsic dynamics). However, these physical features of the DNA polymer are not directly apparent when analyzing and viewing Shannon information content calculated at single nucleobases in a traditional sequence logo plot. Thus, sequence logo plots are severely limited in that they convey no explicit information regarding the structural dynamics of the DNA backbone, a feature often critical to binding specificity. We present TRX-LOGOS, an R software package and Perl wrapper code that interfaces with the JASPAR database for computational regulatory genomics. TRX-LOGOS extends the traditional sequence logo plot to include Shannon information content calculated with regard to the dinucleotide-based BI-BII conformation shifts in phosphate linkages on the DNA backbone, thereby adding a visual measure of intrinsic DNA flexibility that can be critical for many DNA-protein interactions. TRX-LOGOS is available as an R graphics module offered both at SourceForge and as a download supplement at this journal. To demonstrate the general utility of TRX logo plots, we first calculated the information content for 416 Saccharomyces cerevisiae transcription factor binding sites functionally confirmed in the Yeastract database and matched to previously published yeast genomic alignments. We discovered that flanking regions contain significantly more information content at phosphate linkages than can be observed at nucleobases. We also examined broader transcription factor classifications defined by the JASPAR database, and discovered that many general signatures of transcription factor binding are locally more information rich at the level of DNA backbone dynamics than at the level of nucleobase sequence. We used TRX-LOGOS in combination with MEGA 6.0 software for molecular evolutionary genetics analysis to visually compare the evolution of the human Forkhead box/FOX protein with the evolution of its binding site. We also compared the DNA binding signatures of the human TP53 tumor suppressor determined by two different laboratory methods (SELEX and ChIP-seq). Further analysis of the entire yeast genome, center-aligned at the start codon, also revealed a distinct sequence-independent 3 bp periodic pattern in information content, present only in coding regions, and perhaps indicative of the non-random organization of the genetic code. TRX-LOGOS is useful in any situation in which important information content in DNA can be better visualized at the positions of phosphate linkages (i.e. dinucleotides), where the dynamic properties of the DNA backbone function to facilitate DNA-protein interaction.
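For orientation, the per-position Shannon information content that a traditional nucleobase sequence logo displays can be sketched as below. This is a hedged illustration of the standard measure only (2 bits minus the column entropy, with no small-sample correction); it is not the TRX-LOGOS code, and it does not cover the dinucleotide BI-BII extension the abstract describes. The function name and the example sites are hypothetical.

```python
from collections import Counter
from math import log2

def information_content(sites):
    """Per-column information content, in bits, for aligned DNA sites.

    Each column scores 2 - H, where H is the Shannon entropy of the base
    frequencies in that column and 2 = log2(4) is the maximum for DNA.
    No small-sample correction is applied (a simplifying assumption).
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    scores = []
    for column in zip(*sites):
        counts = Counter(column)
        n = len(column)
        entropy = -sum(c / n * log2(c / n) for c in counts.values())
        scores.append(2.0 - entropy)
    return scores

if __name__ == "__main__":
    # Hypothetical aligned binding sites, for illustration only.
    for position, bits in enumerate(information_content(["ACGT", "ACGA", "ACCC", "ATTG"])):
        print(position, round(bits, 3))
```

A fully conserved column scores 2 bits and a column with all four bases equally frequent scores 0 bits, which is why logo stacks are tallest at conserved positions.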
Liaw, Yu-Ching; Chen, Cheng-Hsu; Shu, Kuo-Hsiung; Fang, Chiung-Yao; Ou, Wei-Chih; Chen, Pei-Lain; Shen, Cheng-Huang; Lin, Mien-Chun; Chang, Deching; Wang, Meilin
2012-12-01
Kidney cells are the common host for JC virus (JCV) and BK virus (BKV). Reactivation of JCV and/or BKV in patients after organ transplantation, such as renal transplantation, may cause hemorrhagic cystitis and polyomavirus-associated nephropathy. Furthermore, JCV and BKV may be shed in the urine after reactivation in the kidney. Rearranged as well as archetypal non-coding control regions (NCCRs) of JCV and BKV have been frequently identified in human samples. In this study, three JC/BK recombined NCCR sequences were identified in the urine of a patient who had undergone renal transplantation. They were designated as JC-BK hybrids 1, 2, and 3. The three JC/BK recombinant NCCRs contain up-stream JCV as well as down-stream BKV sequences. Deletions of both JCV and BKV sequences were found in these recombined NCCRs. Recombination of DNA sequences between JCV and BKV may occur during co-infection due to the relatively high homology of the two viral genomes.
Physics behind the mechanical nucleosome positioning code
NASA Astrophysics Data System (ADS)
Zuiddam, Martijn; Everaers, Ralf; Schiessel, Helmut
2017-11-01
The positions along DNA molecules of nucleosomes, the most abundant DNA-protein complexes in cells, are influenced by the sequence-dependent DNA mechanics and geometry. This leads to the "nucleosome positioning code", a preference of nucleosomes for certain sequence motifs. Here we introduce a simplified model of the nucleosome where a coarse-grained DNA molecule is frozen into an idealized superhelical shape. We calculate the exact sequence preferences of our nucleosome model and find it to reproduce qualitatively all the main features known to influence nucleosome positions. Moreover, using well-controlled approximations to this model allows us to come to a detailed understanding of the physics behind the sequence preferences of nucleosomes.
Pollier, Jacob; González-Guzmán, Miguel; Ardiles-Diaz, Wilson; Geelen, Danny; Goossens, Alain
2011-01-01
cDNA-Amplified Fragment Length Polymorphism (cDNA-AFLP) is a commonly used technique for genome-wide expression analysis that does not require prior sequence knowledge. Typically, quantitative expression data and sequence information are obtained for a large number of differentially expressed gene tags. However, most of the gene tags do not correspond to full-length (FL) coding sequences, which is a prerequisite for subsequent functional analysis. A medium-throughput screening strategy, based on integration of polymerase chain reaction (PCR) and colony hybridization, was developed that allows in parallel screening of a cDNA library for FL clones corresponding to incomplete cDNAs. The method was applied to screen for the FL open reading frames of a selection of 163 cDNA-AFLP tags from three different medicinal plants, leading to the identification of 109 (67%) FL clones. Furthermore, the protocol allows for the use of multiple probes in a single hybridization event, thus significantly increasing the throughput when screening for rare transcripts. The presented strategy offers an efficient method for the conversion of incomplete expressed sequence tags (ESTs), such as cDNA-AFLP tags, to FL-coding sequences.
Habenicht, A; Quesada, A; Cerff, R
1997-10-01
A cDNA library has been constructed from Nicotiana plumbaginifolia seedlings, and the non-phosphorylating glyceraldehyde-3-phosphate dehydrogenase (GapN, EC 1.2.1.9) was isolated by plaque hybridization using the cDNA from pea as a heterologous probe. The cDNA comprises the entire GapN coding region. A putative polyadenylation signal is identified. Phylogenetic analysis based on the deduced amino acid sequences revealed that the GapN gene family represents a separate ancient branch within the aldehyde dehydrogenase superfamily. It can be shown that the GapN gene family and other distinct branches of the superfamily have their phylogenetic origin before the separation of primary life-forms. This further demonstrates that already very early in evolution, a broad diversification of the aldehyde dehydrogenases led to the formation of the superfamily.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Helfenbein, Kevin G.; Fourcade, H. Matthew; Vanjani, Rohit G.
2004-05-01
We report the first complete mitochondrial (mt) DNA sequence from a member of the phylum Chaetognatha (arrow worms). The Paraspadella gotoi mtDNA is highly unusual, missing 23 of the genes commonly found in animal mtDNAs, including atp6, which has otherwise been found universally to be present. Its 14 genes are unusually arranged into two groups, one on each strand. One group is punctuated by numerous non-coding intergenic nucleotides, while the other group is tightly packed, having no non-coding nucleotides, leading to speculation that there are two transcription units with differing modes of expression. The phylogenetic position of the Chaetognatha within the Metazoa has long been uncertain, with conflicting or equivocal results from various morphological analyses and rRNA sequence comparisons. Comparisons here of amino acid sequences from mitochondrially encoded proteins give a single most parsimonious tree that supports a position of Chaetognatha as sister to the protostomes studied here. From this, one can more clearly interpret the patterns of evolution of various developmental features, especially regarding the embryological fate of the blastopore.
Morchikh, Mehdi; Cribier, Alexandra; Raffel, Raoul; Amraoui, Sonia; Cau, Julien; Severac, Dany; Dubois, Emeric; Schwartz, Olivier; Bennasser, Yamina; Benkirane, Monsef
2017-08-03
The DNA-mediated innate immune response underpins anti-microbial defenses and certain autoimmune diseases. Here we used immunoprecipitation, mass spectrometry, and RNA sequencing to identify a ribonuclear complex built around HEXIM1 and the long non-coding RNA NEAT1 that we dubbed the HEXIM1-DNA-PK-paraspeckle components-ribonucleoprotein complex (HDP-RNP). The HDP-RNP contains DNA-PK subunits (DNAPKc, Ku70, and Ku80) and paraspeckle proteins (SFPQ, NONO, PSPC1, RBM14, and MATRIN3). We show that binding of HEXIM1 to NEAT1 is required for its assembly. We further demonstrate that the HDP-RNP is required for the innate immune response to foreign DNA, through the cGAS-STING-IRF3 pathway. The HDP-RNP interacts with cGAS and its partner PQBP1, and their interaction is remodeled by foreign DNA. Remodeling leads to the release of paraspeckle proteins, recruitment of STING, and activation of DNAPKc and IRF3. Our study establishes the HDP-RNP as a key nuclear regulator of DNA-mediated activation of innate immune response through the cGAS-STING pathway.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, O.; Masters, C.; Lewis, M.B.
1994-09-01
In an 8-year-old girl and her father, both of whom have severe type III OI, we have previously used RNA/RNA hybrid analysis to demonstrate a mismatch in the region of α1(I) mRNA coding for aa 558-861. We used SSCP to further localize the abnormality to a subregion coding for aa 579-679. This region was subcloned and sequenced. Each patient's cDNA has a deletion of the sequences coding for the last residue of exon 34, and all of exons 35 and 36 (aa 604-639), followed by an insertion of 156 nt from the 3'-end of intron 36. PCR amplification of leukocyte DNA from the patients and the clinically normal paternal grandmother yielded two fragments: a 1007 bp fragment predicted from normal genomic sequences and a 445 bp fragment. Subcloning and sequencing of the shorter genomic PCR product confirmed the presence of a 565 bp genomic deletion from the end of exon 34 to the middle of intron 36. The abnormal protein is apparently synthesized and incorporated into helix. The inserted nucleotides are in frame with the collagenous sequence and contain no stop codons. They encode a 52 aa non-collagenous region. The fibroblast procollagen of the patients has both normal and electrophoretically delayed proα(I) bands. The electrophoretically delayed procollagen is very sensitive to pepsin or trypsin digestion, as predicted by its non-collagenous sequence, and cannot be visualized as collagen. This unique OI collagen mutation is an excellent candidate for molecular targeting to "turn off" a dominant mutant allele.
Ikegami, Kohta; Ohgane, Jun; Tanaka, Satoshi; Yagi, Shintaro; Shiota, Kunio
2009-01-01
Genes constitute only a small proportion of the mammalian genome, the majority of which is composed of non-genic repetitive elements including interspersed repeats and satellites. A unique feature of the mammalian genome is that there are numerous tissue-dependent, differentially methylated regions (T-DMRs) in the non-repetitive sequences, which include genes and their regulatory elements. The epigenetic status of T-DMRs varies from that of repetitive elements and constitutes the DNA methylation profile genome-wide. Since the DNA methylation profile is specific to each cell and tissue type, much like a fingerprint, it can be used as a means of identification. The formation of DNA methylation profiles is the basis for cell differentiation and development in mammals. The epigenetic status of each T-DMR is regulated by the interplay between DNA methyltransferases, histone modification enzymes, histone subtypes, non-histone nuclear proteins and non-coding RNAs. In this review, we will discuss how these epigenetic factors cooperate to establish cell- and tissue-specific DNA methylation profiles.
2014-01-01
The linear algebraic concept of a subspace plays a significant role in recent techniques of spectrum estimation. In this article, the authors utilize the noise subspace concept to find hidden periodicities in DNA sequences. With the vast growth of genomic sequence data, the demand to accurately identify the protein-coding regions in DNA is rising. Several techniques of DNA feature extraction that draw on various fields have emerged in the recent past, among which the application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this feature. One of the most important spectrum analysis techniques based on the concept of subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions while completely eliminating background noise. The proposed method was compared with the existing sliding discrete Fourier transform (SDFT) method, popularly known as the modified periodogram method, on several genes from various organisms, and the results show that the proposed method is a better and more effective approach to gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish the superiority of the least-norm gene prediction method over the existing method. PMID:24386895
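The 3-base periodicity mentioned above is commonly measured with a plain discrete Fourier transform evaluated at frequency 1/3 over a sliding window. The sketch below illustrates that baseline idea only; it is not the least-norm estimator of the paper, and the function names, the window length of 351 and the step of 3 are assumptions chosen for illustration.

```python
import cmath

def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base indicators.

    For each base, the indicator sequence (1 where the base occurs, else 0)
    is projected onto exp(-2*pi*i*n/3); squared magnitudes are summed and
    divided by the sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        component = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, symbol in enumerate(seq)
            if symbol == base
        )
        total += abs(component) ** 2
    return total / n

def sliding_period3(seq, window=351, step=3):
    """Return (start, power) pairs for windows slid along the sequence."""
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

if __name__ == "__main__":
    # Hypothetical toy input: a perfectly period-3 sequence versus a period-4 one.
    print(period3_power("ATG" * 50), period3_power("ACGT" * 30))
```

Windows whose power stands well above the background are candidates for coding regions; the abstract's point is that a subspace-based estimator can give sharper peaks than this periodogram-style measure.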
Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics
NASA Technical Reports Server (NTRS)
Mantegna, R. N.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Peng, C. K.; Simons, M.; Stanley, H. E.
1995-01-01
We compare the statistical properties of coding and noncoding regions in eukaryotic and viral DNA sequences by adapting two tests developed for the analysis of natural languages and symbolic sequences. The data set comprises all 30 sequences of length above 50 000 base pairs in GenBank Release No. 81.0, as well as the recently published sequences of C. elegans chromosome III (2.2 Mbp) and yeast chromosome XI (661 Kbp). We find that for the three chromosomes we studied the statistical properties of noncoding regions appear to be closer to those observed in natural languages than those of coding regions. In particular, (i) an n-tuple Zipf analysis of noncoding regions reveals a regime close to power-law behavior while the coding regions show logarithmic behavior over a wide interval, and (ii) an n-gram entropy measurement shows that the noncoding regions have a lower n-gram entropy (and hence a larger "n-gram redundancy") than the coding regions. In contrast to the three chromosomes, we find that for vertebrates such as primates and rodents and for viral DNA, the difference between the statistical properties of coding and noncoding regions is not pronounced and therefore the results of the analyses of the investigated sequences are less conclusive. After noting the intrinsic limitations of the n-gram redundancy analysis, we also briefly discuss the failure of the zeroth- and first-order Markovian models or simple nucleotide repeats to account fully for these "linguistic" features of DNA. Finally, we emphasize that our results by no means prove the existence of a "language" in noncoding DNA.
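The two tests named above can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: it counts overlapping n-tuples (an assumption, since the abstract does not say how the sequence is windowed), computes the n-gram entropy in bits, derives a redundancy as 1 - H(n)/(2n) using the 2 bits per nucleotide maximum for a four-letter alphabet (an assumed normalization), and builds a Zipf rank-frequency table.

```python
from collections import Counter
from math import log2

def ngram_counts(seq, n):
    """Count overlapping n-tuples ("words") in the sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def ngram_entropy(seq, n):
    """Shannon entropy, in bits, of the n-tuple frequency distribution."""
    counts = ngram_counts(seq, n)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def ngram_redundancy(seq, n):
    """1 - H(n) / (2n); 2n bits is the maximum entropy for DNA n-tuples."""
    return 1.0 - ngram_entropy(seq, n) / (2 * n)

def zipf_table(seq, n):
    """Return (rank, word, count) rows, most frequent n-tuple first."""
    ranked = ngram_counts(seq, n).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]

if __name__ == "__main__":
    # Hypothetical toy sequence, for illustration only.
    toy = "ACGTACGTTTTTACGTAAAA"
    print(ngram_entropy(toy, 3), ngram_redundancy(toy, 3))
    print(zipf_table(toy, 3)[:5])
```

Plotting log(count) against log(rank) from the Zipf table is how a power-law regime would be inspected; on short sequences the estimates are dominated by finite-sample effects, which is one of the limitations the abstract alludes to.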
Antalis, T M; Clark, M A; Barnes, T; Lehrbach, P R; Devine, P L; Schevzov, G; Goss, N H; Stephens, R W; Tolstoshev, P
1988-01-01
Human monocyte-derived plasminogen activator inhibitor (mPAI-2) was purified to homogeneity from the U937 cell line and partially sequenced. Oligonucleotide probes derived from this sequence were used to screen a cDNA library prepared from U937 cells. One positive clone was sequenced and contained most of the coding sequence as well as a long incomplete 3' untranslated region (1112 base pairs). This cDNA sequence was shown to encode mPAI-2 by hybrid-select translation. A cDNA clone encoding the remainder of the mPAI-2 mRNA was obtained by primer extension of U937 poly(A)+ RNA using a probe complementary to the mPAI-2 coding region. The coding sequence for mPAI-2 was placed under the control of the lambda PL promoter, and the protein expressed in Escherichia coli formed a complex with urokinase that could be detected immunologically. By nucleotide sequence analysis, mPAI-2 cDNA encodes a protein containing 415 amino acids with a predicted unglycosylated Mr of 46,543. The predicted amino acid sequence of mPAI-2 is very similar to placental PAI-2 (3 amino acid differences) and shows extensive homology with members of the serine protease inhibitor (serpin) superfamily. mPAI-2 was found to be more homologous to ovalbumin (37%) than the endothelial plasminogen activator inhibitor, PAI-1 (26%). Like ovalbumin, mPAI-2 appears to have no typical amino-terminal signal sequence. The 3' untranslated region of the mPAI-2 cDNA contains a putative regulatory sequence that has been associated with the inflammatory mediators. PMID:3257578
Natural Antisense Transcripts: Molecular Mechanisms and Implications in Breast Cancers
Latgé, Guillaume; Poulet, Christophe; Bours, Vincent; Josse, Claire; Jerusalem, Guy
2018-01-01
Natural antisense transcripts are RNA sequences that can be transcribed from both DNA strands at the same locus but in the opposite direction from the gene transcript. Because strand-specific high-throughput sequencing of the antisense transcriptome has only been available for less than a decade, many natural antisense transcripts were first described as long non-coding RNAs. Although the precise biological roles of natural antisense transcripts are not known yet, an increasing number of studies report their implication in gene expression regulation. Their expression levels are altered in many physiological and pathological conditions, including breast cancers. Among the potential clinical utilities of the natural antisense transcripts, the non-coding|coding transcript pairs are of high interest for treatment. Indeed, these pairs can be targeted by antisense oligonucleotides to specifically tune the expression of the coding-gene. Here, we describe the current knowledge about natural antisense transcripts, their varying molecular mechanisms as gene expression regulators, and their potential as prognostic or predictive biomarkers in breast cancers. PMID:29301303
Pompei, Fiorenza; Ciminelli, Bianca Maria; Bombieri, Cristina; Ciccacci, Cinzia; Koudova, Monika; Giorgi, Silvia; Belpinati, Francesca; Begnini, Angela; Cerny, Milos; Des Georges, Marie; Claustres, Mireille; Ferec, Claude; Macek, Milan; Modiano, Guido; Pignatti, Pier Franco
2006-01-01
An average of about 1700 CFTR (cystic fibrosis transmembrane conductance regulator) alleles from normal individuals from different European populations were extensively screened for DNA sequence variation. A total of 80 variants were observed: 61 coding SNSs (results already published), 13 noncoding SNSs, three STRs, two short deletions, and one nucleotide insertion. Eight DNA variants were classified as non-CF causing due to their high frequency of occurrence. Through this survey the CFTR has become the most exhaustively studied gene for its coding sequence variability and, though to a lesser extent, for its noncoding sequence variability as well. Interestingly, most variation was associated with the M470 allele, while the V470 allele showed an 'extended haplotype homozygosity' (EHH). These findings lead us to suggest a role for selection acting either on M470V itself or through a hitchhiking mechanism involving a second site. The possible ancient origin of the V allele in an 'out of Africa' time frame is discussed.
Lichenase and coding sequences
Li, Xin-Liang; Ljungdahl, Lars G.; Chen, Huizhong
2000-08-15
The present invention provides a fungal lichenase, i.e., an endo-1,3-1,4-β-D-glucanohydrolase, its coding sequence, recombinant DNA molecules comprising the lichenase coding sequences, recombinant host cells and methods for producing same. The present lichenase is from Orpinomyces PC-2.
Caruccio, Nicholas
2011-01-01
DNA library preparation is a common entry point and bottleneck for next-generation sequencing. Current methods generally consist of distinct steps that often involve significant sample loss and hands-on time: DNA fragmentation, end-polishing, and adaptor-ligation. In vitro transposition with Nextera™ Transposomes simultaneously fragments and covalently tags the target DNA, thereby combining these three distinct steps into a single reaction. Platform-specific sequencing adaptors can be added, and the sample can be enriched and bar-coded using limited-cycle PCR to prepare di-tagged DNA fragment libraries. Nextera technology offers a streamlined, efficient, and high-throughput method for generating bar-coded libraries compatible with multiple next-generation sequencing platforms.
Phylogenic study of Lemnoideae (duckweeds) through complete chloroplast genomes for eight accessions
Ding, Yanqiang; Fang, Yang; Guo, Ling; Li, Zhidan; He, Kaize; Zhao, Yun; Zhao, Hai
2017-01-01
Background: The phylogenetic relationships among the genera of Lemnoideae, a group of small aquatic monocotyledonous plants, have not been well resolved using either morphological characters or traditional markers. Given that the rich genetic information in chloroplast genomes makes them particularly useful for phylogenetic studies, we used chloroplast genomes to clarify the phylogeny within Lemnoideae. Methods: DNAs were sequenced with next-generation sequencing. The duckweed chloroplast genomes were indirectly filtered from the total DNA data, or directly obtained from chloroplast DNA data. To test the reliability of assembling the chloroplast genome based on the filtration of the total DNA, two methods were used to assemble the chloroplast genome of Landoltia punctata strain ZH0202. A phylogenetic tree was built on the basis of the whole chloroplast genome sequences using MrBayes v.3.2.6 and PhyML 3.0. Results: Eight complete duckweed chloroplast genomes were assembled, with lengths ranging from 165,775 bp to 171,152 bp, and each contains 80 protein-coding sequences, four rRNAs, 30 tRNAs and two pseudogenes. The identity of the L. punctata strain ZH0202 chloroplast genomes assembled through the two methods was 100%, and their sequences and lengths were completely identical. The chloroplast genome comparison demonstrated that the differences in chloroplast genome sizes among the Lemnoideae primarily resulted from variation in non-coding regions, especially from repeat sequence variation. The phylogenetic analysis demonstrated that the different genera of Lemnoideae are derived from each other in the following order: Spirodela, Landoltia, Lemna, Wolffiella, and Wolffia. Discussion: This study demonstrates the potential of whole chloroplast genome DNA as an effective option for phylogenetic studies of Lemnoideae. It also showed the possibility of using chloroplast DNA data to elucidate phylogenies that have not yet been well resolved by traditional methods, even in plants other than duckweeds. PMID:29302399
COOLAIR Antisense RNAs Form Evolutionarily Conserved Elaborate Secondary Structures
Hawkes, Emily J.; Hennelly, Scott P.; Novikova, Irina V.; ...
2016-09-20
There is considerable debate about the functionality of long non-coding RNAs (lncRNAs). Lack of sequence conservation has been used to argue against functional relevance. Here, we investigated antisense lncRNAs, called COOLAIR, at the A. thaliana FLC locus and experimentally determined their secondary structure. The major COOLAIR variants are highly structured, organized by exon. The distally polyadenylated transcript has a complex multi-domain structure, altered by a single non-coding SNP defining a functionally distinct A. thaliana FLC haplotype. The A. thaliana COOLAIR secondary structure was used to predict COOLAIR exons in evolutionarily divergent Brassicaceae species. These predictions were validated through chemical probing and cloning. Despite the relatively low nucleotide sequence identity, the structures, including multi-helix junctions, show remarkable evolutionary conservation. In a number of places, the structure is conserved through covariation of a non-contiguous DNA sequence. This structural conservation supports a functional role for COOLAIR transcripts rather than, or in addition to, antisense transcription.
COOLAIR Antisense RNAs Form Evolutionarily Conserved Elaborate Secondary Structures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hawkes, Emily J.; Hennelly, Scott P.; Novikova, Irina V.
There is considerable debate about the functionality of long non-coding RNAs (lncRNAs). Lack of sequence conservation has been used to argue against functional relevance. Here, we investigated antisense lncRNAs, called COOLAIR, at the A. thaliana FLC locus and experimentally determined their secondary structure. The major COOLAIR variants are highly structured, organized by exon. The distally polyadenylated transcript has a complex multi-domain structure, altered by a single non-coding SNP defining a functionally distinct A. thaliana FLC haplotype. The A. thaliana COOLAIR secondary structure was used to predict COOLAIR exons in evolutionarily divergent Brassicaceae species. These predictions were validated through chemical probingmore » and cloning. Despite the relatively low nucleotide sequence identity, the structures, including multi-helix junctions, show remarkable evolutionary conservation. In a number of places, the structure is conserved through covariation of a non-contiguous DNA sequence. This structural conservation supports a functional role for COOLAIR transcripts rather than, or in addition to, antisense transcription.« less
The complete mitochondrial genome of Hydra vulgaris (Hydroida: Hydridae).
Pan, Hong-Chun; Fang, Hong-Yan; Li, Shi-Wei; Liu, Jun-Hong; Wang, Ying; Wang, An-Tai
2014-12-01
The complete mitochondrial genome of Hydra vulgaris (Hydroida: Hydridae) is composed of two linear DNA molecules. The mitochondrial DNA (mtDNA) molecule 1 is 8010 bp long and contains six protein-coding genes, large subunit rRNA, methionine and tryptophan tRNAs, two pseudogenes consisting respectively of a partial copy of COI, and terminal sequences at two ends of the linear mtDNA, while the mtDNA molecule 2 is 7576 bp long and contains seven protein-coding genes, small subunit rRNA, methionine tRNA, a pseudogene consisting of a partial copy of COI and terminal sequences at two ends of the linear mtDNA. COI gene begins with GTG as start codon, whereas other 12 protein-coding genes start with a typical ATG initiation codon. In addition, all protein-coding genes are terminated with TAA as stop codon.
DNA-binding proteins from marine bacteria expand the known sequence diversity of TALE-like repeats
de Lange, Orlando; Wolf, Christina; Thiel, Philipp; Krüger, Jens; Kleusch, Christian; Kohlbacher, Oliver; Lahaye, Thomas
2015-01-01
Transcription Activator-Like Effectors (TALEs) of Xanthomonas bacteria are programmable DNA binding proteins with unprecedented target specificity. Comparative studies into TALE repeat structure and function are hindered by the limited sequence variation among TALE repeats. More sequence-diverse TALE-like proteins are known from Ralstonia solanacearum (RipTALs) and Burkholderia rhizoxinica (Bats), but RipTAL and Bat repeats are conserved with those of TALEs around the DNA-binding residue. We study two novel marine-organism TALE-like proteins (MOrTL1 and MOrTL2), the first to date of non-terrestrial origin. We have assessed their DNA-binding properties and modelled repeat structures. We found that repeats from these proteins mediate sequence specific DNA binding conforming to the TALE code, despite low sequence similarity to TALE repeats, and with novel residues around the BSR. However, MOrTL1 repeats show greater sequence discriminating power than MOrTL2 repeats. Sequence alignments show that there are only three residues conserved between repeats of all TALE-like proteins including the two new additions. This conserved motif could prove useful as an identifier for future TALE-likes. Additionally, comparing MOrTL repeats with those of other TALE-likes suggests a common evolutionary origin for the TALEs, RipTALs and Bats. PMID:26481363
Samuels, David C.; Boys, Richard J.; Henderson, Daniel A.; Chinnery, Patrick F.
2003-01-01
We applied a hidden Markov model segmentation method to the human mitochondrial genome to identify patterns in the sequence, to compare these patterns to the gene structure of mtDNA and to see whether these patterns reveal additional characteristics important for our understanding of genome evolution, structure and function. Our analysis identified three segmentation categories based upon the sequence transition probabilities. Category 2 segments corresponded to the tRNA and rRNA genes, with a greater strand-symmetry in these segments. Category 1 and 3 segments covered the protein-coding genes and almost all of the non-coding D-loop. Compared to category 1, the mtDNA segments assigned to category 3 had much lower guanine abundance. A comparison to two independent databases of mitochondrial mutations and polymorphisms showed that the high substitution rate of guanine in human mtDNA is largest in the category 3 segments. Analysis of synonymous mutations showed the same pattern. This suggests that this heterogeneity in the mutation rate is partly independent of respiratory chain function and is a direct property of the genome sequence itself. This has important implications for our understanding of mtDNA evolution and its use as a ‘molecular clock’ to determine the rate of population and species divergence. PMID:14530452
Schmouth, Jean-François; Castellarin, Mauro; Laprise, Stéphanie; Banks, Kathleen G; Bonaguro, Russell J; McInerny, Simone C; Borretta, Lisa; Amirabbasi, Mahsa; Korecki, Andrea J; Portales-Casamar, Elodie; Wilson, Gary; Dreolini, Lisa; Jones, Steven J M; Wasserman, Wyeth W; Goldowitz, Daniel; Holt, Robert A; Simpson, Elizabeth M
2013-10-14
The next big challenge in human genetics is understanding the 98% of the genome that comprises non-coding DNA. Hidden in this DNA are sequences critical for gene regulation, and new experimental strategies are needed to understand the functional role of gene-regulation sequences in health and disease. In this study, we build upon our HuGX ('high-throughput human genes on the X chromosome') strategy to expand our understanding of human gene regulation in vivo. In all, ten human genes known to express in therapeutically important brain regions were chosen for study. For eight of these genes, human bacterial artificial chromosome clones were identified, retrofitted with a reporter, knocked single-copy into the Hprt locus in mouse embryonic stem cells, and mouse strains derived. Five of these human genes expressed in mouse, and all expressed in the adult brain region for which they were chosen. This defined the boundaries of the genomic DNA sufficient for brain expression, and refined our knowledge regarding the complexity of gene regulation. We also characterized for the first time the expression of human MAOA and NR2F2, two genes for which the mouse homologs have been extensively studied in the central nervous system (CNS), and AMOTL1 and NOV, for which roles in CNS have been unclear. We have demonstrated the use of the HuGX strategy to functionally delineate non-coding-regulatory regions of therapeutically important human brain genes. Our results also show that a careful investigation, using publicly available resources and bioinformatics, can lead to accurate predictions of gene expression.
Ali, S; Azfer, M A; Bashamboo, A; Mathur, P K; Malik, P K; Mathur, V B; Raha, A K; Ansari, S
1999-03-04
We have cloned and sequenced a 906bp EcoRI repeat DNA fraction from Rhinoceros unicornis genome. The contig pSS(R)2 is AT rich with 340 A (37.53%), 187 C (20.64%), 173 G (19.09%) and 206 T (22.74%). The sequence contains MALT box, NF-E1, Poly-A signal, lariat consensus sequences, TATA box, translational initiation sequences and several stop codons. Translation of the contig showed seven different types of protein motifs, among which, EGF-like domain cysteine pattern signatures and Bowman-Birk serine protease inhibitor family signatures were prominent. The presence of eukaryotic transcriptional elements, protein signatures and analysis of subset sequences in the 5' region from 1 to 165nt indicating coding potential (test code value=0.97) suggest possible regulatory and/or functional role(s) of these sequences in the rhino genome. Translation of the complementary strand from 906 to 706nt and 190 to 2nt showed proteins of more than 7kDa rich in non-polar residues. This suggests that pSS(R)2 is either a part of, or adjacent to, a functional gene. The contig contains mostly non-consecutive simple repeat units from 2 to 17nt with varying frequencies, of which four base motifs were found to be predominant. Zoo-blot hybridization revealed that pSS(R)2 sequences are unique to R. unicornis genome because they do not cross-hybridize, even with the genomic DNA of South African black rhino Diceros bicornis. Southern blot analysis of R. unicornis genomic DNA with pSS(R)2 and other synthetic oligo probes revealed a high level of genetic homogeneity, which was also substantiated by microsatellite associated sequence amplification (MASA). Owing to its uniqueness, the pSS(R)2 probe has a potential application in the area of conservation biology for unequivocal identification of horn or other body tissues of R. unicornis. The evolutionary aspect of this repeat fraction in the context of comparative genome analysis is discussed.
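The base composition reported for pSS(R)2 can be checked with simple arithmetic. The short Python sketch below recomputes the percentages from the nucleotide counts given in the abstract; the function name `base_composition` is illustrative only and does not come from the paper.

```python
def base_composition(counts):
    """Return the percentage of each base, rounded to two decimals."""
    total = sum(counts.values())
    return {base: round(100 * n / total, 2) for base, n in counts.items()}


# Nucleotide counts as reported for the 906 bp contig.
counts = {"A": 340, "C": 187, "G": 173, "T": 206}

total_length = sum(counts.values())          # 906
composition = base_composition(counts)       # {'A': 37.53, 'C': 20.64, 'G': 19.09, 'T': 22.74}
at_content = round(100 * (counts["A"] + counts["T"]) / total_length, 2)  # 60.26
```

The counts sum to 906, matching the stated fragment length, and the combined A+T share of about 60% is consistent with the description of the contig as AT rich.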
Lourenco-Jaramillo, Diana Lelidett; Sifuentes-Rincón, Ana María; Parra-Bracamonte, Gaspar Manuel; de la Rosa-Reyna, Xochitl Fabiola; Segura-Cabrera, Aldo; Arellano-Vera, Williams
2012-01-01
DNA from four cattle breeds was used to re-sequence all of the exons and 56% of the introns of the bovine tyrosine hydroxylase (TH) gene and 97% and 13% of the bovine dopamine β-hydroxylase (DBH) coding and non-coding sequences, respectively. Two novel single nucleotide polymorphisms (SNPs) and a microsatellite motif were found in the TH sequences. The DBH sequences contained 62 nucleotide changes, including eight non-synonymous SNPs (nsSNPs) that are of particular interest because they may alter protein function and therefore affect the phenotype. These DBH nsSNPs resulted in amino acid substitutions that were predicted to destabilize the protein structure. Six SNPs (one from TH and five from DBH non-synonymous SNPs) were genotyped in 140 animals; all of them were polymorphic and had a minor allele frequency of > 9%. There were significant differences in the intra- and inter-population haplotype distributions. The haplotype differences between Brahman cattle and the three B. t. taurus breeds (Charolais, Holstein and Lidia) were interesting from a behavioural point of view because of the differences in temperament between these breeds. PMID:22888292
Shen, Kang-Ning; Chen, Ching-Hung; Hsiao, Chung-Der
2016-05-01
In this study, the complete mitogenome of the hornlip mullet Plicomugil labiosus (Teleostei: Mugilidae) was sequenced by a next-generation sequencing method. The assembled mitogenome, consisting of 16,829 bp, had the typical vertebrate mitochondrial gene arrangement, including 13 protein-coding genes, 22 transfer RNA genes, 2 ribosomal RNA genes and a non-coding control region (D-loop). The D-loop is 1057 bp long and is located between tRNA-Pro and tRNA-Phe. The overall base composition of P. labiosus is 28.0% for A, 29.3% for C, 15.5% for G and 27.2% for T. The complete mitogenome may provide essential DNA molecular data for further population, phylogenetic and evolutionary analysis of Mugilidae.
Shen, Kang-Ning; Tsai, Shiou-Yi; Chen, Ching-Hung; Hsiao, Chung-Der; Durand, Jean-Dominique
2016-11-01
In this study, the complete mitogenome of the largescale mullet (Teleostei: Mugilidae) was sequenced by a next-generation sequencing method. The assembled mitogenome, consisting of 16,832 bp, had the typical vertebrate mitochondrial gene arrangement, including 13 protein-coding genes, 22 transfer RNA genes, two ribosomal RNA genes, and a non-coding control region (D-loop). The D-loop, which is 1094 bp long, is located between tRNA-Pro and tRNA-Phe. The overall base composition of the largescale mullet is 27.8% for A, 30.1% for C, 16.2% for G, and 25.9% for T. The complete mitogenome may provide essential DNA molecular data for further phylogenetic and evolutionary analysis of Mugilidae.
Kawano, Tomonori
2013-01-01
There have been a wide variety of approaches for handling the pieces of DNA as the “unplugged” tools for digital information storage and processing, including a series of studies applied to the security-related area, such as DNA-based digital barcodes, water marks and cryptography. In the present article, novel designs of artificial genes as the media for storing the digitally compressed data for images are proposed for bio-computing purpose while natural genes principally encode for proteins. Furthermore, the proposed system allows cryptographical application of DNA through biochemically editable designs with capacity for steganographical numeric data embedment. As a model case of image-coding DNA technique application, numerically and biochemically combined protocols are employed for ciphering the given “passwords” and/or secret numbers using DNA sequences. The “passwords” of interest were decomposed into single letters and translated into the font image coded on the separate DNA chains with both the coding regions in which the images are encoded based on the novel run-length encoding rule, and the non-coding regions designed for biochemical editing and the remodeling processes revealing the hidden orientation of letters composing the original “passwords.” The latter processes require the molecular biological tools for digestion and ligation of the fragmented DNA molecules targeting at the polymerase chain reaction-engineered termini of the chains. Lastly, additional protocols for steganographical overwriting of the numeric data of interests over the image-coding DNA are also discussed. PMID:23750303
Rizk, Francine; Laverdure, Sylvain; d'Alençon, Emmanuelle; Bossin, Hervé; Dupressoir, Thierry
2018-01-01
The Lepidopteran ambidensovirus 1 isolated from Junonia coenia (hereafter JcDV) is an invertebrate parvovirus considered as a viral transduction vector as well as a potential tool for the biological control of insect pests. Previous works showed that JcDV-based circular plasmids experimentally integrate into insect cells genomic DNA. In order to approach the natural conditions of infection and possible integration, we generated linear JcDV-gfp based molecules which were transfected into non permissive Spodoptera frugiperda (Sf9) cultured cells. Cells were monitored for the expression of green fluorescent protein (GFP) and DNA was analyzed for integration of transduced viral sequences. Non-structural protein modulation of the VP-gene cassette promoter activity was additionally assayed. We show that linear JcDV-derived molecules are capable of long term genomic integration and sustained transgene expression in Sf9 cells. As expected, only the deletion of both inverted terminal repeats (ITR) or the polyadenylation signals of NS and VP genes dramatically impairs the global transduction/expression efficiency. However, all the integrated viral sequences we characterized appear "scrambled" whatever the viral content of the transfected vector. Despite a strong GFP expression, we were unable to recover any full sequence of the original constructs and found rearranged viral and non-viral sequences as well. Cellular flanking sequences were identified as non-coding ones. On the other hand, the kinetics of GFP expression over time led us to investigate the apparent down-regulation by non-structural proteins of the VP-gene cassette promoter. Altogether, our results show that JcDV-derived sequences included in linear DNA molecules are able to drive efficiently the integration and expression of a foreign gene into the genome of insect cells, whatever their composition, provided that at least one ITR is present.
However, the transfected sequences were extensively rearranged with cellular DNA during or after random integration in the host cell genome. Lastly, the non-structural proteins seem to participate in the regulation of p9 promoter activity rather than to the integration of viral sequences.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Helfenbein, Kevin G.; Brown, Wesley M.; Boore, Jeffrey L.
We have sequenced the complete mitochondrial DNA (mtDNA) of the articulate brachiopod Terebratalia transversa. The circular genome is 14,291 bp in size, relatively small compared to other published metazoan mtDNAs. The 37 genes commonly found in animal mtDNA are present; the size decrease is due to the truncation of several tRNA, rRNA, and protein genes, to some nucleotide overlaps, and to a paucity of non-coding nucleotides. Although the gene arrangement differs radically from those reported for other metazoans, some gene junctions are shared with two other articulate brachiopods, Laqueus rubellus and Terebratulina retusa. All genes in the T. transversa mtDNA, unlike those in most metazoan mtDNAs reported, are encoded by the same strand. The A+T content (59.1 percent) is low for a metazoan mtDNA, and there is a high propensity for homopolymer runs and a strong base-compositional strand bias. The coding strand is quite G+T-rich, a skew that is shared by the confamilial (laqueid) species L. rubellus, but opposite to that found in T. retusa, a cancellothyridid. These compositional skews are strongly reflected in the codon usage patterns and the amino acid compositions of the mitochondrial proteins, with markedly different usage observed between T. retusa and the two laqueids. This observation, plus the similarity of the laqueid non-coding regions to the reverse complement of the non-coding region of the cancellothyridid, suggest that an inversion that resulted in a reversal in the direction of first-strand replication has occurred in one of the two lineages. In addition to the presence of one non-coding region in T. transversa that is comparable to those in the other brachiopod mtDNAs, there are two others with the potential to form secondary structures; one or both of these may be involved in the process of transcript cleavage.
Non-coding RNA sequence variations in human chronic lymphocytic leukemia and colorectal cancer.
Wojcik, Sylwia E; Rossi, Simona; Shimizu, Masayoshi; Nicoloso, Milena S; Cimmino, Amelia; Alder, Hansjuerg; Herlea, Vlad; Rassenti, Laura Z; Rai, Kanti R; Kipps, Thomas J; Keating, Michael J; Croce, Carlo M; Calin, George A
2010-02-01
Cancer is a genetic disease in which the interplay between alterations in protein-coding genes and non-coding RNAs (ncRNAs) plays a fundamental role. In recent years, the full coding component of the human genome was sequenced in various cancers, whereas such attempts related to ncRNAs are still fragmentary. We screened genomic DNAs for sequence variations in 148 microRNAs (miRNAs) and ultraconserved regions (UCRs) loci in patients with chronic lymphocytic leukemia (CLL) or colorectal cancer (CRC) by Sanger technique and further tried to elucidate the functional consequences of some of these variations. We found sequence variations in miRNAs in both sporadic and familial CLL cases, mutations of UCRs in CLLs and CRCs and, in certain instances, detected functional effects of these variations. Furthermore, by integrating our data with previously published data on miRNA sequence variations, we have created a catalog of DNA sequence variations in miRNAs/ultraconserved genes in human cancers. These findings argue that ncRNAs are targeted by both germ line and somatic mutations as well as by single-nucleotide polymorphisms with functional significance for human tumorigenesis. Sequence variations in ncRNA loci are frequent and some have functional and biological significance. Such information can be exploited to further investigate on a genome-wide scale the frequency of genetic variations in ncRNAs and their functional meaning, as well as for the development of new diagnostic and prognostic markers for leukemias and carcinomas. PMID:19926640
Trofimova, Irina; Krasikova, Alla
2016-12-01
Tandemly organized highly repetitive DNA sequences are crucial structural and functional elements of eukaryotic genomes. Despite extensive evidence, satellite DNA remains an enigmatic part of the eukaryotic genome, with biological role and significance of tandem repeat transcripts remaining rather obscure. Data on tandem repeats transcription in amphibian and avian model organisms is fragmentary despite their genomes being thoroughly characterized. Review systematically covers historical and modern data on transcription of amphibian and avian satellite DNA in somatic cells and during meiosis when chromosomes acquire special lampbrush form. We highlight how transcription of tandemly repetitive DNA sequences is organized in interphase nucleus and on lampbrush chromosomes. We offer LTR-activation hypotheses of widespread satellite DNA transcription initiation during oogenesis. Recent explanations are provided for the significance of high-yield production of non-coding RNA derived from tandemly organized highly repetitive DNA. In many cases the data on the transcription of satellite DNA can be extrapolated from lampbrush chromosomes to interphase chromosomes. Lampbrush chromosomes with applied novel technical approaches such as superresolution imaging, chromosome microdissection followed by high-throughput sequencing, dynamic observation in life-like conditions provide amazing opportunities for investigation mechanisms of the satellite DNA transcription. PMID:27763817
LaPolla, R J; Mayne, K M; Davidson, N
1984-01-01
A mouse cDNA clone has been isolated that contains the complete coding region of a protein highly homologous to the delta subunit of the Torpedo acetylcholine receptor (AcChoR). The cDNA library was constructed in the vector lambda 10 from membrane-associated poly(A)+ RNA from BC3H-1 mouse cells. Surprisingly, the delta clone was selected by hybridization with cDNA encoding the gamma subunit of the Torpedo AcChoR. The nucleotide sequence of the mouse cDNA clone contains an open reading frame of 520 amino acids. This amino acid sequence exhibits 59% and 50% sequence homology to the Torpedo AcChoR delta and gamma subunits, respectively. However, the mouse nucleotide sequence has several stretches of high homology with the Torpedo gamma subunit cDNA, but not with delta. The mouse protein has the same general structural features as do the Torpedo subunits. It is encoded by a 3.3-kilobase mRNA. There is probably only one, but at most two, chromosomal genes coding for this or closely related sequences. PMID:6096870
Wu, Yueh-Lung; Wu, Carol-P; Huang, Yu-Hui; Huang, Sheng-Ping; Lo, Huei-Ru; Chang, Hao-Shuo; Lin, Pi-Hsiu; Wu, Ming-Cheng; Chang, Chia-Jung; Chao, Yu-Chan
2014-11-01
The p143 gene from Autographa californica multinucleocapsid nucleopolyhedrovirus (AcMNPV) has been found to increase the expression of luciferase, which is driven by the polyhedrin gene promoter, in a plasmid with virus coinfection. Further study indicated that this is due to the presence of a replication origin (ori) in the coding region of this gene. Transient DNA replication assays showed that a specific fragment of the p143 coding sequence, p143-3, underwent virus-dependent DNA replication in Spodoptera frugiperda IPLB-Sf-21 (Sf-21) cells. Deletion analysis of the p143-3 fragment showed that subfragment p143-3.2a contained the essential sequence of this putative ori. Sequence analysis of this region revealed a unique distribution of imperfect palindromes with high AT contents. No sequence homology or similarity between p143-3.2a and any other known ori was detected, suggesting that it is a novel baculovirus ori. Further study showed that the p143-3.2a ori can replicate more efficiently in infected Sf-21 cells than baculovirus homologous regions (hrs), the major baculovirus ori, or non-hr oris during virus replication. Previously, hr on its own was unable to replicate in mammalian cells, and for mammalian viral oris, viral proteins are generally required for their proper replication in host cells. However, the p143-3.2a ori was, surprisingly, found to function as an efficient ori in mammalian cells without the need for any viral proteins. We conclude that p143 contains a unique sequence that can function as an ori to enhance gene expression in not only insect cells but also mammalian cells. Baculovirus DNA replication relies on both hr and non-hr oris; however, so far very little is known about the latter oris. Here we have identified a new non-hr ori, the p143 ori, which resides in the coding region of p143. By developing a novel DNA replication-enhanced reporter system, we have identified and located the core region required for the p143 ori. 
This ori contains a large number of imperfect inverted repeats and is the most active ori in the viral genome during virus infection in insect cells. We also found that it is a unique ori that can replicate in mammalian cells without the assistance of baculovirus gene products. The identification of this ori should contribute to a better understanding of baculovirus DNA replication. Also, this ori is very useful in assisting with gene expression in mammalian cells. Copyright © 2014, American Society for Microbiology. All Rights Reserved.
Bonen, Linda; Boer, Poppo H.; Gray, Michael W.
1984-01-01
We have determined the sequence of the wheat mitochondrial gene for cytochrome oxidase subunit II (COII) and find that its derived protein sequence differs from that of maize at only three amino acid positions. Unexpectedly, all three replacements are non-conservative ones. The wheat COII gene has a highly-conserved intron at the same position as in maize, but the wheat intron is 1.5 times longer because of an insert relative to its maize counterpart. Hybridization analysis of mitochondrial DNA from rye, pea, broad bean and cucumber indicates strong sequence conservation of COII coding sequences among all these higher plants. However, only rye and maize mitochondrial DNA show homology with wheat COII intron sequences and rye alone with intron-insert sequences. We find that a sequence identical to the region of the 5' exon corresponding to the transmembrane domain of the COII protein is present at a second genomic location in wheat mitochondria. These variations in COII gene structure and size, as well as the presence of repeated COII sequences, illustrate at the DNA sequence level, factors which contribute to higher plant mitochondrial DNA diversity and complexity. PMID:16453565
Scaling features of noncoding DNA
NASA Technical Reports Server (NTRS)
Stanley, H. E.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Peng, C. K.; Simons, M.
1999-01-01
We review evidence supporting the idea that the DNA sequence in genes containing noncoding regions is correlated, and that the correlation is remarkably long range--indeed, base pairs thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene, and utilize this fact to build a Coding Sequence Finder Algorithm, which uses statistical ideas to locate the coding regions of an unknown DNA sequence. Finally, we describe briefly some recent work adapting to DNA the Zipf approach to analyzing linguistic texts, and the Shannon approach to quantifying the "redundancy" of a linguistic text in terms of a measurable entropy function, and reporting that noncoding regions in eukaryotes display a larger redundancy than coding regions. Specifically, we consider the possibility that this result is solely a consequence of nucleotide concentration differences as first noted by Bonhoeffer and his collaborators. We find that cytosine-guanine (CG) concentration does have a strong "background" effect on redundancy. However, we find that for the purine-pyrimidine binary mapping rule, which is not affected by the difference in CG concentration, the Shannon redundancy for the set of analyzed sequences is larger for noncoding regions compared to coding regions.
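The purine-pyrimidine mapping and the Shannon redundancy mentioned above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names are hypothetical, and the redundancy is assumed here to take the common form R = 1 - H(n) / (n log2 k), where H(n) is the block entropy of overlapping n-symbol words and k is the alphabet size (k = 2 after the binary mapping).

```python
import math
from collections import Counter

PURINES = {"A", "G"}


def purine_pyrimidine(seq):
    """Map DNA to a binary string: purine (A/G) -> '1', pyrimidine (C/T) -> '0'.

    Characters other than A, C, G, T are skipped.
    """
    return "".join("1" if b in PURINES else "0" for b in seq.upper() if b in "ACGT")


def block_entropy(symbols, n):
    """Shannon entropy in bits of the overlapping n-symbol words of `symbols`."""
    if n < 1 or len(symbols) < n:
        raise ValueError("sequence must contain at least one word of length n")
    words = [symbols[i:i + n] for i in range(len(symbols) - n + 1)]
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in Counter(words).values())


def redundancy(symbols, n, alphabet_size=2):
    """Assumed definition: R = 1 - H(n) / (n * log2(alphabet_size))."""
    return 1.0 - block_entropy(symbols, n) / (n * math.log2(alphabet_size))
```

Because the binary mapping assigns one purine and one pyrimidine to each Watson-Crick pair, a shift in CG concentration does not by itself change the purine/pyrimidine balance, which is why this mapping is useful for separating composition effects from genuine redundancy. Real estimates need long sequences, since short inputs undersample the n-word distribution.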
Origin, evolution, and biogeography of Juglans: a phylogenetic perspective
USDA-ARS's Scientific Manuscript database
The eastern Asian and eastern North American disjunction in Juglans offers an opportunity to estimate the time since divergence of the Eurasian and American lineages and to compare it with paleobotanical evidences. Five chloroplast DNA non-coding spacer (NCS) sequences: trnT-trnF, psbA-trnH, atpB-r...
Detection of human microRNAs across miRNA Array and Next Generation DNA Sequencing Platforms
MicroRNAs (miRNAs) are non-coding RNA molecules between 19 and 30 nucleotides in length that are believed to regulate approximately 30 per cent of all human genes. They act as negative regulators of their gene targets in many biological processes. Recent developments in microar...
Sequence Polishing Library (SPL) v10.0
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oberortner, Ernst
The Sequence Polishing Library (SPL) is a suite of software tools that automate "Design for Synthesis and Assembly" workflows. Specifically: the SPL "Converter" tool converts files among the following sequence data exchange formats: CSV, FASTA, GenBank, and Synthetic Biology Open Language (SBOL). The SPL "Juggler" tool optimizes the codon usage of DNA coding sequences according to an optimization strategy, a user-specific codon usage table, and a genetic code; it can also translate amino acid sequences into DNA sequences. The SPL "Polisher" verifies DNA sequences against DNA synthesis constraints, such as GC content, repeating k-mers, and restriction sites. In case of violations, the "Polisher" reports them in a comprehensive manner, and it can also modify the violating regions according to an optimization strategy, a user-specific codon usage table, and a genetic code. The SPL "Partitioner" decomposes large DNA sequences into smaller building blocks with partial overlaps that enable efficient assembly. The "Partitioner" lets the user configure the characteristics of the overlaps, such as length, GC content, or melting temperature, which are mostly determined by the assembly protocol used.
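The three kinds of synthesis constraint named for the "Polisher" (GC content, repeating k-mers, restriction sites) can be illustrated with a short Python sketch. This is not the SPL API. Every function name, the GC thresholds, the default k, and the choice of EcoRI (recognition site GAATTC) are assumptions made only for illustration.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 to 1.0)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def repeated_kmers(seq, k):
    """Return {k-mer: count} for every k-mer seen more than once."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


def find_sites(seq, sites):
    """Return {enzyme: [0-based starts]}; scans the given strand only."""
    seq = seq.upper()
    hits = {}
    for name, motif in sites.items():
        positions, start = [], seq.find(motif)
        while start != -1:
            positions.append(start)
            start = seq.find(motif, start + 1)  # allow overlapping hits
        if positions:
            hits[name] = positions
    return hits


def check_sequence(seq, gc_range=(0.30, 0.70), k=8, sites=None):
    """Collect human-readable violation messages (thresholds are assumed)."""
    sites = sites or {"EcoRI": "GAATTC"}
    report = []
    gc = gc_content(seq)
    if not gc_range[0] <= gc <= gc_range[1]:
        report.append(f"GC content {gc:.2f} outside {gc_range}")
    for kmer, n in repeated_kmers(seq, k).items():
        report.append(f"{k}-mer {kmer} occurs {n} times")
    for name, pos in find_sites(seq, sites).items():
        report.append(f"{name} site at {pos}")
    return report
```

A real checker would also scan the reverse complement for non-palindromic sites. That step is omitted here to keep the sketch short.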
Non-coding RNAs in lung cancer
Ricciuti, Biagio; Mecca, Carmen; Crinò, Lucio; Baglivo, Sara; Cenci, Matteo; Metro, Giulio
2014-01-01
The discovery that protein-coding genes represent less than 2% of the human genome, and the evidence that more than 90% of it is actively transcribed, changed the classical view of the central dogma of molecular biology, which was always based on the assumption that RNA functions mainly as an intermediate bridge between DNA sequences and the protein synthesis machinery. Accumulating data indicate that non-coding RNAs are involved in different physiological processes, providing for the maintenance of cellular homeostasis. They are important regulators of gene expression, cellular differentiation, proliferation, migration, apoptosis, and stem cell maintenance. Alterations and disruptions of their expression or activity have increasingly been associated with pathological changes of cancer cells. This evidence, and the prospect of using these molecules as diagnostic markers and therapeutic targets, currently make non-coding RNAs among the most relevant molecules in cancer research. In this paper we provide an overview of non-coding RNA function and disruption in lung cancer biology, also focusing on their potential as diagnostic, prognostic and predictive biomarkers. PMID:25593996
Discrete Ramanujan transform for distinguishing the protein coding regions from other regions.
Hua, Wei; Wang, Jiasong; Zhao, Jian
2014-01-01
Based on the study of the Ramanujan sum and Ramanujan coefficient, this paper suggests the concepts of the discrete Ramanujan transform and spectrum. Using the Voss numerical representation, one maps a symbolic DNA strand to a numerical DNA sequence and deduces the discrete Ramanujan spectrum of the numerical DNA sequence. It is well known that the discrete Fourier power spectrum of a protein coding sequence has an important feature of 3-base periodicity, which is widely used for DNA sequence analysis by the technique of discrete Fourier transform. The analysis is performed by testing the signal-to-noise ratio at frequency N/3 as a criterion, where N is the length of the sequence. The results presented in this paper show that the property of 3-base periodicity can be identified as a prominent spike of the discrete Ramanujan spectrum at period 3 only for the protein coding regions. The signal-to-noise ratio for the discrete Ramanujan spectrum is defined for numerical measurement. Therefore, the discrete Ramanujan spectrum and the signal-to-noise ratio of a DNA sequence can be used for distinguishing the protein coding regions from the noncoding regions. All the exon and intron sequences in whole chromosomes 1, 2, 3 and 4 of Caenorhabditis elegans have been tested, and the histograms and tables from the computational results illustrate the reliability of our method. In addition, we have shown theoretically that the algorithm for calculating the discrete Ramanujan spectrum has lower computational complexity and higher computational accuracy. The computational experiments show that using the discrete Ramanujan spectrum to classify different DNA sequences is a fast and effective method. Copyright © 2014 Elsevier Ltd. All rights reserved.
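The Fourier criterion this abstract uses as its point of comparison (Voss indicator sequences plus a signal-to-noise ratio at frequency N/3) can be sketched in a few lines of Python. This illustrates the classical DFT baseline only, not the discrete Ramanujan transform proposed in the paper. The function names and the choice to exclude the zero-frequency term from the noise average are assumptions.

```python
import cmath


def voss_spectrum(seq):
    """Total power S[k] summed over the four Voss indicator sequences.

    Direct O(N^2) DFT, adequate for short illustrative sequences.
    """
    seq = seq.upper()
    n = len(seq)
    spectrum = []
    for k in range(n):
        total = 0.0
        for base in "ACGT":
            # DFT coefficient of the 0/1 indicator sequence for this base
            coeff = sum(cmath.exp(-2j * cmath.pi * k * i / n)
                        for i, b in enumerate(seq) if b == base)
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum


def period3_snr(seq):
    """Power at k = N/3 divided by the mean power over k = 1..N-1."""
    seq = seq[: len(seq) - len(seq) % 3]  # trim so that N/3 is an integer
    n = len(seq)
    spec = voss_spectrum(seq)
    mean_power = sum(spec[1:]) / (n - 1)  # assumption: skip the DC term
    return spec[n // 3] / mean_power
```

For a perfectly periodic toy input such as `"ATG" * 10`, all non-DC power sits at k = N/3 and k = 2N/3, so the ratio is large. Sequences without codon structure are expected to give values near 1.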
Zhao, A; Guo, A; Liu, Z; Pape, L
1997-01-01
The coding sequences for a Schizosaccharomyces pombe sequence-specific DNA binding protein, Reb1p, have been cloned. The predicted S. pombe Reb1p is 24-29% identical to mouse TTF-1 (transcription termination factor-1) and Saccharomyces cerevisiae REB1 protein, both of which direct termination of RNA polymerase I catalyzed transcripts. The S. pombe Reb1 cDNA encodes a predicted polypeptide of 504 amino acids with a predicted molecular weight of 58.4 kDa. The S. pombe Reb1p is unusual in that the bipartite DNA binding motif identified originally in S. cerevisiae and Kluyveromyces lactis REB1 proteins is uninterrupted and thus S. pombe Reb1p may contain the smallest natural REB1 homologous DNA binding domain. Its genomic coding sequences were shown to be interrupted by two introns. A recombinant histidine-tagged Reb1 protein bearing the rDNA binding domain has two homologous, sequence-specific binding sites in the S. pombe rDNA intergenic spacer, located between 289 and 480 nt downstream of the end of the approximately 25S rRNA coding sequences. Each binding site is 13-14 bp downstream of two of the three proposed in vivo termination sites. The core of this 17 bp site, AGGTAAGGGTAATGCAC, is specifically protected by Reb1p in footprinting analysis. PMID:9016645
Shi, Jiaqin; Huang, Shunmou; Fu, Donghui; Yu, Jinyin; Wang, Xinfa; Hua, Wei; Liu, Shengyi; Liu, Guihua; Wang, Hanzhong
2013-01-01
Despite their ubiquity and functional importance, microsatellites have been largely ignored in comparative genomics, mostly due to the lack of genomic information. In the current study, microsatellite distribution was characterized and compared in the whole genomes and both the coding and non-coding DNA sequences of the sequenced Brassica, Arabidopsis and other angiosperm species to investigate their evolutionary dynamics in plants. The variation in the microsatellite frequencies of these angiosperm species was much smaller than those for their microsatellite numbers and genome sizes, suggesting that microsatellite frequency may be relatively stable in plants. The microsatellite frequencies of these angiosperm species were significantly negatively correlated with both their genome sizes and transposable elements contents. The pattern of microsatellite distribution may differ according to the different genomic regions (such as coding and non-coding sequences). The observed differences in many important microsatellite characteristics (especially the distribution with respect to motif length, type and repeat number) of these angiosperm species were generally accordant with their phylogenetic distance, which suggested that the evolutionary dynamics of microsatellite distribution may be generally consistent with plant divergence/evolution. Importantly, by comparing these microsatellite characteristics (especially the distribution with respect to motif type) the angiosperm species (aside from a few species) all clustered into two obviously different groups that were largely represented by monocots and dicots, suggesting a complex and generally dichotomous evolutionary pattern of microsatellite distribution in angiosperms. 
Polyploidy may lead to a slight increase in microsatellite frequency in the coding sequences and a significant decrease in microsatellite frequency in the whole genome/non-coding sequences, but have little effect on the microsatellite distribution with respect to motif length, type and repeat number. Interestingly, several microsatellite characteristics seemed to be constant in plant evolution, which can be well explained by the general biological rules. PMID:23555856
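The quantities compared in this abstract (microsatellite frequency, motif length, repeat number) depend on how a microsatellite is defined. The following Python sketch shows one possible regex-based definition. The motif length limit (1 to 6 bp) and the minimum of three repeats are assumptions for illustration, not the thresholds used in the study.

```python
import re


def find_microsatellites(seq, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for each tandem repeat found.

    The lazy quantifier makes the regex prefer the shortest motif,
    so ACACACAC is reported as (AC)x4 rather than (ACAC)x2.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits


def frequency_per_kb(seq, **kwargs):
    """Microsatellite count normalized to 1,000 bp of sequence."""
    return 1000.0 * len(find_microsatellites(seq, **kwargs)) / len(seq)
```

Normalizing the count by sequence length is what separates a microsatellite frequency from a raw microsatellite number, the distinction the abstract draws when comparing genomes of different sizes.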
Michael, Todd P; Bryant, Douglas; Gutierrez, Ryan; Borisjuk, Nikolai; Chu, Philomena; Zhang, Hanzhong; Xia, Jing; Zhou, Junfei; Peng, Hai; El Baidouri, Moaine; Ten Hallers, Boudewijn; Hastie, Alex R; Liang, Tiffany; Acosta, Kenneth; Gilbert, Sarah; McEntee, Connor; Jackson, Scott A; Mockler, Todd C; Zhang, Weixiong; Lam, Eric
2017-02-01
Spirodela polyrhiza is a fast-growing aquatic monocot with highly reduced morphology, genome size and number of protein-coding genes. Considering these biological features of Spirodela and its basal position in the monocot lineage, understanding its genome architecture could shed light on plant adaptation and genome evolution. Like many draft genomes, however, the 158-Mb Spirodela genome sequence has not been resolved to chromosomes, and important genome characteristics have not been defined. Here we deployed rapid genome-wide physical maps combined with high-coverage short-read sequencing to resolve the 20 chromosomes of Spirodela and to empirically delineate its genome features. Our data revealed a dramatic reduction in the number of the rDNA repeat units in Spirodela to fewer than 100, which is even fewer than that reported for yeast. Consistent with its unique phylogenetic position, small RNA sequencing revealed 29 Spirodela-specific microRNA, with only two being shared with Elaeis guineensis (oil palm) and Musa balbisiana (banana). Combining DNA methylation data and small RNA sequencing enabled the accurate prediction of 20.5% long terminal repeats (LTRs) that doubled the previous estimate, and revealed a high Solo:Intact LTR ratio of 8.2. Interestingly, we found that Spirodela has the lowest global DNA methylation levels (9%) of any plant species tested. Taken together our results reveal a genome that has undergone reduction, likely through eliminating non-essential protein coding genes, rDNA and LTRs. In addition to delineating the genome features of this unique plant, the methodologies described and large-scale genome resources from this work will enable future evolutionary and functional studies of this basal monocot family. © 2016 The Authors The Plant Journal © 2016 John Wiley & Sons Ltd.
The complete mitochondrial genome of Pomacea canaliculata (Gastropoda: Ampullariidae).
Zhou, Xuming; Chen, Yu; Zhu, Shanliang; Xu, Haigen; Liu, Yan; Chen, Lian
2016-01-01
The mitochondrial genome of Pomacea canaliculata (Gastropoda: Ampullariidae) is the first complete mtDNA sequence reported in the genus Pomacea. The total length of the mtDNA is 15,707 bp, containing 13 protein-coding genes, 2 ribosomal RNAs, 22 transfer RNAs, and a 359 bp non-coding region. The A + T content of the overall base composition of the H-strand is 71.7% (T: 41%, C: 12.7%, A: 30.7%, G: 15.6%). The ATP6, ATP8, CO1, CO2, ND1-3, ND5, ND6, ND4L and Cyt b genes begin with ATG as the start codon, while CO3 and ND4 begin with ATA. The ATP8, CO2-3, ND4L, ND2-6 and Cyt b genes are terminated with TAA as the stop codon, while ATP6, ND1, and CO1 end with TAG. A long non-coding region is found, and a 23 bp repeat unit repeats 11 times in this region.
Sanges, Remo; Hadzhiev, Yavor; Gueroult-Bellone, Marion; Roure, Agnes; Ferg, Marco; Meola, Nicola; Amore, Gabriele; Basu, Swaraj; Brown, Euan R.; De Simone, Marco; Petrera, Francesca; Licastro, Danilo; Strähle, Uwe; Banfi, Sandro; Lemaire, Patrick; Birney, Ewan; Müller, Ferenc; Stupka, Elia
2013-01-01
Co-option of cis-regulatory modules has been suggested as a mechanism for the evolution of expression sites during development. However, the extent and mechanisms involved in mobilization of cis-regulatory modules remain elusive. To trace the history of non-coding elements, which may represent candidate ancestral cis-regulatory modules affirmed during chordate evolution, we have searched for conserved elements in tunicate and vertebrate (Olfactores) genomes. We identified, for the first time, 183 non-coding sequences that are highly conserved between the two groups. Our results show that all but one element are conserved in non-syntenic regions between vertebrate and tunicate genomes, while being syntenic among vertebrates. Nevertheless, in all the groups, they are significantly associated with transcription factors showing specific functions fundamental to animal development, such as multicellular organism development and sequence-specific DNA binding. The majority of these regions map onto ultraconserved elements and we demonstrate that they can act as functional enhancers within the organism of origin, as well as in cross-transgenesis experiments, and that they are transcribed in extant species of Olfactores. We refer to the elements as ‘Olfactores conserved non-coding elements’. PMID:23393190
Fister, Karin; Fister, Iztok; Murovec, Jana; Bohanec, Borut
2017-02-01
Plant breeders' rights are undergoing dramatic changes due to changes in patent rights in terms of plant variety rights protection. Although differences in the interpretation of "breeder's exemption", termed research exemption in the 1991 UPOV, did exist in the past in some countries, allowing breeders to use protected varieties as parents in the creation of new varieties of plants, current developments brought about by patenting conventionally bred varieties with the European Patent Office (such as EP2140023B1) have opened new challenges. Legal restrictions on germplasm availability are therefore imposed on breeders while, at the same time, no practical information on how to distinguish protected from non-protected varieties is given. We propose here a novel approach that would solve this problem by the insertion of short DNA stretches (labels) into protected plant varieties by genetic transformation. This information will then be available to breeders by a simple and standardized procedure. We propose that such a procedure should consist of using a pair of universal primers that will generate a sequence in a PCR reaction, which can be read and translated into ordinary text by a computer application. To demonstrate the feasibility of such an approach, we conducted a case study. Using the Agrobacterium tumefaciens transformation protocol, we inserted a stretch of DNA code into Nicotiana benthamiana. We also developed an on-line application that enables coding of any text message into DNA nucleotide code and, on sequencing, decoding it back into text. In the presented case study, a short command line coding the phrase "Hello world" was transformed into a DNA sequence that was inserted in the plant genome. The encoded message was reconstructed from the resulting T1 seedlings with 100% accuracy. The feasibility and possible other applications of this approach are discussed.
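The abstract does not state which text-to-nucleotide scheme the on-line application uses. As an illustration of the general idea only, the following Python sketch encodes UTF-8 text at two bits per base with an assumed mapping (00 to A, 01 to C, 10 to G, 11 to T) and decodes it again. The mapping and the function names are hypothetical.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def encode_text(text):
    """Encode UTF-8 text as DNA, four bases per byte (high bits first)."""
    out = []
    for byte in text.encode("utf-8"):
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def decode_dna(dna):
    """Invert encode_text; expects a length that is a multiple of four."""
    if len(dna) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")
```

Under this assumed scheme the phrase from the case study, "Hello world", occupies 44 nucleotides. A practical label would probably also need to avoid long homopolymer runs and unwanted restriction sites, which this sketch does not attempt.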
Polyomavirus BK non-coding control region rearrangements in health and disease.
Sharma, Preety M; Gupta, Gaurav; Vats, Abhay; Shapiro, Ron; Randhawa, Parmjeet S
2007-08-01
BK virus is an increasingly recognized pathogen in transplanted patients. DNA sequencing of this virus shows considerable genomic variability. To understand the clinical significance of rearrangements in the non-coding control region (NCCR) of BK virus (BKV), we report a meta-analysis of 507 sequences, including 40 sequences generated in our own laboratory, for associations between rearrangements and disease, tissue tropism, geographic origin, and viral genotype. NCCR rearrangements were less frequent in (a) asymptomatic BKV viruria compared to patients with viral nephropathy (1.7% vs. 22.5%), and (b) viral genotype 1 compared to other genotypes (2.4% vs. 11.2%). Rearrangements were more common in malignancy (78.6%) and in Norwegians (45.7%), and less common in East Indians (0%) and Japanese (4.3%). A surprising number of rearranged sequences were reported from mononuclear cells of healthy subjects, whereas most plasma sequences were archetypal. This difference could not be related to potential recombinase activity in lymphocytes, as consensus recombination signal sequences could not be found in the NCCR region. NCCR rearrangements are neither a necessary nor a sufficient condition to produce clinical disease. BKV nephropathy and hemorrhagic cystitis are not associated with any unique NCCR configuration or nucleotide sequence.
NASA Astrophysics Data System (ADS)
Karakatsanis, L. P.; Pavlos, G. P.; Iliopoulos, A. C.; Pavlos, E. G.; Clark, P. M.; Duke, J. L.; Monos, D. S.
2018-09-01
This study combines two independent domains of science, the high throughput DNA sequencing capabilities of Genomics and complexity theory from Physics, to assess the information encoded by the different genomic segments of exonic, intronic and intergenic regions of the Major Histocompatibility Complex (MHC) and identify possible interactive relationships. The dynamic and non-extensive statistical characteristics of two well characterized MHC sequences from the homozygous cell lines, PGF and COX, in addition to two other genomic regions of comparable size, used as controls, have been studied using the reconstructed phase space theorem and the non-extensive statistical theory of Tsallis. The results reveal similar non-linear dynamical behavior as far as complexity and self-organization features. In particular, the low-dimensional deterministic nonlinear chaotic and non-extensive statistical character of the DNA sequences was verified with strong multifractal characteristics and long-range correlations. The nonlinear indices repeatedly verified that MHC sequences, whether exonic, intronic or intergenic include varying levels of information and reveal an interaction of the genes with intergenic regions, whereby the lower the number of genes in a region, the less the complexity and information content of the intergenic region. Finally we showed the significance of the intergenic region in the production of the DNA dynamics. The findings reveal interesting content information in all three genomic elements and interactive relationships of the genes with the intergenic regions. The results most likely are relevant to the whole genome and not only to the MHC. These findings are consistent with the ENCODE project, which has now established that the non-coding regions of the genome remain to be of relevance, as they are functionally important and play a significant role in the regulation of expression of genes and coordination of the many biological processes of the cell.
Fukuda, Tomoyuki; Ohta, Kunihiro; Ohya, Yoshikazu
2006-06-01
VMA1-derived endonuclease (VDE), a homing endonuclease in Saccharomyces cerevisiae, is encoded by the mobile intein-coding sequence within the nuclear VMA1 gene. VDE recognizes and cleaves DNA at the 31-bp VDE recognition sequence (VRS) in the VMA1 gene lacking the intein-coding sequence during meiosis to insert a copy of the intein-coding sequence at the cleaved site. The mechanism underlying the meiosis specificity of VMA1 intein-coding sequence homing remains unclear. We studied various factors that might influence the cleavage activity in vivo and found that VDE binding to the VRS can be detected only when DNA cleavage by VDE takes place, implying that meiosis-specific DNA cleavage is regulated by the accessibility of VDE to its target site. As a possible candidate for the determinant of this accessibility, we analyzed chromatin structure around the VRS and revealed that local chromatin structure near the VRS is altered during meiosis. Although the meiotic chromatin alteration exhibits correlations with DNA binding and cleavage by VDE at the VMA1 locus, such a chromatin alteration is not necessarily observed when the VRS is embedded in ectopic gene loci. This suggests that nucleosome positioning or occupancy around the VRS by itself is not the sole mechanism for the regulation of meiosis-specific DNA cleavage by VDE and that other mechanisms are involved in the regulation. PMID:16757746
Giardina, P; Cannio, R; Martirani, L; Marzullo, L; Palmieri, G; Sannia, G
1995-01-01
The gene (pox1) encoding a phenol oxidase from Pleurotus ostreatus, a lignin-degrading basidiomycete, was cloned and sequenced, and the corresponding pox1 cDNA was also synthesized and sequenced. The isolated gene consists of 2,592 bp, with the coding sequence being interrupted by 19 introns and flanked by an upstream region in which putative CAAT and TATA consensus sequences could be identified at positions -174 and -84, respectively. The isolation of a second cDNA (pox2 cDNA), showing 84% similarity, and of the corresponding truncated genomic clones demonstrated the existence of a multigene family coding for isoforms of laccase in P. ostreatus. PCR amplifications of specific regions on the DNA of isolated monokaryons proved that the two genes are not allelic forms. The POX1 amino acid sequence deduced was compared with those of other known laccases from different fungi. PMID:7793961
Nagano, Yukio; Furuhashi, Hirofumi; Inaba, Takehito; Sasaki, Yukiko
2001-01-01
Complementary DNA encoding a DNA-binding protein, designated PLATZ1 (plant AT-rich sequence- and zinc-binding protein 1), was isolated from peas. The amino acid sequence of the protein is similar to those of other uncharacterized proteins predicted from the genome sequences of higher plants. However, no paralogous sequences have been found outside the plant kingdom. Multiple alignments among these paralogous proteins show that several cysteine and histidine residues are invariant, suggesting that these proteins are a novel class of zinc-dependent DNA-binding proteins with two distantly located regions, C-x2-H-x11-C-x2-C-x(4–5)-C-x2-C-x(3–7)-H-x2-H and C-x2-C-x(10–11)-C-x3-C. In an electrophoretic mobility shift assay, the zinc chelator 1,10-o-phenanthroline inhibited DNA binding, and two distant zinc-binding regions were required for DNA binding. A protein blot with 65ZnCl2 showed that both regions are required for zinc-binding activity. The PLATZ1 protein non-specifically binds to A/T-rich sequences, including the upstream region of the pea GTPase pra2 and plastocyanin petE genes. Expression of the PLATZ1 repressed those of the reporter constructs containing the coding sequence of luciferase gene driven by the cauliflower mosaic virus (CaMV) 35S90 promoter fused to the tandem repeat of the A/T-rich sequences. These results indicate that PLATZ1 is a novel class of plant-specific zinc-dependent DNA-binding protein responsible for A/T-rich sequence-mediated transcriptional repression. PMID:11600698
Hazes, Bart
2014-02-28
Protein-coding DNA sequences and their corresponding amino acid sequences are routinely used to study relationships between sequence, structure, function, and evolution. The rapidly growing size of sequence databases increases the power of such comparative analyses but it makes it more challenging to prepare high quality sequence data sets with control over redundancy, quality, completeness, formatting, and labeling. Software tools for some individual steps in this process exist but manual intervention remains a common and time consuming necessity. CDSbank is a database that stores both the protein-coding DNA sequence (CDS) and amino acid sequence for each protein annotated in Genbank. CDSbank also stores Genbank feature annotation, a flag to indicate incomplete 5' and 3' ends, full taxonomic data, and a heuristic to rank the scientific interest of each species. This rich information allows fully automated data set preparation with a level of sophistication that aims to meet or exceed manual processing. Defaults ensure ease of use for typical scenarios while allowing great flexibility when needed. Access is via a free web server at http://hazeslab.med.ualberta.ca/CDSbank/. CDSbank presents a user-friendly web server to download, filter, format, and name large sequence data sets. Common usage scenarios can be accessed via pre-programmed default choices, while optional sections give full control over the processing pipeline. Particular strengths are: extract protein-coding DNA sequences just as easily as amino acid sequences, full access to taxonomy for labeling and filtering, awareness of incomplete sequences, and the ability to take one protein sequence and extract all synonymous CDS or identical protein sequences in other species. Finally, CDSbank can also create labeled property files to, for instance, annotate or re-label phylogenetic trees.
Adachi, Noboru; Umetsu, Kazuo; Shojo, Hideki
2014-01-01
Mitochondrial DNA (mtDNA) is widely used for DNA analysis of highly degraded samples because of its polymorphic nature and high number of copies in a cell. However, as endogenous mtDNA in deteriorated samples is scarce and highly fragmented, it is not easy to obtain reliable data. In the current study, we report the risks of direct sequencing mtDNA in highly degraded material, and suggest a strategy to ensure the quality of sequencing data. It was observed that direct sequencing data of the hypervariable segment (HVS) 1 by using primer sets that generate an amplicon of 407 bp (long-primer sets) was different from results obtained by using newly designed primer sets that produce an amplicon of 120-139 bp (mini-primer sets). The data aligned with the results of mini-primer sets analysis in an amplicon length-dependent manner; the shorter the amplicon, the more evident the endogenous sequence became. Coding region analysis using multiplex amplified product-length polymorphisms revealed the incongruence of single nucleotide polymorphisms between the coding region and HVS 1 caused by contamination with exogenous mtDNA. Although the sequencing data obtained using long-primer sets turned out to be erroneous, it was unambiguous and reproducible. These findings suggest that PCR primers that produce amplicons shorter than those currently recognized should be used for mtDNA analysis in highly degraded samples. Haplogroup motif analysis of the coding region and HVS should also be performed to improve the reliability of forensic mtDNA data. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Cloning and sequence analysis of Hemonchus contortus HC58cDNA.
Muleke, Charles I; Ruofeng, Yan; Lixin, Xu; Xinwen, Bo; Xiangrui, Li
2007-06-01
The complete coding sequence of Hemonchus contortus HC58cDNA was generated by rapid amplification of cDNA ends and polymerase chain reaction using primers based on the 5' and 3' ends of the parasite mRNA, accession no. AF305964. The HC58cDNA gene was 851 bp long, with open reading frame of 717 bp, precursors to 239 amino acids coding for approximately 27 kDa protein. Analysis of amino acid sequence revealed conserved residues of cysteine, histidine, asparagine, occluding loop pattern, hemoglobinase motif and glutamine of the oxyanion hole characteristic of cathepsin B like proteases (CBL). Comparison of the predicted amino acid sequences showed the protein shared 33.5-58.7% identity to cathepsin B homologues in the papain clan CA family (family C1). Phylogenetic analysis revealed close evolutionary proximity of the protein sequence to counterpart sequences in the CBL, suggesting that HC58cDNA was a member of the papain family.
Jaramillo-Correa, J P; Bousquet, J; Beaulieu, J; Isabel, N; Perron, M; Bouillé, M
2003-05-01
Primers previously developed to amplify specific non-coding regions of the mitochondrial genome in Angiosperms, and new primers for additional non-coding mtDNA regions, were tested for their ability to direct DNA amplification in 12 conifer taxa and to detect sequence-tagged-site (STS) polymorphisms within and among eight species in Picea. Out of 12 primer pairs, nine were successful at amplifying mtDNA in most of the taxa surveyed. In conifers, indels and substitutions were observed for several loci, allowing them to distinguish between families, genera and, in some cases, between species within genera. In Picea, interspecific polymorphism was detected for four loci, while intraspecific variation was observed for three of the mtDNA regions studied. One of these (SSU rRNA V1 region) exhibited indel polymorphisms, and the two others (nad1 intron b/c and nad5 intron1) revealed restriction differences after digestion with Sau3AI (PCR-RFLP). A fourth locus, the nad4L-orf25 intergenic region, showed a multibanding pattern for most of the spruce species, suggesting a possible gene duplication. Maternal inheritance, expected for mtDNA in conifers, was observed for all polymorphic markers except the intergenic region nad4L-orf25. Pooling of the variation observed with the remaining three markers resulted in two to six different mtDNA haplotypes within the different species of Picea. Evidence for intra-genomic recombination was observed in at least two taxa. Thus, these mitotypes are likely to be more informative than single-locus haplotypes. They should be particularly useful for the study of biogeography and the dynamics of hybrid zones.
DNA-binding proteins from marine bacteria expand the known sequence diversity of TALE-like repeats.
de Lange, Orlando; Wolf, Christina; Thiel, Philipp; Krüger, Jens; Kleusch, Christian; Kohlbacher, Oliver; Lahaye, Thomas
2015-11-16
Transcription Activator-Like Effectors (TALEs) of Xanthomonas bacteria are programmable DNA binding proteins with unprecedented target specificity. Comparative studies into TALE repeat structure and function are hindered by the limited sequence variation among TALE repeats. More sequence-diverse TALE-like proteins are known from Ralstonia solanacearum (RipTALs) and Burkholderia rhizoxinica (Bats), but RipTAL and Bat repeats are conserved with those of TALEs around the DNA-binding residue. We study two novel marine-organism TALE-like proteins (MOrTL1 and MOrTL2), the first to date of non-terrestrial origin. We have assessed their DNA-binding properties and modelled repeat structures. We found that repeats from these proteins mediate sequence specific DNA binding conforming to the TALE code, despite low sequence similarity to TALE repeats, and with novel residues around the BSR. However, MOrTL1 repeats show greater sequence discriminating power than MOrTL2 repeats. Sequence alignments show that there are only three residues conserved between repeats of all TALE-like proteins including the two new additions. This conserved motif could prove useful as an identifier for future TALE-likes. Additionally, comparing MOrTL repeats with those of other TALE-likes suggests a common evolutionary origin for the TALEs, RipTALs and Bats. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Linear and Nonlinear Statistical Characterization of DNA
NASA Astrophysics Data System (ADS)
Norio Oiwa, Nestor; Goldman, Carla; Glazier, James
2002-03-01
We find spatial order in the distribution of protein-coding (including RNAs) and control segments of GenBank genomic sequences, irrespective of ATCG content. The order is detected using correlations, histograms, fractal dimensions and singularity spectra. Estimates of these quantities in complete nuclear genomes indicate that coding sequences are long-range correlated and that their disposition is self-similar (multifractal) in eukaryotes. These characteristics are absent in prokaryotes, where there are few noncoding sequences, suggesting that "junk" DNA plays a relevant role in genome structure and function. Concerning the genetic message of ATCG sequences, we build a random walk (Levy flight), using DNA symmetry arguments, where we associate A, T, C and G with left, right, down and up steps, respectively. Nonlinear analysis of mitochondrial DNA walks reveals a multifractal pattern based on palindromic sequences, which fold into hairpins and loops.
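The step assignment in this abstract (A, T, C and G as left, right, down and up) can be sketched as a two-dimensional walk. This is a simplified unit-step version under stated assumptions: the function name `dna_walk` is hypothetical, ambiguous symbols are skipped, and the Levy-flight construction and multifractal analysis used by the authors are not reproduced.

```python
# Unit steps per nucleotide, following the abstract:
# A = left, T = right, C = down, G = up.
STEPS = {"A": (-1, 0), "T": (1, 0), "C": (0, -1), "G": (0, 1)}


def dna_walk(sequence):
    """Return the (x, y) positions visited by the walk, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for base in sequence.upper():
        if base not in STEPS:
            continue  # assumption: ambiguous symbols such as N contribute no step
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


# Example: the end point of the walk for a short sequence.
end_point = dna_walk("GATTACA")[-1]
```

The returned path could then be passed to whatever correlation or fractal-dimension estimator is of interest.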
Kotoula, Vassiliki; Lyberopoulou, Aggeliki; Papadopoulou, Kyriaki; Charalambous, Elpida; Alexopoulou, Zoi; Gakou, Chryssa; Lakis, Sotiris; Tsolaki, Eleftheria; Lilakos, Konstantinos; Fountzilas, George
2015-01-01
Background—Aim Massively parallel sequencing (MPS) holds promise for expanding cancer translational research and diagnostics. As yet, it has been applied on paraffin DNA (FFPE) with commercially available highly multiplexed gene panels (100s of DNA targets), while custom panels of low multiplexing are used for re-sequencing. Here, we evaluated the performance of two highly multiplexed custom panels on FFPE DNA. Methods Two custom multiplex amplification panels (B, 373 amplicons; T, 286 amplicons) were coupled with semiconductor sequencing on DNA samples from FFPE breast tumors and matched peripheral blood samples (n samples: 316; n libraries: 332). The two panels shared 37% DNA targets (common or shifted amplicons). Panel performance was evaluated in paired sample groups and quartets of libraries, where possible. Results Amplicon read ratios yielded similar patterns per gene with the same panel in FFPE and blood samples; however, performance of common amplicons differed between panels (p<0.001). FFPE genotypes were compared for 1267 coding and non-coding variant replicates, 999 out of which (78.8%) were concordant in different paired sample combinations. Variant frequency was highly reproducible (Spearman’s rho 0.959). Repeatedly discordant variants were of high coverage / low frequency (p<0.001). Genotype concordance was (a) high, for intra-run duplicates with the same panel (mean±SD: 97.2±4.7, 95%CI: 94.8–99.7, p<0.001); (b) modest, when the same DNA was analyzed with different panels (mean±SD: 81.1±20.3, 95%CI: 66.1–95.1, p = 0.004); and (c) low, when different DNA samples from the same tumor were compared with the same panel (mean±SD: 59.9±24.0; 95%CI: 43.3–76.5; p = 0.282). Low coverage / low frequency variants were validated with Sanger sequencing even in samples with unfavourable DNA quality. Conclusions Custom MPS may yield novel information on genomic alterations, provided that data evaluation is adjusted to tumor tissue FFPE DNA. 
To this scope, eligibility of all amplicons along with variant coverage and frequency need to be assessed. PMID:26039550
Multiple tag labeling method for DNA sequencing
Mathies, Richard A.; Huang, Xiaohua C.; Quesada, Mark A.
1995-01-01
A DNA sequencing method is described which uses single lane or channel electrophoresis. Sequencing fragments are separated in the lane and detected using a laser-excited, confocal fluorescence scanner. Each set of DNA sequencing fragments is separated in the same lane and then distinguished using a binary coding scheme employing only two different fluorescent labels. Also described is a method of using radioisotope labels.
Botero, Adriana; Kapeller, Irit; Cooper, Crystal; Clode, Peta L; Shlomai, Joseph; Thompson, R C Andrew
2018-05-17
Kinetoplast DNA (kDNA) is the mitochondrial genome of trypanosomatids. It consists of a few dozen maxicircles and several thousand minicircles, all catenated topologically to form a two-dimensional DNA network. Minicircles are heterogeneous in size and sequence among species. They present one or several conserved regions that contain three highly conserved sequence blocks. CSB-1 (10 bp sequence) and CSB-2 (8 bp sequence) present lower interspecies homology, while CSB-3 (12 bp sequence) or the Universal Minicircle Sequence is conserved within most trypanosomatids. The Universal Minicircle Sequence is located at the replication origin of the minicircles, and is the binding site for the UMS binding protein, a protein involved in trypanosomatid survival and virulence. Here, we describe the structure and organisation of the kDNA of Trypanosoma copemani, a parasite that has been shown to infect mammalian cells and has been associated with the drastic decline of the endangered Australian marsupial, the woylie (Bettongia penicillata). Deep genomic sequencing showed that T. copemani presents two classes of minicircles that share sequence identity and organisation in the conserved sequence blocks with those of Trypanosoma cruzi and Trypanosoma lewisi. A 19,257 bp partial region of the maxicircle of T. copemani that contained the entire coding region was obtained. Comparative analysis of the T. copemani entire maxicircle coding region with the coding regions of T. cruzi and T. lewisi showed they share 71.05% and 71.28% identity, respectively. The shared features in the maxicircle/minicircle organisation and sequence between T. copemani and T. cruzi/T. lewisi suggest similarities in their process of kDNA replication, and are of significance in understanding the evolution of Australian trypanosomes. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.
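Identity figures such as the 71.05% and 71.28% quoted above are typically computed from a pairwise alignment. The sketch below shows one common convention, assuming the two sequences are already aligned and that columns containing a gap are excluded from the count. The abstract does not state which convention the authors used, and `percent_identity` is a hypothetical helper name.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical positions over ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are ignored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared
```

Other conventions divide by the full alignment length or by the shorter sequence, so reported values are only comparable when the denominator is the same.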
Artificial Intelligence, DNA Mimicry, and Human Health.
Stefano, George B; Kream, Richard M
2017-08-14
The molecular evolution of genomic DNA across diverse plant and animal phyla involved dynamic registrations of sequence modifications to maintain existential homeostasis to increasingly complex patterns of environmental stressors. As an essential corollary, driver effects of positive evolutionary pressure are hypothesized to effect concerted modifications of genomic DNA sequences to meet expanded platforms of regulatory controls for successful implementation of advanced physiological requirements. It is also clearly apparent that preservation of updated registries of advantageous modifications of genomic DNA sequences requires coordinate expansion of convergent cellular proofreading/error correction mechanisms that are encoded by reciprocally modified genomic DNA. Computational expansion of operationally defined DNA memory extends to coordinate modification of coding and previously under-emphasized noncoding regions that now appear to represent essential reservoirs of untapped genetic information amenable to evolutionary driven recruitment into the realm of biologically active domains. Additionally, expansion of DNA memory potential via chemical modification and activation of noncoding sequences is targeted to vertical augmentation and integration of an expanded cadre of transcriptional and epigenetic regulatory factors affecting linear coding of protein amino acid sequences within open reading frames.
Shapiro, James A
2016-06-08
The 21st century genomics-based analysis of evolutionary variation reveals a number of novel features impossible to predict when Dobzhansky and other evolutionary biologists formulated the neo-Darwinian Modern Synthesis in the middle of the last century. These include three distinct realms of cell evolution; symbiogenetic fusions forming eukaryotic cells with multiple genome compartments; horizontal organelle, virus and DNA transfers; functional organization of proteins as systems of interacting domains subject to rapid evolution by exon shuffling and exonization; distributed genome networks integrated by mobile repetitive regulatory signals; and regulation of multicellular development by non-coding lncRNAs containing repetitive sequence components. Rather than single gene traits, all phenotypes involve coordinated activity by multiple interacting cell molecules. Genomes contain abundant and functional repetitive components in addition to the unique coding sequences envisaged in the early days of molecular biology. Combinatorial coding, plus the biochemical abilities cells possess to rearrange DNA molecules, constitute a powerful toolbox for adaptive genome rewriting. That is, cells possess "Read-Write Genomes" they alter by numerous biochemical processes capable of rapidly restructuring cellular DNA molecules. Rather than viewing genome evolution as a series of accidental modifications, we can now study it as a complex biological process of active self-modification.
Shapiro, James A.
2016-01-01
The 21st century genomics-based analysis of evolutionary variation reveals a number of novel features impossible to predict when Dobzhansky and other evolutionary biologists formulated the neo-Darwinian Modern Synthesis in the middle of the last century. These include three distinct realms of cell evolution; symbiogenetic fusions forming eukaryotic cells with multiple genome compartments; horizontal organelle, virus and DNA transfers; functional organization of proteins as systems of interacting domains subject to rapid evolution by exon shuffling and exonization; distributed genome networks integrated by mobile repetitive regulatory signals; and regulation of multicellular development by non-coding lncRNAs containing repetitive sequence components. Rather than single gene traits, all phenotypes involve coordinated activity by multiple interacting cell molecules. Genomes contain abundant and functional repetitive components in addition to the unique coding sequences envisaged in the early days of molecular biology. Combinatorial coding, plus the biochemical abilities cells possess to rearrange DNA molecules, constitute a powerful toolbox for adaptive genome rewriting. That is, cells possess “Read–Write Genomes” they alter by numerous biochemical processes capable of rapidly restructuring cellular DNA molecules. Rather than viewing genome evolution as a series of accidental modifications, we can now study it as a complex biological process of active self-modification. PMID:27338490
DeWitt, D L; Smith, W L
1988-01-01
Prostaglandin G/H synthase (8,11,14-icosatrienoate, hydrogen-donor:oxygen oxidoreductase, EC 1.14.99.1) catalyzes the first step in the formation of prostaglandins and thromboxanes, the conversion of arachidonic acid to prostaglandin endoperoxides G and H. This enzyme is the site of action of nonsteroidal anti-inflammatory drugs. We have isolated a 2.7-kilobase complementary DNA (cDNA) encompassing the entire coding region of prostaglandin G/H synthase from sheep vesicular glands. This cDNA, cloned from a lambda gt 10 library prepared from poly(A)+ RNA of vesicular glands, hybridizes with a single 2.75-kilobase mRNA species. The cDNA clone was selected using oligonucleotide probes modeled from amino acid sequences of tryptic peptides prepared from the purified enzyme. The full-length cDNA encodes a protein of 600 amino acids, including a signal sequence of 24 amino acids. Identification of the cDNA as coding for prostaglandin G/H synthase is based on comparison of amino acid sequences of seven peptides comprising 103 amino acids with the amino acid sequence deduced from the nucleotide sequence of the cDNA. The molecular weight of the unglycosylated enzyme lacking the signal peptide is 65,621. The synthase is a glycoprotein, and there are three potential sites for N-glycosylation, two of them in the amino-terminal half of the molecule. The serine reported to be acetylated by aspirin is at position 530, near the carboxyl terminus. There is no significant similarity between the sequence of the synthase and that of any other protein in amino acid or nucleotide sequence libraries, and a heme binding site(s) is not apparent from the amino acid sequence. The availability of a full-length cDNA clone coding for prostaglandin G/H synthase should facilitate studies of the regulation of expression of this enzyme and the structural features important for catalysis and for interaction with anti-inflammatory drugs. PMID:3125548
Novel variants of the 5S rRNA genes in Eruca sativa.
Singh, K; Bhatia, S; Lakshmikumaran, M
1994-02-01
The 5S ribosomal RNA (rRNA) genes of Eruca sativa were cloned and characterized. They are organized into clusters of tandemly repeated units. Each repeat unit consists of a 119-bp coding region followed by a noncoding spacer region that separates it from the coding region of the next repeat unit. Our study reports novel gene variants of the 5S rRNA genes in plants. Two families of the 5S rDNA, the 0.5-kb size family and the 1-kb size family, coexist in the E. sativa genome. The 0.5-kb size family consists of the 5S rRNA genes (S4) that have coding regions similar to those of other reported plant 5S rDNA sequences, whereas the 1-kb size family consists of the 5S rRNA gene variants (S1) that exist as 1-kb BamHI tandem repeats. S1 is made up of two variant units (V1 and V2) of 5S rDNA where the BamHI site between the two units is mutated. Sequence heterogeneity among S4, V1, and V2 units exists throughout the sequence and is not limited to the noncoding spacer region only. The coding regions of V1 and V2 show approximately 20% dissimilarity to the coding regions of S4 and other reported plant 5S rDNA sequences. Such a large variation in the coding regions of the 5S rDNA units within the same plant species has been observed for the first time. Restriction site variation is observed between the two size classes of 5S rDNA in E. sativa.(ABSTRACT TRUNCATED AT 250 WORDS)
Non-Coding RNA Analysis Using the Rfam Database.
Kalvari, Ioanna; Nawrocki, Eric P; Argasinska, Joanna; Quinones-Olvera, Natalia; Finn, Robert D; Bateman, Alex; Petrov, Anton I
2018-06-01
Rfam is a database of non-coding RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. Using a combination of manual and literature-based curation and a custom software pipeline, Rfam converts descriptions of RNA families found in the scientific literature into computational models that can be used to annotate RNAs belonging to those families in any DNA or RNA sequence. Valuable research outputs that are often locked up in figures and supplementary information files are encapsulated in Rfam entries and made accessible through the Rfam Web site. The data produced by Rfam have a broad application, from genome annotation to providing training sets for algorithm development. This article gives an overview of how to search and navigate the Rfam Web site, and how to annotate sequences with RNA families. The Rfam database is freely available at http://rfam.org. © 2018 by John Wiley & Sons, Inc.
TFIIS-Dependent Non-coding Transcription Regulates Developmental Genome Rearrangements
Maliszewska-Olejniczak, Kamila; Gruchota, Julita; Gromadka, Robert; Denby Wilkes, Cyril; Arnaiz, Olivier; Mathy, Nathalie; Duharcourt, Sandra; Bétermier, Mireille; Nowak, Jacek K.
2015-01-01
Because of their nuclear dimorphism, ciliates provide a unique opportunity to study the role of non-coding RNAs (ncRNAs) in the communication between germline and somatic lineages. In these unicellular eukaryotes, a new somatic nucleus develops at each sexual cycle from a copy of the zygotic (germline) nucleus, while the old somatic nucleus degenerates. In the ciliate Paramecium tetraurelia, the genome is massively rearranged during this process through the reproducible elimination of repeated sequences and the precise excision of over 45,000 short, single-copy Internal Eliminated Sequences (IESs). Different types of ncRNAs resulting from genome-wide transcription were shown to be involved in the epigenetic regulation of genome rearrangements. To understand how ncRNAs are produced from the entire genome, we have focused on a homolog of the TFIIS elongation factor, which regulates RNA polymerase II transcriptional pausing. Six TFIIS-paralogs, representing four distinct families, can be found in P. tetraurelia genome. Using RNA interference, we showed that TFIIS4, which encodes a development-specific TFIIS protein, is essential for the formation of a functional somatic genome. Molecular analyses and high-throughput DNA sequencing upon TFIIS4 RNAi demonstrated that TFIIS4 is involved in all kinds of genome rearrangements, including excision of ~48% of IESs. Localization of a GFP-TFIIS4 fusion revealed that TFIIS4 appears specifically in the new somatic nucleus at an early developmental stage, before IES excision. RT-PCR experiments showed that TFIIS4 is necessary for the synthesis of IES-containing non-coding transcripts. We propose that these IES+ transcripts originate from the developing somatic nucleus and serve as pairing substrates for germline-specific short RNAs that target elimination of their homologous sequences. 
Our study, therefore, connects the onset of zygotic non-coding transcription to the control of genome plasticity in Paramecium, and establishes for the first time a specific role of TFIIS in non-coding transcription in eukaryotes. PMID:26177014
Multiple tag labeling method for DNA sequencing
Mathies, R.A.; Huang, X.C.; Quesada, M.A.
1995-07-25
A DNA sequencing method is described which uses single lane or channel electrophoresis. Sequencing fragments are separated in the lane and detected using a laser-excited, confocal fluorescence scanner. Each set of DNA sequencing fragments is separated in the same lane and then distinguished using a binary coding scheme employing only two different fluorescent labels. Also described is a method of using radioisotope labels. 5 figs.
2011-01-01
Background The melon belongs to the Cucurbitaceae family, whose economic importance among vegetable crops is second only to Solanaceae. The melon has a small genome size (454 Mb), which makes it suitable for molecular and genetic studies. Despite similar nuclear and chloroplast genome sizes, cucurbits show great variation when their mitochondrial genomes are compared. The melon possesses the largest plant mitochondrial genome, as much as eight times larger than that of other cucurbits. Results The nucleotide sequences of the melon chloroplast and mitochondrial genomes were determined. The chloroplast genome (156,017 bp) included 132 genes, with 98 single-copy genes dispersed between the small (SSC) and large (LSC) single-copy regions and 17 duplicated genes in the inverted repeat regions (IRa and IRb). A comparison of the cucumber and melon chloroplast genomes showed differences in only approximately 5% of nucleotides, mainly due to short indels and SNPs. Additionally, 2.74 Mb of mitochondrial sequence, accounting for 95% of the estimated mitochondrial genome size, were assembled into five scaffolds and four additional unscaffolded contigs. A single scaffold contains 84% of the mitochondrial genome. The gene-coding region accounted for 1.7% (45,926 bp) of the total sequence, including 51 protein-coding genes, 4 conserved ORFs, 3 rRNA genes and 24 tRNA genes. Despite the differences observed in the mitochondrial genome sizes of cucurbit species, Citrullus lanatus (379 kb), Cucurbita pepo (983 kb) and Cucumis melo (2,740 kb) share 120 kb of sequence, including the predicted protein-coding regions. Nevertheless, melon contained a high number of repetitive sequences and a high content of DNA of nuclear origin, which represented 42% and 47% of the total sequence, respectively. 
Conclusions Whereas the size and gene organisation of chloroplast genomes are similar among the cucurbit species, mitochondrial genomes show a wide variety of sizes, with a non-conserved structure both in gene number and organisation, as well as in the features of the noncoding DNA. The transfer of nuclear DNA to the melon mitochondrial genome and the high proportion of repetitive DNA appear to explain the size of the largest mitochondrial genome reported so far. PMID:21854637
Molecular dynamics study of some non-hydrogen-bonding base pair DNA strands
NASA Astrophysics Data System (ADS)
Tiwari, Rakesh K.; Ojha, Rajendra P.; Tiwari, Gargi; Pandey, Vishnudatt; Mall, Vijaysree
2018-05-01
To elucidate the structural activity of hydrophobically modified DNA, the DMMO2-D5SICS base pair was introduced as a constituent of different sets of 12-mer and 14-mer DNA sequences for molecular dynamics (MD) simulation in explicit water solvent. The AMBER 14 force field was employed for each duplex during the 200 ns production simulation in an orthogonal water box, using the Particle-Mesh-Ewald (PME) method under periodic boundary conditions (PBC), to determine the conformational parameters of the complex. The force-field parameters of the modified base pair were calculated with the Gaussian code using the ab initio Hartree-Fock method. RMSD results reveal that the conformation of the duplex is sequence dependent and that the binding energy of the complex depends on the position of the modified base pair in the nucleic acid strand. We found that non-bonding energy made a significant contribution to stabilising this type of duplex in comparison with electrostatic energy. The distortion produced within the strands by this type of base pair was local and destabilised the duplex near the substitution site. Moreover, the binding energy of the duplex depends on both the position of the hydrophobic base-pair substitution and the DNA sequence, which strongly supports the corresponding experimental study.
GRID-seq reveals the global RNA-chromatin interactome
Li, Xiao; Zhou, Bing; Chen, Liang; Gou, Lan-Tao; Li, Hairi; Fu, Xiang-Dong
2017-01-01
Higher eukaryotic genomes are bound by a large number of coding and non-coding RNAs, but approaches to comprehensively map the identity and binding sites of these RNAs are lacking. Here we report a method to in situ capture global RNA interactions with DNA by deep sequencing (GRID-seq), which enables the comprehensive identification of the entire repertoire of chromatin-interacting RNAs and their respective binding sites. In human, mouse and Drosophila cells, we detected a large set of tissue-specific coding and non-coding RNAs that are bound to active promoters and enhancers, especially super-enhancers. Assuming that most mRNA-chromatin interactions indicate the physical proximity of a promoter and an enhancer, we constructed a three-dimensional global connectivity map of promoters and enhancers, revealing transcription activity-linked genomic interactions in the nucleus. PMID:28922346
DNA Multiple Sequence Alignment Guided by Protein Domains: The MSA-PAD 2.0 Method.
Balech, Bachir; Monaco, Alfonso; Perniola, Michele; Santamaria, Monica; Donvito, Giacinto; Vicario, Saverio; Maggi, Giorgio; Pesole, Graziano
2018-01-01
Multiple sequence alignment (MSA) is a fundamental component in many DNA sequence analyses including metagenomics studies and phylogeny inference. When guided by protein profiles, DNA multiple alignments assume a higher precision and robustness. Here we present details of the use of the upgraded version of MSA-PAD (2.0), which is a DNA multiple sequence alignment framework able to align DNA sequences coding for single/multiple protein domains guided by PFAM or user-defined annotations. MSA-PAD has two alignment strategies, called "Gene" and "Genome," accounting for coding domains order and genomic rearrangements, respectively. Novel options were added to the present version, where the MSA can be guided by protein profiles provided by the user. This allows MSA-PAD 2.0 to run faster and to add custom protein profiles sometimes not present in PFAM database according to the user's interest. MSA-PAD 2.0 is currently freely available as a Web application at https://recasgateway.cloud.ba.infn.it/ .
Shen, Kang-Ning; Chen, Ching-Hung; Hsiao, Chung-Der; Durand, Jean-Dominique
2016-09-01
In this study, the complete mitogenome of a cryptic species from East Australia (Mugil sp. H) belonging to the worldwide Mugil cephalus species complex (Teleostei: Mugilidae) was sequenced by a next-generation sequencing method. The assembled mitogenome, consisting of 16,845 bp, had the typical vertebrate mitochondrial gene arrangement, including 13 protein-coding genes, 22 transfer RNA genes, 2 ribosomal RNA genes and a non-coding control region (D-loop). The D-loop is 1067 bp long and is located between tRNA-Pro and tRNA-Phe. The overall base composition of East Australia M. cephalus is 28.4% A, 29.3% C, 15.4% G and 26.9% T. The complete mitogenome may provide essential DNA molecular data for further phylogenetic and evolutionary analysis of the flathead mullet species complex.
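The overall base composition reported above is a per-nucleotide percentage over the assembled sequence. A minimal sketch of that calculation follows, assuming only unambiguous A, C, G and T are counted and that values are rounded to one decimal place as in the abstract. `base_composition` is a hypothetical helper, not part of any tool named in the record.

```python
from collections import Counter


def base_composition(sequence):
    """Return the percentage of A, C, G and T in a sequence, rounded to 0.1%."""
    # Assumption: ambiguous symbols (for example N) are excluded from the total.
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: round(100.0 * counts[base] / total, 1) for base in "ACGT"}


# Example on a toy sequence; a real mitogenome would be read from a FASTA file.
composition = base_composition("AACGTTTTNN")
```

Because each value is rounded independently, the four percentages may not sum to exactly 100.0 for real data.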
Shen, Kang-Ning; Yen, Ta-Chi; Chen, Ching-Hung; Li, Huei-Ying; Chen, Pei-Lung; Hsiao, Chung-Der
2016-05-01
In this study, the complete mitogenome of the Northwestern Pacific 2 (NWP2) cryptic species of flathead mullet, Mugil cephalus (Teleostei: Mugilidae), was amplified by long-range PCR and sequenced by a next-generation sequencing method. The assembled mitogenome, consisting of 16,686 bp, had the typical vertebrate mitochondrial gene arrangement, including 13 protein-coding genes, 22 transfer RNA genes, 2 ribosomal RNA genes and a non-coding control region (D-loop). The D-loop was 909 bp long and was located between tRNA-Pro and tRNA-Phe. The overall base composition of NWP2 M. cephalus was 28.4% A, 29.8% C, 26.5% T and 15.3% G. The complete mitogenome may provide essential DNA molecular data for further phylogenetic and evolutionary analysis of the flathead mullet species complex.
Gene and genon concept: coding versus regulation
2007-01-01
We analyse here the definition of the gene in order to distinguish, on the basis of modern insight in molecular biology, what the gene is coding for, namely a specific polypeptide, and how its expression is realized and controlled. Before the coding role of the DNA was discovered, a gene was identified with a specific phenotypic trait, from Mendel through Morgan up to Benzer. Subsequently, however, molecular biologists ventured to define a gene at the level of the DNA sequence in terms of coding. As is becoming ever more evident, the relations between information stored at DNA level and functional products are very intricate, and the regulatory aspects are as important and essential as the information coding for products. This approach led, thus, to a conceptual hybrid that confused coding, regulation and functional aspects. In this essay, we develop a definition of the gene that once again starts from the functional aspect. A cellular function can be represented by a polypeptide or an RNA. In the case of the polypeptide, its biochemical identity is determined by the mRNA prior to translation, and that is where we locate the gene. The steps from specific, but possibly separated sequence fragments at DNA level to that final mRNA then can be analysed in terms of regulation. For that purpose, we coin the new term “genon”. In that manner, we can clearly separate product and regulative information while keeping the fundamental relation between coding and function without the need to introduce a conceptual hybrid. In mRNA, the program regulating the expression of a gene is superimposed onto and added to the coding sequence in cis - we call it the genon. The complementary external control of a given mRNA by trans-acting factors is incorporated in its transgenon. A consequence of this definition is that, in eukaryotes, the gene is, in most cases, not yet present at DNA level. 
Rather, it is assembled by RNA processing, including differential splicing, from various pieces, as steered by the genon. It emerges finally as an uninterrupted nucleic acid sequence at mRNA level just prior to translation, in faithful correspondence with the amino acid sequence to be produced as a polypeptide. After translation, the genon has fulfilled its role and expires. The distinction between the protein coding information as materialised in the final polypeptide and the processing information represented by the genon allows us to set up a new information theoretic scheme. The standard sequence information determined by the genetic code expresses the relation between coding sequence and product. Backward analysis asks from which coding region in the DNA a given polypeptide originates. The (more interesting) forward analysis asks in how many polypeptides of how many different types a given DNA segment is expressed. This concerns the control of the expression process for which we have introduced the genon concept. Thus, the information theoretic analysis can capture the complementary aspects of coding and regulation, of gene and genon. PMID:18087760
Generate Optimized Genetic Rhythm for Enzyme Expression in Non-native systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
2016-11-03
Most amino acids are represented by more than one codon, resulting in redundancy in the genetic code. Silent codon substitutions that do not alter the amino acid sequence still have an effect on protein expression. We have developed an algorithm, GoGREEN, to enhance the expression of foreign proteins in a host organism. GoGREEN selects codons according to frequency patterns seen in the gene of interest using the codon usage table from the host organism. GoGREEN is also designed to accommodate gaps in the sequence. This software takes as input (1) the aligned protein sequences for genes the user wishes to express, (2) the codon usage table for the host organism, and (3) the DNA sequence for the target protein found in the host organism. The program will select codons based on codon usage patterns for the target DNA sequence. The program will also select codons for "gaps" found in the aligned protein sequences using the codon usage table from the host organism.
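One way to read the description above is as rank matching: at each aligned position, note how common the host gene's codon is among its synonymous alternatives, then pick a codon of the same rank for the foreign amino acid. The sketch below illustrates that idea only. It is not the GoGREEN implementation; the usage values are invented toy numbers, the names `HOST_USAGE` and `choose_codons` are hypothetical, and treating a gap as "use the most frequent host codon" is an assumption.

```python
# Toy host codon usage (arbitrary illustrative frequencies), keyed by amino acid.
HOST_USAGE = {
    "L": {"CTG": 50.0, "CTC": 20.0, "TTA": 5.0},
    "K": {"AAA": 35.0, "AAG": 12.0},
    "S": {"AGC": 16.0, "TCT": 9.0},
}

# Reverse lookup: which amino acid each codon in the toy table encodes.
CODON_TO_AA = {codon: aa for aa, table in HOST_USAGE.items() for codon in table}


def ranked_codons(amino_acid):
    """Synonymous codons for one amino acid, most frequent in the host first."""
    table = HOST_USAGE[amino_acid]
    return sorted(table, key=table.get, reverse=True)


def choose_codons(protein, host_codons):
    """Pick a host codon for each residue, mirroring the rank of the aligned host codon.

    protein     -- aligned foreign protein sequence (one letter per position)
    host_codons -- aligned host codons, with None where the host gene has a gap
    """
    if len(protein) != len(host_codons):
        raise ValueError("protein and host_codons must be aligned to the same length")
    chosen = []
    for amino_acid, host_codon in zip(protein, host_codons):
        options = ranked_codons(amino_acid)
        if host_codon is None:
            rank = 0  # assumption: gaps fall back to the most frequent codon
        else:
            host_aa = CODON_TO_AA[host_codon]
            rank = ranked_codons(host_aa).index(host_codon)
        # Clamp the rank when the foreign residue has fewer synonymous codons.
        chosen.append(options[min(rank, len(options) - 1)])
    return "".join(chosen)
```

A real table would cover all 61 sense codons, and the gap rule could instead sample codons in proportion to their host frequencies.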
Vlahovicek, K; Munteanu, M G; Pongor, S
1999-01-01
Bending is a local conformational micropolymorphism of DNA in which the original B-DNA structure is only distorted but not extensively modified. Bending can be predicted by simple static geometry models as well as by a recently developed elastic model that incorporates sequence-dependent anisotropic bendability (SDAB). The SDAB model qualitatively explains phenomena including affinity of protein binding, kinking, as well as sequence-dependent vibrational properties of DNA. The vibrational properties of DNA segments can be studied by finite element analysis of a model subjected to an initial bending moment. The frequency spectrum is obtained by applying Fourier analysis to the displacement values in the time domain. This analysis shows that the spectrum of the bending vibrations depends quite sensitively on the sequence; for example, the spectrum of a curved sequence is characteristically different from the spectrum of straight sequence motifs of identical basepair composition. Curvature distributions are genome-specific, and pronounced differences are found between protein-coding and regulatory regions: sites of extreme curvature and/or bendability are less frequent in protein-coding regions. A WWW server is set up for the prediction of curvature and generation of 3D models from DNA sequences (http://www.icgeb.trieste.it/dna).
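The step from time-domain displacements to a frequency spectrum can be illustrated with a plain discrete Fourier transform. This is a minimal sketch using only the standard library, assuming evenly spaced samples. The function name `amplitude_spectrum` is hypothetical, and the input here is a synthetic sine wave rather than output from the finite element model described in the abstract.

```python
import cmath
import math


def amplitude_spectrum(samples):
    """Return DFT magnitudes (divided by N) for frequency bins 0 .. N//2."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        # Direct O(N^2) DFT; an FFT routine would be preferred for long series.
        coefficient = sum(
            value * cmath.exp(-2j * math.pi * k * t / n)
            for t, value in enumerate(samples)
        )
        spectrum.append(abs(coefficient) / n)
    return spectrum


# Synthetic displacement series: 3 full oscillations over 32 samples.
displacements = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
spectrum = amplitude_spectrum(displacements)
dominant_bin = spectrum.index(max(spectrum))
```

Comparing such spectra between a curved and a straight sequence of the same composition is the kind of contrast the abstract describes.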
Context influences on TALE–DNA binding revealed by quantitative profiling
Rogers, Julia M.; Barrera, Luis A.; Reyon, Deepak; Sander, Jeffry D.; Kellis, Manolis; Joung, J Keith; Bulyk, Martha L.
2015-06-11
Transcription activator-like effector (TALE) proteins recognize DNA using a seemingly simple DNA-binding code, which makes them attractive for use in genome engineering technologies that require precise targeting. Although this code is used successfully to design TALEs to target specific sequences, off-target binding has been observed and is difficult to predict. Here we explore TALE–DNA interactions comprehensively by quantitatively assaying the DNA-binding specificities of 21 representative TALEs to ∼5,000–20,000 unique DNA sequences per protein using custom-designed protein-binding microarrays (PBMs). We find that protein context features exert significant influences on binding. Thus, the canonical recognition code does not fully capture the complexity of TALE–DNA binding. We used the PBM data to develop a computational model, Specificity Inference For TAL-Effector Design (SIFTED), to predict the DNA-binding specificity of any TALE. We provide SIFTED as a publicly available web tool that predicts potential genomic off-target sites for improved TALE design. PMID:26067805
Molecular characterization of Banana streak virus isolate from Musa Acuminata in China.
Zhuang, Jun; Wang, Jian-Hua; Zhang, Xin; Liu, Zhi-Xin
2011-12-01
Banana streak virus (BSV), a member of genus Badnavirus, is a causal agent of banana streak disease throughout the world. The genetic diversity of BSVs from different regions of banana plantations has previously been investigated, but there are relatively few reports of the genetic characteristics of episomal (non-integrated) BSV genomes isolated from China. Here, the complete genome, a total of 7722 bp (GenBank accession number DQ092436), of an isolate of BSV on cultivar Cavendish (BSAcYNV) in Yunnan, China was determined. The genome is organised in the manner typical of badnaviruses. The intergenic region of genomic DNA contains a large stem-loop, which may contribute to the ribosome shift into the following open reading frames (ORFs). The coding region of BSAcYNV consists of three overlapping ORFs: ORF1 (with a non-AUG start codon) and ORF2, which encode two small proteins individually involved in viral movement, and ORF3, which encodes a polyprotein. Besides the complete genome, a defective genome of 6525 bp, lacking the whole RNA leader region and a majority of ORF1, was also isolated and sequenced from this BSV DNA reservoir in infected banana plants. Sequence analyses showed that BSAcYNV is most similar, in terms of genome organization and coding assignments, to a BSV isolate from Vietnam (BSAcVNV). The corresponding coding regions shared identities of 88% and ~95% at nucleotide and amino acid levels, respectively. Phylogenetic analysis also indicated that BSAcYNV shared the closest geographical evolutionary relationship to BSAcVNV among sequenced banana streak badnaviruses.
Stable CoT-1 repeat RNA is abundant and associated with euchromatic interphase chromosomes
Hall, Lisa L.; Carone, Dawn M.; Gomez, Alvin; Kolpa, Heather J.; Byron, Meg; Mehta, Nitish; Fackelmayer, Frank O.; Lawrence, Jeanne B.
2014-01-01
SUMMARY Recent studies recognize a vast diversity of non-coding RNAs with largely unknown functions, but few have examined interspersed repeat sequences, which constitute almost half our genome. RNA hybridization in situ using CoT-1 (highly repeated) DNA probes detects surprisingly abundant euchromatin-associated RNA comprised predominantly of repeat sequences (“CoT-1 RNA”), including LINE-1. CoT-1-hybridizing RNA strictly localizes to the interphase chromosome territory in cis, and remains stably associated with the chromosome territory following prolonged transcriptional inhibition. The CoT-1 RNA territory resists mechanical disruption and fractionates with the non-chromatin scaffold, but can be experimentally released. Loss of repeat-rich, stable nuclear RNAs from euchromatin corresponds to aberrant chromatin distribution and condensation. CoT-1 RNA has several properties similar to XIST chromosomal RNA, but is excluded from chromatin condensed by XIST. These findings impact two “black boxes” of genome science: the poorly understood diversity of non-coding RNA and the unexplained abundance of repetitive elements. PMID:24581492
Réfega, Susana; Girard-Misguich, Fabienne; Bourdieu, Christiane; Péry, Pierre; Labbé, Marie
2003-04-02
Specific antibodies were produced ex vivo from intestinal culture of Eimeria tenella infected chickens. The specificity of these intestinal antibodies was tested against different parasite stages. These antibodies were used to immunoscreen first generation schizont and sporozoite cDNA libraries permitting the identification of new E. tenella antigens. We obtained a total of 119 cDNA clones which were subjected to sequence analysis. The sequences coding for the proteins inducing local immune responses were compared with nucleotide or protein databases and with expressed sequence tags (ESTs) databases. We identified new Eimeria genes coding for heat shock proteins, a ribosomal protein, a pyruvate kinase and a pyridoxine kinase. Specific features of other sequences are discussed.
Greenberg, Jay R.; Perry, Robert P.
1971-01-01
The relationship of the DNA sequences from which polyribosomal messenger RNA (mRNA) and heterogeneous nuclear RNA (NRNA) of mouse L cells are transcribed was investigated by means of hybridization kinetics and thermal denaturation of the hybrids. Hybridization was performed in formamide solutions at DNA excess. Under these conditions most of the hybridizing mRNA and NRNA react at values of D₀t (DNA concentration multiplied by time) expected for RNA transcribed from the nonrepeated or rarely repeated fraction of the genome. However, a fraction of both mRNA and NRNA hybridize at values of D₀t about 10,000 times lower, and therefore must be transcribed from highly redundant DNA sequences. The fraction of NRNA hybridizing to highly repeated sequences is about 1.7 times greater than the corresponding fraction of mRNA. The hybrids formed by the rapidly reacting fractions of both NRNA and mRNA melt over a narrow temperature range with a midpoint about 11°C below that of native L cell DNA. This indicates that these hybrids consist of partially complementary sequences with approximately 11% mismatching of bases. Hybrids formed by the slowly reacting fraction of NRNA melt within 4°–6°C of native DNA, indicating very little, if any, mismatching of bases. Hybrids of the slowly reacting components of mRNA, formed under conditions of sufficiently low RNA input, have a high thermal stability, similar to that observed for hybrids of the slowly reacting NRNA component. However, when higher inputs of mRNA are used, hybrids are formed which have a strikingly lower thermal stability. This observation can be explained by assuming that there is sufficient similarity among the relatively rare DNA sequences coding for mRNA so that under hybridization conditions, in which these DNA sequences are not truly in excess, reversible hybrids exhibiting a considerable amount of mispairing are formed.
The fact that a comparable phenomenon has not been observed for NRNA may mean that there is less similarity among the relatively rare DNA sequences coding for NRNA than there is among the rare sequences coding for mRNA. PMID:4999767
Variation in conserved non-coding sequences on chromosome 5q and susceptibility to asthma and atopy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Donfack, Joseph; Schneider, Daniel H.; Tan, Zheng
2005-09-10
Background: Evolutionarily conserved sequences likely have biological function. Methods: To determine whether variation in conserved sequences in non-coding DNA contributes to risk for human disease, we studied six conserved non-coding elements in the Th2 cytokine cluster on human chromosome 5q31 in a large Hutterite pedigree and in samples of outbred European American and African American asthma cases and controls. Results: Among six conserved non-coding elements (>100 bp, >70 percent identity; human-mouse comparison), we identified one single nucleotide polymorphism (SNP) in each of two conserved elements and six SNPs in the flanking regions of three conserved elements. We genotyped our samples for four of these SNPs and an additional three SNPs each in the IL13 and IL4 genes. While there was only modest evidence for association with single SNPs in the Hutterite and European American samples (P < 0.05), there were highly significant associations in European Americans between asthma and haplotypes comprised of SNPs in the IL4 gene (P < 0.001), including a SNP in a conserved non-coding element. Furthermore, variation in the IL13 gene was strongly associated with total IgE (P = 0.00022) and allergic sensitization to mold allergens (P = 0.00076) in the Hutterites, and more modestly associated with sensitization to molds in the European Americans and African Americans (P < 0.01). Conclusion: These results indicate that there is overall little variation in the conserved non-coding elements on 5q31, but variation in IL4 and IL13, including possibly one SNP in a conserved element, influences asthma and atopic phenotypes in diverse populations.
Mitochondrial DNA repairs double-strand breaks in yeast chromosomes.
Ricchetti, M; Fairhead, C; Dujon, B
1999-11-04
The endosymbiotic theory for the origin of eukaryotic cells proposes that genetic information can be transferred from mitochondria to the nucleus of a cell, and genes that are probably of mitochondrial origin have been found in nuclear chromosomes. Occasionally, short or rearranged sequences homologous to mitochondrial DNA are seen in the chromosomes of different organisms including yeast, plants and humans. Here we report a mechanism by which fragments of mitochondrial DNA, in single or tandem array, are transferred to yeast chromosomes under natural conditions during the repair of double-strand breaks in haploid mitotic cells. These repair insertions originate from noncontiguous regions of the mitochondrial genome. Our analysis of the Saccharomyces cerevisiae mitochondrial genome indicates that the yeast nuclear genome does indeed contain several short sequences of mitochondrial origin which are similar in size and composition to those that repair double-strand breaks. These sequences are located predominantly in non-coding regions of the chromosomes, frequently in the vicinity of retrotransposon long terminal repeats, and appear as recent integration events. Thus, colonization of the yeast genome by mitochondrial DNA is an ongoing process.
Resurrection of DNA Function In Vivo from an Extinct Genome
Pask, Andrew J.; Behringer, Richard R.; Renfree, Marilyn B.
2008-01-01
There is a burgeoning repository of information available from ancient DNA that can be used to understand how genomes have evolved and to determine the genetic features that defined a particular species. To assess the functional consequences of changes to a genome, a variety of methods are needed to examine extinct DNA function. We isolated a transcriptional enhancer element from the genome of an extinct marsupial, the Tasmanian tiger (Thylacinus cynocephalus or thylacine), obtained from 100 year-old ethanol-fixed tissues from museum collections. We then examined the function of the enhancer in vivo. Using a transgenic approach, it was possible to resurrect DNA function in transgenic mice. The results demonstrate that the thylacine Col2A1 enhancer directed chondrocyte-specific expression in this extinct mammalian species in the same way as its orthologue does in mice. While other studies have examined extinct coding DNA function in vitro, this is the first example of the restoration of extinct non-coding DNA and examination of its function in vivo. Our method using transgenesis can be used to explore the function of regulatory and protein-coding sequences obtained from any extinct species in an in vivo model system, providing important insights into gene evolution and diversity. PMID:18493600
Doddapaneni, Harshavardhan; Yao, Jiqiang; Lin, Hong; Walker, M Andrew; Civerolo, Edwin L
2006-01-01
Background: The Gram-negative, xylem-limited phytopathogenic bacterium Xylella fastidiosa is responsible for causing economically important diseases in grapevine, citrus and many other plant species. Despite its economic impact, relatively little is known about the genomic variations among strains isolated from different hosts and their influence on the population genetics of this pathogen. With the availability of genome sequence information for four strains, it is now possible to perform genome-wide analyses to identify and categorize such DNA variations and to understand their influence on strain functional divergence. Results: There are 1,579 genes and 194 non-coding homologous sequences present in the genomes of all four strains, representing a 76.2% conservation of the sequenced genome. About 60% of the X. fastidiosa unique sequences exist as tandem gene clusters of 6 or more genes. Multiple alignments identified 12,754 SNPs and 14,449 INDELs in the 1528 common genes and 20,779 SNPs and 10,075 INDELs in the 194 non-coding sequences. The average SNP frequency was 1.08 × 10⁻² per base pair of DNA and the average INDEL frequency was 2.06 × 10⁻² per base pair of DNA. On average, 60.33% of the SNPs were synonymous while 39.67% were non-synonymous. Mutations, primarily in the form of external INDELs, were the main type of sequence variation. The relative similarity between the strains was discussed according to the INDEL and SNP differences. The number of genes unique to each strain was 60 (9a5c), 54 (Dixon), 83 (Ann1) and 9 (Temecula-1). A subset of the strain-specific genes showed significant differences in terms of their codon usage and GC composition from the native genes, suggesting their xenologous origin. Tandem repeat analysis of the genomic sequences of the four strains identified associations of repeat sequences with hypothetical and phage-related functions.
Conclusion: INDELs and strain-specific genes have been identified as the main source of variation among strains, with individual strains showing different rates of genome evolution. Based on these genome comparisons, it appears that the Pierce's disease strain Temecula-1 genome represents the ancestral genome of X. fastidiosa. Results of this analysis are publicly available in the form of a web database. PMID:16948851
Vargas-Caro, Carolina; Bustamante, Carlos; Lamilla, Julio; Bennett, Michael B; Ovenden, Jennifer R
2016-07-01
The complete mitochondrial genome of the roughskin skate Dipturus trachyderma is described from 1 455 724 sequences obtained using Illumina NGS technology. Total length of the mitogenome was 16 909 base pairs, comprising 2 rRNAs, 13 protein-coding genes, 22 tRNAs and 2 non-coding regions. Phylogenetic analysis based on mtDNA revealed low genetic divergence among longnose skates, in particular those dwelling on the continental shelf and slope off the coasts of Chile and Argentina.
Specific minor groove solvation is a crucial determinant of DNA binding site recognition
Harris, Lydia-Ann; Williams, Loren Dean; Koudelka, Gerald B.
2014-01-01
The DNA sequence preferences of nearly all sequence specific DNA binding proteins are influenced by the identities of bases that are not directly contacted by protein. Discrimination between non-contacted base sequences is commonly based on the differential abilities of DNA sequences to allow narrowing of the DNA minor groove. However, the factors that govern the propensity of minor groove narrowing are not completely understood. Here we show that the differential abilities of various DNA sequences to support formation of a highly ordered and stable minor groove solvation network are a key determinant of non-contacted base recognition by a sequence-specific binding protein. In addition, disrupting the solvent network in the non-contacted region of the binding site alters the protein's ability to recognize contacted base sequences at positions 5–6 bases away. This observation suggests that DNA solvent interactions link contacted and non-contacted base recognition by the protein. PMID:25429976
The changing epitome of species identification – DNA barcoding
Ajmal Ali, M.; Gyulai, Gábor; Hidvégi, Norbert; Kerti, Balázs; Al Hemaid, Fahad M.A.; Pandey, Arun K.; Lee, Joongku
2014-01-01
The discipline of taxonomy (the science of naming and classifying organisms, the original bioinformatics and a basis for all biology) is fundamentally important in ensuring the quality of life of future human generations on the earth; yet over the past few decades, teaching and research funding in taxonomy have declined because its classical mode of practice often reduced the discipline to a matter of opinion. This gave rise to several problems and challenges, and the taxonomist became an endangered race in the era of genomics. Taxonomy has now suddenly become fashionable again owing to a revolutionary approach called DNA barcoding (a novel technology to provide rapid, accurate, and automated species identifications using short orthologous DNA sequences). In DNA barcoding, a complete data set can be obtained from a single specimen irrespective of morphological or life-stage characters. The core idea of DNA barcoding is based on the fact that highly conserved stretches of DNA, in either coding or non-coding regions, vary only to a very minor degree within a species during evolution. Sequences suggested to be useful in DNA barcoding include cytoplasmic mitochondrial DNA (e.g. cox1), chloroplast DNA (e.g. rbcL, trnL-F, matK, ndhF, and atpB rbcL), and nuclear DNA (ITS and housekeeping genes, e.g. gapdh). Plant DNA barcoding is now changing the epitome of species identification and thus ultimately helping in the molecularization of taxonomy, a need of the hour. The ‘DNA barcodes’ show promise in providing a practical, standardized, species-level identification tool that can be used for biodiversity assessment, life history and ecological studies, forensic analysis, and many more. PMID:24955007
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prody, C.A.; Zevin-Sonkin, D.; Gnatt, A.
1987-06-01
To study the primary structure and regulation of human cholinesterases, oligodeoxynucleotide probes were prepared according to a consensus peptide sequence present in the active site of both human serum pseudocholinesterase and Torpedo electric organ true acetylcholinesterase. Using these probes, the authors isolated several cDNA clones from λgt10 libraries of fetal brain and liver origins. These include 2.4-kilobase cDNA clones that code for a polypeptide containing a putative signal peptide and the N-terminal, active site, and C-terminal peptides of human BtChoEase, suggesting that they code either for BtChoEase itself or for a very similar but distinct fetal form of cholinesterase. In RNA blots of poly(A)⁺ RNA from the cholinesterase-producing fetal brain and liver, these cDNAs hybridized with a single 2.5-kilobase band. Blot hybridization to human genomic DNA revealed that these fetal BtChoEase cDNA clones hybridize with DNA fragments with a total length of 17.5 kilobases, and signal intensities indicated that these sequences are not present in many copies. Both the cDNA-encoded protein and its nucleotide sequence display striking homology to parallel sequences published for Torpedo AcChoEase. These findings demonstrate extensive homologies between the fetal BtChoEase encoded by these clones and other cholinesterases of various forms and species.
Alvarado, David M; Yang, Ping; Druley, Todd E; Lovett, Michael; Gurnett, Christina A
2014-06-01
Despite declining sequencing costs, few methods are available for cost-effective single-nucleotide polymorphism (SNP), insertion/deletion (INDEL) and copy number variation (CNV) discovery in a single assay. Commercially available methods require a high investment to a specific region and are only cost-effective for large samples. Here, we introduce a novel, flexible approach for multiplexed targeted sequencing and CNV analysis of large genomic regions called multiplexed direct genomic selection (MDiGS). MDiGS combines biotinylated bacterial artificial chromosome (BAC) capture and multiplexed pooled capture for SNP/INDEL and CNV detection of 96 multiplexed samples on a single MiSeq run. MDiGS is advantageous over other methods for CNV detection because pooled sample capture and hybridization to large contiguous BAC baits reduces sample and probe hybridization variability inherent in other methods. We performed MDiGS capture for three chromosomal regions consisting of ∼ 550 kb of coding and non-coding sequence with DNA from 253 patients with congenital lower limb disorders. PITX1 nonsense and HOXC11 S191F missense mutations were identified that segregate in clubfoot families. Using a novel pooled-capture reference strategy, we identified recurrent chromosome chr17q23.1q23.2 duplications and small HOXC 5' cluster deletions (51 kb and 12 kb). Given the current interest in coding and non-coding variants in human disease, MDiGS fulfills a niche for comprehensive and low-cost evaluation of CNVs, coding, and non-coding variants across candidate regions of interest. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Design and construction of functional AAV vectors.
Gray, John T; Zolotukhin, Serge
2011-01-01
Using the basic principles of molecular biology and laboratory techniques presented in this chapter, researchers should be able to create a wide variety of AAV vectors for both clinical and basic research applications. Basic vector design concepts are covered for both protein coding gene expression and small non-coding RNA gene expression cassettes. AAV plasmid vector backbones (available via AddGene) are described, along with critical sequence details for a variety of modular expression components that can be inserted as needed for specific applications. Protocols are provided for assembling the various DNA components into AAV vector plasmids in Escherichia coli, as well as for transferring these vector sequences into baculovirus genomes for large-scale production of AAV in the insect cell production system.
Cheng, Linzhao; Hansen, Nancy F.; Zhao, Ling; Du, Yutao; Zou, Chunlin; Donovan, Frank X.; Chou, Bin-Kuan; Zhou, Guangyu; Li, Shijie; Dowey, Sarah N.; Ye, Zhaohui; Chandrasekharappa, Settara C.; Yang, Huanming; Mullikin, James C.; Liu, P. Paul
2012-01-01
Summary The utility of induced pluripotent stem cells (iPSCs) as models to study diseases and as sources for cell therapy depends on the integrity of their genomes. Despite recent publications of DNA sequence variations in the iPSCs, the true scope of such changes for the entire genome is not clear. Here we report the whole-genome sequencing of three human iPSC lines derived from two cell types of an adult donor by episomal vectors. The vector sequence was undetectable in the deeply sequenced iPSC lines. We identified 1058–1808 heterozygous single nucleotide variants (SNVs), but no copy number variants, in each iPSC line. Six to twelve of these SNVs were within coding regions in each iPSC line, but ~50% of them are synonymous changes and the remaining are not selectively enriched for known genes associated with cancers. Our data thus suggest that episome-mediated reprogramming is not inherently mutagenic during integration-free iPSC induction. PMID:22385660
Huang, Ying; Chen, Shi-Yi; Deng, Feilong
2016-01-01
In silico analysis of DNA sequences is an important area of computational biology in the post-genomic era. Over the past two decades, computational approaches for ab initio prediction of gene structure from genome sequence alone have greatly facilitated our understanding of a variety of biological questions. Although the computational prediction of protein-coding genes is already well established, robustly finding non-coding RNA genes, such as miRNA and lncRNA, remains a challenge. The two main aspects of ab initio gene prediction are the computed values that describe sequence features and the algorithm used to train the discriminant function; different combinations of these are employed in various bioinformatic tools. Herein, we briefly review these well-characterized sequence features in eukaryote genomes and their applications to ab initio gene prediction. The main purpose of this article is to provide an overview for beginners who aim to develop related bioinformatic tools.
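As a concrete picture of what "computed values that describe sequence features" can look like, the sketch below computes two very common ones: GC content and the longest forward-strand open reading frame. These particular features and function names are chosen for illustration and are not taken from the review; a real predictor would combine many such values and feed them to a trained discriminant function.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def longest_orf(seq):
    """Longest ATG..stop ORF over the three forward reading frames.

    Returns the ORF as a string including its stop codon, or "" if none.
    The reverse strand is deliberately ignored to keep the sketch short.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > len(best):
                    best = seq[start:i + 3]
                start = None  # look for the next ORF in this frame
    return best


example = "CCATGAAATTTTAGGG"
print(gc_content(example), longest_orf(example))
```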
Cloning and expression of a cDNA coding for catalase from zebrafish (Danio rerio).
Ken, C F; Lin, C T; Wu, J L; Shaw, J F
2000-06-01
A full-length complementary DNA (cDNA) clone encoding a catalase was amplified by the rapid amplification of cDNA ends-polymerase chain reaction (RACE-PCR) technique from zebrafish (Danio rerio) mRNA. Nucleotide sequence analysis of this cDNA clone revealed that it comprised a complete open reading frame coding for 526 amino acid residues and that it had a molecular mass of 59 654 Da. The deduced amino acid sequence showed high similarity with the sequences of catalase from swine (86.9%), mouse (85.8%), rat (85%), human (83.7%), fruit fly (75.6%), nematode (71.1%), and yeast (58.6%). The amino acid residues for secondary structures are apparently conserved as they are present in other mammal species. Furthermore, the coding region of zebrafish catalase was introduced into an expression vector, pET-20b(+), and transformed into Escherichia coli expression host BL21(DE3)pLysS. A 60-kDa active catalase protein was expressed and detected by Coomassie blue staining as well as activity staining on polyacrylamide gel following electrophoresis.
Chloroplast DNA Structural Variation, Phylogeny, and Age of Divergence among Diploid Cotton Species.
Chen, Zhiwen; Feng, Kun; Grover, Corrinne E; Li, Pengbo; Liu, Fang; Wang, Yumei; Xu, Qin; Shang, Mingzhao; Zhou, Zhongli; Cai, Xiaoyan; Wang, Xingxing; Wendel, Jonathan F; Wang, Kunbo; Hua, Jinping
2016-01-01
The cotton genus (Gossypium spp.) contains 8 monophyletic diploid genome groups (A, B, C, D, E, F, G, K) and a single allotetraploid clade (AD). To gain insight into the phylogeny of Gossypium and molecular evolution of the chloroplast genome in this group, we performed a comparative analysis of 19 Gossypium chloroplast genomes, six reported here for the first time. Nucleotide distance in non-coding regions was about three times that of coding regions. As expected, distances were smaller within than among genome groups. Phylogenetic topologies based on nucleotide and indel data support for the resolution of the 8 genome groups into 6 clades. Phylogenetic analysis of indel distribution among the 19 genomes demonstrates contrasting evolutionary dynamics in different clades, with a parallel genome downsizing in two genome groups and a biased accumulation of insertions in the clade containing the cultivated cottons leading to large (for Gossypium) chloroplast genomes. Divergence time estimates derived from the cpDNA sequence suggest that the major diploid clades had diverged approximately 10 to 11 million years ago. The complete nucleotide sequences of 6 cpDNA genomes are provided, offering a resource for cytonuclear studies in Gossypium. PMID:27309527
Design pattern mining using distributed learning automata and DNA sequence alignment.
Esmaeilpour, Mansour; Naderifar, Vahideh; Shukur, Zarina
2014-01-01
Over the last decade, design patterns have been used extensively to generate reusable solutions to frequently encountered problems in software engineering and object-oriented programming. A design pattern is a repeatable software design solution that provides a template for solving various instances of a general problem. This paper describes a new method for pattern mining that isolates design patterns and the relationships between them, and a related tool, DLA-DNA, for all implemented patterns and all projects used for evaluation. DLA-DNA, which is based on distributed learning automata (DLA) and deoxyribonucleic acid (DNA) sequence alignment, achieves acceptable precision and recall compared with the other evaluated tools. The proposed method mines structural design patterns in object-oriented source code and extracts the strong and weak relationships between them, enabling analyzers and programmers to determine the dependency rate of each object, component, and other section of the code for parameter passing and modular programming. The proposed model detects design patterns, and the strengths of their relationships, better than the other available tools, namely Pinot, PTIDEJ and DPJF. The results demonstrate that, whether or not the source code is built in a standard way with respect to the design patterns, the results of the proposed method are close to those of DPJF and better than those of Pinot and PTIDEJ. The proposed model was tested on several source codes and compared with other related models and available tools; on average, its precision and recall are 20% and 9.6% higher than those of Pinot, 27% and 31% higher than those of PTIDEJ, and 3.3% and 2% higher than those of DPJF, respectively. The proposed method is organized in two steps: in the first step, elemental design patterns are identified; in the second step, these are composed to recognize actual design patterns.
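Since the comparison above is stated in terms of precision and recall, a minimal sketch of how those two measures are computed for a pattern detector may help. The function name and the toy pattern labels are hypothetical; they are not output from DLA-DNA, Pinot, PTIDEJ or DPJF.

```python
def precision_recall(detected, actual):
    """Precision and recall of a set of detected items against ground truth.

    precision = true positives / everything the tool reported
    recall    = true positives / everything that is really there
    """
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Toy example: pattern instances labelled "Pattern:Class" (made-up data).
reported = {"Adapter:A", "Composite:B", "Proxy:C", "Bridge:D"}
ground_truth = {"Adapter:A", "Composite:B", "Decorator:E"}

precision, recall = precision_recall(reported, ground_truth)
print(precision, recall)
```

Here two of the four reported instances are correct (precision 0.5) and two of the three real instances are found (recall about 0.67).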
Epigenetics of Peripheral B-Cell Differentiation and the Antibody Response
Zan, Hong; Casali, Paolo
2015-01-01
Epigenetic modifications, such as histone post-translational modifications, DNA methylation, and alteration of gene expression by non-coding RNAs, including microRNAs (miRNAs) and long non-coding RNAs (lncRNAs), are heritable changes that are independent of the genomic DNA sequence. These regulate gene activities and, therefore, cellular functions. Epigenetic modifications act in concert with transcription factors and play critical roles in B cell development and differentiation, thereby modulating antibody responses to foreign- and self-antigens. Upon antigen encounter by mature B cells in the periphery, alterations of the epigenetic landscape of these lymphocytes are induced by the same stimuli that drive the antibody response. Such alterations instruct B cells to undergo immunoglobulin (Ig) class switch DNA recombination (CSR) and somatic hypermutation (SHM), as well as differentiation to memory B cells or long-lived plasma cells for the immune memory. Inducible histone modifications, together with DNA methylation and miRNAs, modulate the transcriptome, particularly the expression of activation-induced cytidine deaminase, which is essential for CSR and SHM, and factors central to plasma cell differentiation, such as B lymphocyte-induced maturation protein-1. These inducible B cell-intrinsic epigenetic marks guide the maturation of antibody responses. Combinatorial histone modifications also function as histone codes to target CSR and, possibly, SHM machinery to the Ig loci by recruiting specific adaptors that can stabilize CSR/SHM factors. In addition, lncRNAs, such as the recently reported lncRNA-CSR and an lncRNA generated through transcription of the S region that forms G-quadruplex structures, are also important for CSR targeting.
Epigenetic dysregulation in B cells, including the aberrant expression of non-coding RNAs and alterations of histone modifications and DNA methylation, can result in aberrant antibody responses to foreign antigens, such as those on microbial pathogens, and generation of pathogenic autoantibodies, IgE in allergic reactions, as well as B cell neoplasia. Epigenetic marks would be attractive targets for new therapeutics for autoimmune and allergic diseases, and B cell malignancies. PMID:26697022
AP1 Keeps Chromatin Poised for Action | Center for Cancer Research
The human genome harbors gene-encoding DNA, the blueprint for building proteins that regulate cellular function. Embedded across the genome, in non-coding regions, are DNA elements to which regulatory factors bind. The interaction of regulatory factors with DNA at these sites modifies gene expression to modulate cell activity. In cells, DNA exists in a complex with proteins called chromatin that compacts the DNA in the nucleus, strongly restricting access to DNA sequences. As a result, regulatory factors only interact with a small subset of their potential binding elements in a given cell to regulate genes. How factors recognize and select sites in chromatin across the genome is not well understood -- but several discoveries in CCR’s Laboratory of Receptor Biology and Gene Expression (LRBGE) have shed light on the mechanisms that direct factors to DNA.
DDM1 represses noncoding RNA expression and RNA-directed DNA methylation in heterochromatin.
Tan, Feng; Lu, Yue; Jiang, Wei; Zhao, Yu; Wu, Tian; Zhang, Ruoyu; Zhou, Dao-Xiu
2018-05-24
Cytosine methylation of DNA, which occurs at CG, CHG, and CHH (H=A, C, or T) sequences in plants, is a hallmark for epigenetic repression of repetitive sequences. The chromatin remodeling factor DECREASE IN DNA METHYLATION1 (DDM1) is essential for DNA methylation, especially at CG and CHG sequences. However, its potential role in RNA-directed DNA methylation (RdDM) and in chromatin function is not completely understood in rice (Oryza sativa). In this work, we used high-throughput approaches to study the function of rice DDM1 (OsDDM1) in RdDM and the expression of non-coding RNA (ncRNA). We show that loss of function of OsDDM1 results in ectopic CHH methylation of transposable elements and repeats. The ectopic CHH methylation was dependent on rice DOMAINS REARRANGED METHYLTRANSFERASE2 (OsDRM2), a DNA methyltransferase involved in RdDM. Mutations in OsDDM1 lead to decreases of histone H3K9me2 and increases in the levels of heterochromatic small RNA (sRNA) and long noncoding RNA (lncRNA). In particular, OsDDM1 was found to be essential to repress transcription of the two repetitive sequences, Centromeric Retrotransposons of Rice1 (CRR1) and the dominant centromeric CentO repeats. These results suggest that OsDDM1 antagonizes RdDM at heterochromatin and represses tissue-specific expression of ncRNA from repetitive sequences in the rice genome. © 2018 American Society of Plant Biologists. All rights reserved.
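The three methylation contexts named in this abstract (CG, CHG and CHH, with H = A, C or T) can be illustrated with a short sketch. This is not the authors' pipeline; the function names `cytosine_context` and `count_contexts` are hypothetical, and the sketch assumes that only the forward strand is scanned.

```python
from collections import Counter
from typing import Optional

H_BASES = "ACT"  # H = A, C, or T, as defined in the abstract


def cytosine_context(seq: str, i: int) -> Optional[str]:
    """Return 'CG', 'CHG', or 'CHH' for the cytosine at index i, else None.

    Assumption: only the forward strand is inspected. A cytosine too close
    to the 3' end to be classified returns None.
    """
    window = seq[i:i + 3].upper()
    if not window or window[0] != "C":
        return None
    if len(window) >= 2 and window[1] == "G":
        return "CG"
    if len(window) < 3 or window[1] not in H_BASES:
        return None
    if window[2] == "G":
        return "CHG"
    if window[2] in H_BASES:
        return "CHH"
    return None


def count_contexts(seq: str) -> Counter:
    """Count forward-strand cytosines in each methylation context."""
    counts = Counter()
    for i in range(len(seq)):
        ctx = cytosine_context(seq, i)
        if ctx is not None:
            counts[ctx] += 1
    return counts
```

A strand-aware analysis would also scan the reverse complement; that is omitted here to keep the sketch short.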
Liu, Zhandong; Venkatesh, Santosh S; Maley, Carlo C
2008-01-01
Background Genomes store information for building and maintaining organisms. Complete sequencing of many genomes provides the opportunity to study and compare global information properties of those genomes. Results We have analyzed aspects of the information content of Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Saccharomyces cerevisiae, and Escherichia coli (K-12) genomes. Virtually all possible (> 98%) 12 bp oligomers appear in vertebrate genomes while < 2% of 19 bp oligomers are present. Other species showed different ranges of > 98% to < 2% of possible oligomers in D. melanogaster (12–17 bp), C. elegans (11–17 bp), A. thaliana (11–17 bp), S. cerevisiae (10–16 bp) and E. coli (9–15 bp). Frequencies of unique oligomers in the genomes follow similar patterns. We identified a set of 2.6 M 15-mers that are more than 1 nucleotide different from all 15-mers in the human genome and so could be used as probes to detect microbes in human samples. In a human sample, these probes would detect 100% of the 433 currently fully sequenced prokaryotes and 75% of the 3065 fully sequenced viruses. The human genome is significantly more compact in sequence space than a random genome. We identified the most frequent 5- to 20-mers in the human genome, which may prove useful as PCR primers. We also identified a bacterium, Anaeromyxobacter dehalogenans, which has an exceptionally low diversity of oligomers given the size of its genome and its GC content. The entropy of coding regions in the human genome is significantly higher than that of non-coding regions and chromosomes. However, chromosomes 1, 2, 9, 12 and 14 have a relatively high proportion of coding DNA without high entropy, and chromosome 20 is the opposite, with a low frequency of coding regions but relatively high entropy.
Conclusion Measures of the frequency of oligomers are useful for designing PCR assays and for identifying chromosomes and organisms with hidden structure that had not been previously recognized. This information may be used to detect novel microbes in human tissues. PMID:18973670
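The oligomer-coverage statistic reported in this abstract (the fraction of all possible k-bp oligomers that occur in a genome) can be sketched as follows. This is a toy illustration, not the authors' implementation: the name `kmer_coverage` is hypothetical, only the forward strand is scanned, and windows containing characters other than A, C, G or T are skipped by assumption.

```python
ALPHABET = frozenset("ACGT")


def kmer_coverage(genome: str, k: int) -> float:
    """Fraction of the 4**k possible k-mers seen at least once.

    Assumptions: forward strand only; any window containing a character
    outside A/C/G/T (for example N) is ignored.
    """
    genome = genome.upper()
    seen = set()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if ALPHABET.issuperset(kmer):
            seen.add(kmer)
    return len(seen) / 4 ** k


if __name__ == "__main__":
    toy = "ACGTACGTTTGACCA"
    for k in (1, 2, 3):
        print(k, kmer_coverage(toy, k))
```

Storing every observed k-mer in a Python set is only practical for small inputs; a genome-scale analysis would presumably need a more compact encoding.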
High-Throughput Sequencing of Three Lemnoideae (Duckweeds) Chloroplast Genomes from Total DNA
Wang, Wenqin; Messing, Joachim
2011-01-01
Background Chloroplast genomes provide a wealth of information for evolutionary and population genetic studies. Chloroplasts play a particularly important role in the adaptation of aquatic plants because these plants float on water and their major surface is exposed continuously to sunlight. The subfamily Lemnoideae represents such a collection of aquatic species, which, because of photosynthesis, are among the fastest growing plant species on earth. Methods We sequenced the chloroplast genomes from three different genera of Lemnoideae, Spirodela polyrhiza, Wolffiella lingulata and Wolffia australiana, by high-throughput DNA sequencing of genomic DNA using the SOLiD platform. Unfractionated total DNA contains high copies of plastid DNA so that sequences from the nucleus and mitochondria can easily be filtered computationally. Remaining sequence reads were assembled into contiguous sequences (contigs) using SOLiD software tools. Contigs were mapped to a reference genome of Lemna minor and gaps, selected by PCR, were sequenced on the ABI3730xl platform. Conclusions This combinatorial approach yielded whole genomic contiguous sequences in a cost-effective manner. More than 1,000-fold coverage of the chloroplast genome from total DNA was reached by the SOLiD platform in a single spot on a quadrant slide without purification. Comparative analysis indicated that the chloroplast genome was conserved in gene number and organization with respect to the reference genome of L. minor. However, higher nucleotide substitution and abundant deletions and insertions occurred in non-coding regions of these genomes, indicating greater genomic dynamics than expected from the comparison of other related species in the Pooideae. Notably, there was no transition bias over transversion in Lemnoideae. The data should have immediate applications in evolutionary biology and plant taxonomy with increased resolution and statistical power. PMID:21931804
Long-range correlation properties of coding and noncoding DNA sequences: GenBank analysis.
Buldyrev, S V; Goldberger, A L; Havlin, S; Mantegna, R N; Matsa, M E; Peng, C K; Simons, M; Stanley, H E
1995-05-01
An open question in computational molecular biology is whether long-range correlations are present in both coding and noncoding DNA or only in the latter. To answer this question, we consider all 33301 coding and all 29453 noncoding eukaryotic sequences--each of length larger than 512 base pairs (bp)--in the present release of the GenBank to determine whether there is any statistically significant distinction in their long-range correlation properties. Standard fast Fourier transform (FFT) analysis indicates that coding sequences have practically no correlations in the range from 10 bp to 100 bp (spectral exponent beta=0.00 +/- 0.04, where the uncertainty is two standard deviations). In contrast, for noncoding sequences, the average value of the spectral exponent beta is positive (0.16 +/- 0.05), which unambiguously shows the presence of long-range correlations. We also separately analyze the 874 coding and the 1157 noncoding sequences that have more than 4096 bp and find a larger region of power-law behavior. We calculate the probability that these two data sets (coding and noncoding) were drawn from the same distribution and we find that it is less than 10^-10. We obtain independent confirmation of these findings using the method of detrended fluctuation analysis (DFA), which is designed to treat sequences with statistical heterogeneity, such as DNA's known mosaic structure ("patchiness") arising from the nonstationarity of nucleotide concentration. The near-perfect agreement between the two independent analysis methods, FFT and DFA, increases the confidence in the reliability of our conclusion.
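The DFA procedure mentioned in this abstract can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names (`dna_walk`, `fluctuation`, `dfa_exponent`) are hypothetical, the purine/pyrimidine step rule is an assumption about how the sequence is mapped to a walk, and only first-order (linear) detrending over non-overlapping boxes is shown.

```python
import math


def dna_walk(seq):
    """Cumulative 'DNA walk': +1 for a purine (A, G), -1 otherwise.

    Assumption: any other character, including N, is treated as a pyrimidine.
    """
    walk, y = [], 0
    for base in seq.upper():
        y += 1 if base in "AG" else -1
        walk.append(y)
    return walk


def _linfit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fluctuation(walk, n):
    """Root-mean-square deviation of the walk from per-box linear trends."""
    n_boxes = len(walk) // n
    xs = list(range(n))
    total = 0.0
    for b in range(n_boxes):
        seg = walk[b * n:(b + 1) * n]
        slope, icpt = _linfit(xs, seg)
        total += sum((y - (slope * x + icpt)) ** 2 for x, y in zip(xs, seg))
    return math.sqrt(total / (n_boxes * n))


def dfa_exponent(seq, box_sizes):
    """Slope of log F(n) against log n over the given box sizes."""
    walk = dna_walk(seq)
    log_n = [math.log(n) for n in box_sizes]
    log_f = [math.log(fluctuation(walk, n)) for n in box_sizes]
    alpha, _ = _linfit(log_n, log_f)
    return alpha
```

For an uncorrelated sequence the exponent returned by `dfa_exponent` should be close to 0.5, and larger values indicate long-range correlation. A commonly quoted relation between this exponent and the spectral exponent is beta = 2*alpha - 1, which would be consistent with beta near 0 for uncorrelated sequences.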
Analysis of the mitochondrial genome of cheetahs (Acinonyx jubatus) with neurodegenerative disease.
Burger, Pamela A; Steinborn, Ralf; Walzer, Christian; Petit, Thierry; Mueller, Mathias; Schwarzenberger, Franz
2004-08-18
The complete mitochondrial genome of Acinonyx jubatus was sequenced and mitochondrial DNA (mtDNA) regions were screened for polymorphisms as candidates for the cause of a neurodegenerative demyelinating disease affecting captive cheetahs. The mtDNA reference sequences were established on the basis of the complete sequences of two diseased and two nondiseased animals as well as partial sequences of 26 further individuals. The A. jubatus mitochondrial genome is 17,047-bp long and shows a high sequence similarity (91%) to the domestic cat. Based on single nucleotide polymorphisms (SNPs) in the control region (CR) and pedigree information, the 18 myelopathic and 12 non-myelopathic cheetahs included in this study were classified into haplotypes I, II and III. In view of the phenotypic comparability of the neurodegenerative disease observed in cheetahs and human mtDNA-associated diseases, specific coding regions including the tRNAs leucine UUR, lysine, serine UCN, and partial complex I and V sequences were screened. We identified a heteroplasmic and a homoplasmic SNP at codon 507 in the subunit 5 (MTND5) of complex I. The heteroplasmic haplotype I-specific valine to methionine substitution represents a nonconservative amino acid change and was found in 11 myelopathic and eight non-myelopathic cheetahs with levels ranging from 29% to 79%. The homoplasmic conservative amino acid substitution valine to alanine was identified in two myelopathic animals of haplotype II. In addition, a synonymous SNP in the codon 76 of the MTND4L gene was found in the single haplotype III animal. The amino acid exchanges in the MTND5 gene were not associated with the occurrence of neurodegenerative disease in captive cheetahs.
Herrnstadt, Corinna; Elson, Joanna L; Fahy, Eoin; Preston, Gwen; Turnbull, Douglass M; Anderson, Christen; Ghosh, Soumitra S; Olefsky, Jerrold M; Beal, M Flint; Davis, Robert E; Howell, Neil
2002-05-01
The evolution of the human mitochondrial genome is characterized by the emergence of ethnically distinct lineages or haplogroups. Nine European, seven Asian (including Native American), and three African mitochondrial DNA (mtDNA) haplogroups have been identified previously on the basis of the presence or absence of a relatively small number of restriction-enzyme recognition sites or on the basis of nucleotide sequences of the D-loop region. We have used reduced-median-network approaches to analyze 560 complete European, Asian, and African mtDNA coding-region sequences from unrelated individuals to develop a more complete understanding of sequence diversity both within and between haplogroups. A total of 497 haplogroup-associated polymorphisms were identified, 323 (65%) of which were associated with one haplogroup and 174 (35%) of which were associated with two or more haplogroups. Approximately one-half of these polymorphisms are reported for the first time here. Our results confirm and substantially extend the phylogenetic relationships among mitochondrial genomes described elsewhere from the major human ethnic groups. Another important result is that there were numerous instances both of parallel mutations at the same site and of reversion (i.e., homoplasy). It is likely that homoplasy in the coding region will confound evolutionary analysis of small sequence sets. By a linkage-disequilibrium approach, additional evidence for the absence of human mtDNA recombination is presented here.
Su, Huei-Jiun; Hu, Jer-Ming
2012-01-01
Background and Aims The holoparasitic flowering plant Balanophora displays extreme floral reduction and was previously found to have enormous rate acceleration in the nuclear 18S rDNA region. So far, it remains unclear whether non-ribosomal, protein-coding genes of Balanophora also evolve in an accelerated fashion and whether the genes with high substitution rates retain their functionality. To tackle these issues, six different genes were sequenced from two Balanophora species and their rate variation and expression patterns were examined. Methods Sequences including nuclear PI, euAP3, TM6, LFY and RPB2 and mitochondrial matR were determined from two Balanophora spp. and compared with selected hemiparasitic species of Santalales and autotrophic core eudicots. Gene expression was detected for the six protein-coding genes and the expression patterns of the three B-class genes (PI, AP3 and TM6) were further examined across different organs of B. laxiflora using RT-PCR. Key Results Balanophora mitochondrial matR is highly accelerated in both nonsynonymous (dN) and synonymous (dS) substitution rates, whereas the rate variation of the nuclear genes LFY, PI, euAP3, TM6 and RPB2 is less dramatic. Significant dS increases were detected in Balanophora PI, TM6 and RPB2, and dN accelerations in euAP3. All of the protein-coding genes are expressed in inflorescences, indicative of their functionality. PI is restrictively expressed in tepals, synandria and floral bracts, whereas AP3 and TM6 are widely expressed in both male and female inflorescences. Conclusions Despite the observation that rates of sequence evolution are generally higher in Balanophora than in hemiparasitic species of Santalales and autotrophic core eudicots, the five nuclear protein-coding genes are functional and are evolving at a much slower rate than 18S rDNA.
The mechanism or mechanisms responsible for rapid sequence evolution and concomitant rate acceleration for 18S rDNA and matR are currently not well understood and require further study in Balanophora and other holoparasites. PMID:23041381
Dostie, Josée; Lemire, Edmond; Bouchard, Philippe; Field, Michael; Jones, Kristie; Lorenz, Birgit; Menten, Björn; Buysse, Karen; Pattyn, Filip; Friedli, Marc; Ucla, Catherine; Rossier, Colette; Wyss, Carine; Speleman, Frank; De Paepe, Anne; Dekker, Job; Antonarakis, Stylianos E.; De Baere, Elfride
2009-01-01
To date, the contribution of disrupted potentially cis-regulatory conserved non-coding sequences (CNCs) to human disease is most likely underestimated, as no systematic screens for putative deleterious variations in CNCs have been conducted. As a model for monogenic disease we studied the involvement of genetic changes of CNCs in the cis-regulatory domain of FOXL2 in blepharophimosis syndrome (BPES). Fifty-seven molecularly unsolved BPES patients underwent high-resolution copy number screening and targeted sequencing of CNCs. Apart from three larger distant deletions, a de novo deletion as small as 7.4 kb was found at 283 kb 5′ to FOXL2. The deletion appeared to be triggered by an H-DNA-induced double-stranded break (DSB). In addition, it disrupts a novel long non-coding RNA (ncRNA) PISRT1 and 8 CNCs. The regulatory potential of the deleted CNCs was substantiated by in vitro luciferase assays. Interestingly, Chromosome Conformation Capture (3C) of a 625 kb region surrounding FOXL2 in expressing cellular systems revealed physical interactions of three upstream fragments and the FOXL2 core promoter. Importantly, one of these contains the 7.4 kb deleted fragment. Overall, this study revealed the smallest distant deletion causing monogenic disease and impacts upon the concept of mutation screening in human disease and developmental disorders in particular. PMID:19543368
Jo, Yeong Deuk; Choi, Yoomi; Kim, Dong-Hwan; Kim, Byung-Dong; Kang, Byoung-Cheorl
2014-07-04
Cytoplasmic male sterility (CMS) is an inability to produce functional pollen that is caused by mutation of the mitochondrial genome. Comparative analyses of mitochondrial genomes of lines with and without CMS in several species have revealed structural differences between genomes, including extensive rearrangements caused by recombination. However, the mitochondrial genome structure and the DNA rearrangements that may be related to CMS have not been characterized in Capsicum spp. We obtained the complete mitochondrial genome sequences of the pepper CMS line FS4401 (507,452 bp) and the fertile line Jeju (511,530 bp). Comparative analysis between the mitochondrial genomes of pepper and tobacco, both members of the Solanaceae, revealed extensive DNA rearrangements and poor conservation in non-coding DNA. In the comparison between pepper lines, FS4401 and Jeju mitochondrial DNAs contained the same complement of protein-coding genes except for one additional copy of an atp6 gene (ψatp6-2) in FS4401. In terms of genome structure, we found eighteen syntenic blocks in the two mitochondrial genomes, which have been rearranged in each genome. By contrast, sequences between syntenic blocks, which were specific to each line, accounted for 30,380 and 17,847 bp in FS4401 and Jeju, respectively. The previously reported CMS candidate genes, orf507 and ψatp6-2, were located on the edges of the largest sequence segments that were specific to FS4401. In this region, a large number of small sequence segments that were absent from, or found at different locations in, the Jeju mitochondrial genome were combined together. The incorporation of repeats and the overlapping of connected sequence segments by a few nucleotides implied that extensive rearrangements by homologous recombination might be involved in the evolution of this region. Further analysis using mtDNA pairs from other plant species revealed common features of DNA regions around CMS-associated genes.
Although a large portion of the sequence context was shared by the mitochondrial genomes of CMS and male-fertile pepper lines, extensive genome rearrangements were detected. CMS candidate genes were located on the edges of highly rearranged CMS-specific DNA regions and near repeat sequences. These characteristics were also detected among CMS-associated genes in other species, implying that a common mechanism might be involved in the evolution of CMS-associated genes.
Cloning and characterization of a DNA polymerase beta gene from Trypanosoma cruzi.
Venegas, Juan A; Aslund, Lena; Solari, Aldo
2009-06-01
A gene coding for a DNA polymerase beta from the Trypanosoma cruzi Miranda clone, belonging to the TcI lineage, was cloned (Miranda Tcpol beta), using the information from eight peptides of the T. cruzi beta-like DNA polymerase purified previously. The gene encodes a protein of 403 amino acids which is very similar to the two published T. cruzi CL Brener (TcIIe lineage) sequences, but has three different residues in highly conserved segments. At the amino acid level, the identity of TcI-pol beta with mitochondrial pol beta and pol beta-PAK from other trypanosomatids was between 68-80% and 22-30%, respectively. Miranda Tc-pol beta protein has an N-terminal sequence similar to that described in the mitochondrial Crithidia fasciculata pol beta, which suggests that the TcI-pol beta plays a role in the organelle. Northern and Western analyses showed that this T. cruzi gene is highly expressed in both proliferative and non-proliferative developmental forms. These results suggest that, in addition to replication of kDNA in proliferative cells, this enzyme may have another function in non-proliferative cells, such as a DNA repair role similar to that extensively described in a wide spectrum of eukaryotic cells.
α satellite DNA variation and function of the human centromere
Sullivan, Lori L.; Chew, Kimberline
2017-01-01
Genomic variation is a source of functional diversity that is typically studied in genic and non-coding regulatory regions. However, the extent of variation within noncoding portions of the human genome, particularly highly repetitive regions, and the functional consequences are not well understood. Satellite DNA, including α satellite DNA found at human centromeres, comprises up to 10% of the genome, but is difficult to study because its repetitive nature hinders contiguous sequence assemblies. We recently described variation within α satellite DNA that affects centromere function. On human chromosome 17 (HSA17), we showed that size and sequence polymorphisms within primary array D17Z1 are associated with chromosome aneuploidy and defective centromere architecture. However, HSA17 can counteract this instability by assembling the centromere at a second, “backup” array lacking variation. Here, we discuss our findings in a broader context of human centromere assembly, and highlight areas of future study to uncover links between genomic and epigenetic features of human centromeres. PMID:28406740
Introduction to the Natural Anticipator and the Artificial Anticipator
NASA Astrophysics Data System (ADS)
Dubois, Daniel M.
2010-11-01
This short communication introduces the concept of the anticipator, which is one who anticipates, in the framework of computing anticipatory systems. The definition of anticipation deals with the concept of program. Indeed, the word program comes from "pro-gram", meaning "to write before" by anticipation, and means a plan for the programming of a mechanism, or a sequence of coded instructions that can be inserted into a mechanism, or a sequence of coded instructions, such as genes or behavioural responses, that is part of an organism. All natural and artificial programs are thus related to anticipatory rewriting systems, as shown in this paper. All the cells in the body, and the neurons in the brain, are programmed by the anticipatory genetic code, DNA, in a low-level language with four signs. The programs in computers are also computing anticipatory systems. It will be shown, on the one hand, that the genetic code DNA is a natural anticipator. As demonstrated by Nobel laureate McClintock [8], genomes are programmed. The fundamental program deals with the DNA genetic code. The properties of DNA consist of self-replication and self-modification. The self-replicating process leads to reproduction of the species, while the self-modifying process leads to new species or to evolution and adaptation in existing ones. The genetic code DNA keeps its instructions in memory in the DNA coding molecule. The genetic code DNA is a rewriting system, from the DNA coding molecule to the DNA template molecule. The DNA template molecule is a rewriting system to the messenger RNA molecule. The information is not destroyed during the execution of the rewriting program. On the other hand, it will be demonstrated that the Turing machine is an artificial anticipator. The Turing machine is a rewriting system. The head reads and writes, modifying the content of the tape. The information is destroyed during the execution of the program. This is an irreversible process. The input data are lost.
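The chain of rewriting steps described in this abstract (DNA coding strand to DNA template strand to messenger RNA) can be made concrete with a small sketch. It uses standard base pairing only; the function names `template_strand` and `transcribe` are hypothetical and are not taken from the paper.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}


def template_strand(coding: str) -> str:
    """Rewrite the coding strand as its reverse complement (5'->3')."""
    return "".join(DNA_COMPLEMENT[b] for b in reversed(coding.upper()))


def transcribe(template: str) -> str:
    """Rewrite the template strand as mRNA (5'->3'), pairing A with U."""
    return "".join(RNA_COMPLEMENT[b] for b in reversed(template.upper()))


if __name__ == "__main__":
    coding = "ATGGCC"
    template = template_strand(coding)
    mrna = transcribe(template)
    # Each step builds a new string, so the input is still available
    # afterwards, in line with the remark that the information is not
    # destroyed by the rewriting.
    print(coding, template, mrna)
```

Under these pairing rules the mRNA has the same sequence as the coding strand with U in place of T.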
Schmidt-Chanasit, Jonas; Bialonski, Alexandra; Heinemann, Patrick; Ulrich, Rainer G; Günther, Stephan; Rabenau, Holger F; Doerr, Hans Wilhelm
2010-07-01
Recently, two different herpes simplex virus type 2 (HSV-2) clades (A and B) were described on the basis of DNA sequence data of the glycoprotein E (gE), G (gG) and I (gI) genes. The objectives were to type the circulating HSV-2 wild-type strains in Germany by a novel approach and to monitor potential changes in the molecular epidemiology between 1997 and 2008. A total of 64 clinical HSV-2 isolates were analyzed by a novel approach using the DNA sequences of the complete open reading frames of glycoprotein B (gB) and gG. Recombination analysis of the gB and gG gene sequences was performed to reveal intragenic recombinants. Based on the phylogenetic analysis of the gB coding DNA sequence, 8 of 64 (12%) isolates were classified as clade A strains and 56 of 64 (88%) isolates were classified as clade B strains. Analysis of the gG coding DNA sequence classified 4 (6%) isolates as clade A strains and 60 (94%) isolates as clade B strains. In comparison, the 8 isolates classified as clade A strains using the gB sequence data were classified as clade B strains when using the gG coding DNA sequence, suggesting intergenic recombination events. Intragenic recombination events were not detected. The first molecular survey of clinical HSV-2 isolates from Germany demonstrated the circulation of clade A and B strains and of intergenic recombinants over a period of 12 years. Copyright (c) 2010 Elsevier B.V. All rights reserved.
2012-01-01
Background The feline genome is valuable to the veterinary and model organism genomics communities because the cat is an obligate carnivore and a model for endangered felids. The initial public release of the Felis catus genome assembly provided a framework for investigating the genomic basis of feline biology. However, the entire set of protein coding genes has not been elucidated. Results We identified and characterized 1227 protein coding feline sequences, of which 913 map to public sequences and 314 are novel. These sequences have been deposited into NCBI's genbank database and complement public genomic resources by providing additional protein coding sequences that fill in some of the gaps in the feline genome assembly. Through functional and comparative genomic analyses, we gained an understanding of the role of these sequences in feline development, nutrition and health. Specifically, we identified 104 orthologs of human genes associated with Mendelian disorders. We detected negative selection within sequences with gene ontology annotations associated with intracellular trafficking, cytoskeleton and muscle functions. We detected relatively less negative selection on protein sequences encoding extracellular networks, apoptotic pathways and mitochondrial gene ontology annotations. Additionally, we characterized feline cDNA sequences that have mouse orthologs associated with clinical, nutritional and developmental phenotypes. Together, this analysis provides an overview of the value of our cDNA sequences and enhances our understanding of how the feline genome is similar to, and different from other mammalian genomes. Conclusions The cDNA sequences reported here expand existing feline genomic resources by providing high-quality sequences annotated with comparative genomic information providing functional, clinical, nutritional and orthologous gene information. PMID:22257742
Epigenetics and obesity cardiomyopathy: From pathophysiology to prevention and management.
Zhang, Yingmei; Ren, Jun
2016-05-01
Uncorrected obesity has been associated with cardiac hypertrophy and contractile dysfunction. Several mechanisms for this cardiomyopathy have been identified, including oxidative stress, autophagy, adrenergic and renin-angiotensin aldosterone overflow. Another process that may regulate effects of obesity is epigenetics, which refers to the heritable alterations in gene expression or cellular phenotype that are not encoded on the DNA sequence. Advances in epigenome profiling have greatly improved the understanding of the epigenome in obesity, where environmental exposures during early life result in an increased health risk later on in life. Several mechanisms, including histone modification, DNA methylation and non-coding RNAs, have been reported in obesity and can cause transcriptional suppression or activation, depending on the location within the gene, contributing to obesity-induced complications. Through epigenetic modifications, the fetus may be prone to detrimental insults, leading to cardiac sequelae later in life. Important links between epigenetics and obesity include nutrition, exercise, adiposity, inflammation, insulin sensitivity and hepatic steatosis. Genome-wide studies have identified altered DNA methylation patterns in pancreatic islets, skeletal muscle and adipose tissues from obese subjects compared with non-obese controls. In addition, aging and intrauterine environment are associated with differential DNA methylation. Given the intense research on the molecular mechanisms of the etiology of obesity and its complications, this review will provide insights into the current understanding of epigenetics and pharmacological and non-pharmacological (such as exercise) interventions targeting epigenetics as they relate to treatment of obesity and its complications. Particular focus will be on DNA methylation, histone modification and non-coding RNAs. Copyright © 2016 Elsevier Inc. All rights reserved.
Aires-de-Sousa, João; Aires-de-Sousa, Luisa
2003-01-01
We propose representing individual positions in DNA sequences by virtual potentials generated by other bases of the same sequence. This is a compact representation of the neighbourhood of a base. The distribution of the virtual potentials over the whole sequence can be used as a representation of the entire sequence (SEQREP code). It is a flexible code, with a length independent of the sequence size, does not require previous alignment, and is convenient for processing by neural networks or statistical techniques. To evaluate its biological significance, the SEQREP code was used for training Kohonen self-organizing maps (SOMs) in two applications: (a) detection of Alu sequences, and (b) classification of sequences encoding for HIV-1 envelope glycoprotein (env) into subtypes A-G. It was demonstrated that SOMs clustered sequences belonging to different classes into distinct regions. For independent test sets, very high rates of correct predictions were obtained (97% in the first application, 91% in the second). Possible areas of application of SEQREP codes include functional genomics, phylogenetic analysis, detection of repetitions, database retrieval, and automatic alignment. Software for representing sequences by SEQREP code, and for training Kohonen SOMs is made freely available from http://www.dq.fct.unl.pt/qoa/jas/seqrep. Supplementary material is available at http://www.dq.fct.unl.pt/qoa/jas/seqrep/bioinf2002
Tamori, Akihiro; Yamanishi, Yoshihiro; Kawashima, Shuichi; Kanehisa, Minoru; Enomoto, Masaru; Tanaka, Hiromu; Kubo, Shoji; Shiomi, Susumu; Nishiguchi, Shuhei
2005-08-15
Integration of hepatitis B virus (HBV) DNA into the human genome is one of the most important steps in HBV-related carcinogenesis. This study attempted to find the link between HBV DNA, the adjoining cellular sequence, and altered gene expression in hepatocellular carcinoma (HCC) with integrated HBV DNA. We examined 15 cases of HCC infected with HBV by cassette ligation-mediated PCR. The human DNA adjacent to the integrated HBV DNA was sequenced. Protein coding sequences were searched for in the human sequence. In five cases with HBV DNA integration, from which good quality RNA was extracted, gene expression was examined by cDNA microarray analysis. The human DNA sequence successive to integrated HBV DNA was determined in the 15 HCCs. Eight protein-coding regions were involved: ras-responsive element binding protein 1, calmodulin 1, mixed lineage leukemia 2 (MLL2), FLJ333655, LOC220272, LOC255345, LOC220220, and LOC168991. The MLL2 gene was expressed in three cases with HBV DNA integrated into exon 3 of MLL2 and in one case with HBV DNA integrated into intron 3 of MLL2. Gene expression analysis suggested that two HCCs with HBV integrated into MLL2 had similar patterns of gene expression compared with three HCCs with HBV integrated into other loci of human chromosomes. HBV DNA was integrated at random sites of human DNA, and the MLL2 gene was one of the targets for integration. Our results suggest that HBV DNA might modulate human genes near integration sites, followed by integration site-specific expression of such genes during hepatocarcinogenesis.
Complete mitogenome sequencing and phylogenetic analysis of PaLi yak (Bos grunniens).
Bao, Pengjia; Guo, Xian; Pei, Jie; Liang, Chunnian; Ding, Xuezhi; Min, Chu; Wang, Hongbo; Wu, Xiaoyun; Yan, Ping
2016-11-01
PaLi yak is a very important local breed in China; as a year-round grazing animal, it plays a very important economic role for native herdsmen. The complete mitochondrial DNA of the PaLi yak was sequenced in this study; its total length is 16,324 bp, containing 13 protein-coding genes, 22 tRNA genes, 2 rRNA genes and a non-coding control region (D-loop region). The gene order and composition are similar to those of most other vertebrates. The base contents are: 33.72% A, 25.80% C, 13.21% G and 27.27% T; A + T (60.99%) was higher than G + C (39.01%). The phylogenetic relationships were analyzed using the complete mitogenome sequence, and the results showed that the genetic relationship between yak and cattle is distinct. This information provides useful data for further study on the protection of genetic resources and the taxonomy of Bovinae.
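The base-content figures above follow from simple counting. The sketch below is an illustration only: the helper names and the ten-base toy sequence are hypothetical, and the actual PaLi yak mitogenome is not reproduced here. It also checks that the percentages quoted in the abstract are internally consistent.

```python
from collections import Counter


def base_composition(seq):
    """Return the percentage of A, C, G and T among the unambiguous bases of seq."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


def at_content(composition):
    """A + T percentage from a composition dictionary."""
    return composition["A"] + composition["T"]


def gc_content(composition):
    """G + C percentage from a composition dictionary."""
    return composition["G"] + composition["C"]


# Hypothetical toy sequence, used only to exercise the functions.
toy = base_composition("AATTAATTCG")
print(toy, at_content(toy))

# Percentages as quoted in the abstract; the sums reproduce the reported totals.
reported = {"A": 33.72, "C": 25.80, "G": 13.21, "T": 27.27}
print(round(at_content(reported), 2), round(gc_content(reported), 2))
```

Ambiguity codes such as N are ignored in the denominator here; that is a design choice of this sketch, not something stated in the abstract.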
Shen, Kang-Ning; Yen, Ta-Chi; Chen, Ching-Hung; Ye, Jeng-Jia; Hsiao, Chung-Der
2016-05-01
In this study, the complete mitogenome sequence of the cryptic "lineage B" big-fin reef squid, Sepioteuthis lessoniana (Cephalopoda: Loliginidae), has been sequenced by a next-generation sequencing method. The assembled mitogenome, consisting of 16,694 bp, includes 13 protein-coding genes, 25 transfer RNA genes, and 2 ribosomal RNA genes. The overall base composition of "lineage B" S. lessoniana is 36.7% for A, 18.9% for C, 34.5% for T and 9.8% for G, and it shows 90% identity to "lineage C" S. lessoniana. It also exhibits high T + A content (71.2%) and two non-coding regions with TA tandem repeats. The complete mitogenome of the cryptic "lineage B" S. lessoniana provides essential DNA molecular data for further phylogeographic and evolutionary analysis of the big-fin reef squid species complex.
Hsiao, Chung-Der; Shen, Kang-Ning; Ching, Tzu-Yun; Wang, Ya-Hsien; Ye, Jeng-Jia; Tsai, Shiou-Yi; Wu, Shan-Chun; Chen, Ching-Hung; Wang, Chia-Hui
2016-07-01
In this study, the complete mitogenome sequence of the cryptic "lineage A" big-fin reef squid, Sepioteuthis lessoniana (Cephalopoda: Loliginidae), has been sequenced by a next-generation sequencing method. The assembled mitogenome consists of 16,605 bp, which includes 13 protein-coding genes, 22 transfer RNA genes, and 2 ribosomal RNA genes. The overall base composition of "lineage A" S. lessoniana is 37.5% for A, 17.4% for C, 9.1% for G, and 35.9% for T, and it shows 87% identity to "lineage C" S. lessoniana. It is also notable for its high T + A content (73.4%) and two non-coding regions with TA tandem repeats. The complete mitogenome of the cryptic "lineage A" S. lessoniana provides essential DNA molecular data for further phylogeographic and evolutionary analysis of the big-fin reef squid species complex.
Nakamura, Mikiko; Suzuki, Ayako; Akada, Junko; Tomiyoshi, Keisuke; Hoshida, Hisashi; Akada, Rinji
2015-12-01
Mammalian gene expression constructs are generally prepared in a plasmid vector, in which a promoter and terminator are located upstream and downstream of a protein-coding sequence, respectively. In this study, we found that front terminator constructs (DNA constructs containing a terminator upstream of a promoter rather than downstream of a coding region) could sufficiently express proteins as a result of end joining of the introduced DNA fragment. By taking advantage of front terminator constructs, FLAG substitutions and deletions were generated using mutagenesis primers to identify amino acids specifically recognized by commercial FLAG antibodies. A minimal epitope sequence for polyclonal FLAG antibody recognition was also identified. In addition, we analyzed the sequence of a C-terminal Ser-Lys-Leu peroxisome localization signal and identified the key residues necessary for peroxisome targeting. Moreover, front terminator constructs of hepatitis B surface antigen were used for deletion analysis, leading to the identification of regions required for particle formation. Collectively, these results indicate that front terminator constructs allow for easy manipulation of C-terminal protein-coding sequences, and suggest that direct gene expression with PCR-amplified DNA is useful for high-throughput protein analysis in mammalian cells.
Prody, C A; Zevin-Sonkin, D; Gnatt, A; Goldberg, O; Soreq, H
1987-01-01
To study the primary structure and regulation of human cholinesterases, oligodeoxynucleotide probes were prepared according to a consensus peptide sequence present in the active site of both human serum pseudocholinesterase (BtChoEase; EC 3.1.1.8) and Torpedo electric organ "true" acetylcholinesterase (AcChoEase; EC 3.1.1.7). Using these probes, we isolated several cDNA clones from lambda gt10 libraries of fetal brain and liver origins. These include 2.4-kilobase cDNA clones that code for a polypeptide containing a putative signal peptide and the N-terminal, active site, and C-terminal peptides of human BtChoEase, suggesting that they code either for BtChoEase itself or for a very similar but distinct fetal form of cholinesterase. In RNA blots of poly(A)+ RNA from the cholinesterase-producing fetal brain and liver, these cDNAs hybridized with a single 2.5-kilobase band. Blot hybridization to human genomic DNA revealed that these fetal BtChoEase cDNA clones hybridize with DNA fragments of the total length of 17.5 kilobases, and signal intensities indicated that these sequences are not present in many copies. Both the cDNA-encoded protein and its nucleotide sequence display striking homology to parallel sequences published for Torpedo AcChoEase. These findings demonstrate extensive homologies between the fetal BtChoEase encoded by these clones and other cholinesterases of various forms and species. PMID:3035536
Genome-wide prediction of cis-regulatory regions using supervised deep learning methods.
Li, Yifeng; Shi, Wenqiang; Wasserman, Wyeth W
2018-05-31
In the human genome, 98% of DNA sequences are non-protein-coding regions that were previously disregarded as junk DNA. In fact, non-coding regions host a variety of cis-regulatory regions which precisely control the expression of genes. Thus, identifying active cis-regulatory regions in the human genome is critical for understanding gene regulation and assessing the impact of genetic variation on phenotype. Developments in high-throughput sequencing and machine learning technologies make it possible to predict cis-regulatory regions genome wide. Based on rich data resources such as the Encyclopedia of DNA Elements (ENCODE) and the Functional Annotation of the Mammalian Genome (FANTOM) projects, we introduce DECRES, based on supervised deep learning approaches, for the identification of enhancer and promoter regions in the human genome. Due to their ability to discover patterns in large and complex data, the introduction of deep learning methods enables a significant advance in our knowledge of the genomic locations of cis-regulatory regions. Using models for well-characterized cell lines, we identify key experimental features that contribute to the predictive performance. Applying DECRES, we delineate locations of 300,000 candidate enhancers genome wide (6.8% of the genome, of which 40,000 are supported by bidirectional transcription data), and 26,000 candidate promoters (0.6% of the genome). The predicted annotations of cis-regulatory regions will provide broad utility for genome interpretation from functional genomics to clinical applications. The DECRES model demonstrates the potential of deep learning technologies when combined with high-throughput sequencing data, and inspires the development of other advanced neural network models for further improvement of genome annotations.
Johnston, Christine; Magaret, Amalia; Roychoudhury, Pavitra; Greninger, Alexander L; Cheng, Anqi; Diem, Kurt; Fitzgibbon, Matthew P; Huang, Meei-Li; Selke, Stacy; Lingappa, Jairam R; Celum, Connie; Jerome, Keith R; Wald, Anna; Koelle, David M
2017-10-01
Understanding the variability in circulating herpes simplex virus type 2 (HSV-2) genomic sequences is critical to the development of HSV-2 vaccines. Genital lesion swabs containing ≥ 10^7 log10 copies HSV DNA collected from Africa, the USA, and South America underwent next-generation sequencing, followed by K-mer based filtering and de novo genomic assembly. Sites of heterogeneity within coding regions in unique long and unique short (UL_US) regions were identified. Phylogenetic trees were created using maximum likelihood reconstruction. Among 46 samples from 38 persons, 1468 intragenic base-pair substitutions were identified. The maximum nucleotide distance between strains for concatenated UL_US segments was 0.4%. Phylogeny did not reveal geographic clustering. The most variable proteins had non-synonymous mutations in < 3% of amino acids. Unenriched HSV-2 DNA can undergo next-generation sequencing to identify intragenic variability. The use of clinical swabs for sequencing expands the information that can be gathered directly from these specimens. Copyright © 2017 Elsevier Inc. All rights reserved.
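A figure such as the 0.4% maximum nucleotide distance above is the kind of quantity obtained by comparing every pair of aligned strain sequences. The abstract does not state which distance measure was used, so the sketch below assumes the simplest one, the uncorrected proportion of differing sites (p-distance); the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Proportion of differing sites between two aligned sequences, skipping gapped columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # ignore alignment columns with a gap in either sequence
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def max_pairwise_distance(sequences):
    """Largest p-distance over all unordered pairs of aligned sequences."""
    return max(p_distance(a, b) for a, b in combinations(sequences, 2))


# Hypothetical toy alignment of four ten-base "strains".
toy_alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC"]
largest = max_pairwise_distance(toy_alignment)
print(f"{100 * largest:.1f}%")
```

Model-corrected distances would give slightly different values for divergent sequences, but at sub-percent divergence the uncorrected proportion is usually a close approximation.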
Investigation of a Sybr-Green-Based Method to Validate DNA Sequences for DNA Computing
2005-05-01
OF A SYBR-GREEN-BASED METHOD TO VALIDATE DNA SEQUENCES FOR DNA COMPUTING 6. AUTHOR(S) Wendy Pogozelski, Salvatore Priore, Matthew Bernard ...simulated annealing. Biochemistry, 35, 14077-14089. 15 Pogozelski, W.K., Bernard , M.P. and Macula, A. (2004) DNA code validation using...and Clark, B.F.C. (eds) In RNA Biochemistry and Biotechnology, NATO ASI Series, Kluwer Academic Publishers. Zucker, M. and Stiegler , P. (1981
The Genome of the Western Clawed Frog Xenopus tropicalis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hellsten, Uffe; Harland, Richard M.; Gilchrist, Michael J.
2009-10-01
The western clawed frog Xenopus tropicalis is an important model for vertebrate development that combines experimental advantages of the African clawed frog Xenopus laevis with more tractable genetics. Here we present a draft genome sequence assembly of X. tropicalis. This genome encodes over 20,000 protein-coding genes, including orthologs of at least 1,700 human disease genes. Over a million expressed sequence tags validated the annotation. More than one-third of the genome consists of transposable elements, with unusually prevalent DNA transposons. Like other tetrapods, the genome contains gene deserts enriched for conserved non-coding elements. The genome exhibits remarkable shared synteny with human and chicken over major parts of large chromosomes, broken by lineage-specific chromosome fusions and fissions, mainly in the mammalian lineage.
Shitara, M; Tsuboi, Y; Sekizuka, T; Tazumi, A; Moorei, J E; Millar, B C; Taneike, I; Matsuda, M
2008-01-01
Nucleotide sequences of approximately 3.1 kbp consisting of the full-length open reading frame (ORF) for grpE, a non-coding (NC) region and a putative ORF for the full-length dnaK gene (1860 bp) were identified from a urease-positive thermophilic Campylobacter (UPTC) CF89-12 isolate. Then, following the construction of a new degenerate polymerase chain reaction (PCR) primer pair for amplification of the dnaK structural gene, including the transcription terminator region of C. lari isolates, the dnaK region was amplified successfully, TA-cloned and sequenced in nine C. lari isolates. The dnaK gene sequences commenced with an ATG and terminated with a TAA in all 10 isolates, including CF89-12. In addition, the putative ORFs for the dnaK gene locus from the seven UPTC isolates consisted of 1860 bases, whereas those from the four urease-negative (UN) C. lari isolates, including the C. lari RM2100 reference strain, consisted of 1866 bases. Interestingly, different probable ribosome binding sites and hypothetical intrinsic ρ-independent terminator structures were identified in the seven UPTC and four UN C. lari isolates, respectively. Moreover, it is interesting to note that 20 out of a total of 28 polymorphic sites among the amino acid sequences of the dnaK ORF from the 11 C. lari isolates were identified as alternatively UPTC-specific or UN C. lari-specific. In the neighbour-joining tree based on the nucleotide sequence information of the dnaK gene, C. lari forms two major distinct clusters consisting of UPTC and UN C. lari isolates, respectively, with UN C. lari being more closely related to other thermophilic campylobacters than to UPTC.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dolan, Kyle T.; Duguid, Erica M.; He, Chuan
2011-11-17
SlyA is a master virulence regulator that controls the transcription of numerous genes in Salmonella enterica. We present here crystal structures of SlyA by itself and bound to a high-affinity DNA operator sequence in the slyA gene. SlyA interacts with DNA through direct recognition of a guanine base by Arg-65, as well as interactions between conserved Arg-86 and the minor groove and a large network of non-base-specific contacts with the sugar phosphate backbone. Our structures, together with an unpublished structure of SlyA bound to the small molecule effector salicylate (Protein Data Bank code 3DEU), reveal that, unlike many other MarR family proteins, SlyA dissociates from DNA without large conformational changes when bound to this effector. We propose that SlyA and other MarR global regulators rely more on indirect readout of DNA sequence to exert control over many genes, in contrast to proteins (such as OhrR) that recognize a single operator.
Singh, Vinod Kumar; Krishnamachari, Annangarachari
2016-09-01
Genome-wide experimental studies in Saccharomyces cerevisiae reveal that autonomous replicating sequence (ARS) requires an essential consensus sequence (ACS) for replication activity. Computational studies identified thousands of ACS like patterns in the genome. However, only a few hundreds of these sites act as replicating sites and the rest are considered as dormant or evolving sites. In a bid to understand the sequence makeup of replication sites, a content and context-based analysis was performed on a set of replicating ACS sequences that binds to origin-recognition complex (ORC) denoted as ORC-ACS and non-replicating ACS sequences (nrACS), that are not bound by ORC. In this study, DNA properties such as base composition, correlation, sequence dependent thermodynamic and DNA structural profiles, and their positions have been considered for characterizing ORC-ACS and nrACS. Analysis reveals that ORC-ACS depict marked differences in nucleotide composition and context features in its vicinity compared to nrACS. Interestingly, an A-rich motif was also discovered in ORC-ACS sequences within its nucleosome-free region. Profound changes in the conformational features, such as DNA helical twist, inclination angle and stacking energy between ORC-ACS and nrACS were observed. Distribution of ACS motifs in the non-coding segments points to the locations of ORC-ACS which are found far away from the adjacent gene start position compared to nrACS thereby enabling an accessible environment for ORC-proteins. Our attempt is novel in considering the contextual view of ACS and its flanking region along with nucleosome positioning in the S. cerevisiae genome and may be useful for any computational prediction scheme.
Garcia-Reyero, Natàlia; Griffitt, Robert J.; Liu, Li; Kroll, Kevin J.; Farmerie, William G.; Barber, David S.; Denslow, Nancy D.
2009-01-01
A novel custom microarray for largemouth bass (Micropterus salmoides) was designed with sequences obtained from a normalized cDNA library using the 454 Life Sciences GS-20 pyrosequencer. This approach yielded in excess of 58 million bases of high-quality sequence. The sequence information was combined with 2,616 reads obtained by traditional suppressive subtractive hybridizations to derive a total of 31,391 unique sequences. Annotation and coding sequences were predicted for these transcripts where possible. 16,350 annotated transcripts were selected as target sequences for the design of the custom largemouth bass oligonucleotide microarray. The microarray was validated by examining the transcriptomic response in male largemouth bass exposed to 17β-œstradiol. Transcriptomic responses were assessed in liver and gonad, and indicated gene expression profiles typical of exposure to œstradiol. The results demonstrate the potential to rapidly create the tools necessary to assess large scale transcriptional responses in non-model species, paving the way for expanded impact of toxicogenomics in ecotoxicology. PMID:19936325
Insertion of a self-splicing intron into the mtDNA of a triploblastic animal
DOE Office of Scientific and Technical Information (OSTI.GOV)
Valles, Y.; Halanych, K.; Boore, J.L.
2006-04-14
Nephtys longosetosa is a carnivorous polychaete worm that lives in the intertidal and subtidal zones with worldwide distribution (pleijel&rouse2001). Its mitochondrial genome has the characteristics typical of most metazoans: 37 genes; circular molecule; almost no intergenic sequence; and no significant gene rearrangements when compared to other annelid mtDNAs (booremoritz19981995). Ubiquitous features such as small intergenic regions and lack of introns suggested that metazoan mtDNAs are under strong selective pressures to reduce their genome size, allowing for faster replication (booremoritz19981995Lynch2005). Yet, in 1996 two group I introns were found in the mtDNA of the basal metazoan Metridium senile (FigureX). Breaking a long-standing rule (absence of introns in metazoan mtDNA), this finding was later supported by the further presence of group I introns in other cnidarians. Interestingly, only the class Anthozoa within cnidarians seems to harbor such introns. Although several hundred triploblastic metazoan mtDNAs have been sequenced, this study is the first evidence of mitochondrial introns in triploblastic metazoans. The cox1 gene of N. longosetosa has an intron of almost 2 kb in length. This finding also represents the first instance of a group II intron (anthozoans harbor group I introns) in any metazoan lineage. Opposite trends are observed within plant, fungal and protist mtDNAs, where introns (both group I and II) and other non-coding sequences are widespread. Plant, fungal and protist mtDNA structure and organization differ enormously from those of metazoan mtDNA. Both plant and fungal mtDNAs are dynamic molecules that undergo high rates of recombination, contain long intergenic spacer regions and harbor both group I and group II introns. However, like metazoans, they have a conserved gene content. Protists, on the other hand, have a striking variation of gene content and introns that accounts for the genome size variation. 
In contrast to this diversity of mtDNA structure and organization, current genome-level studies point to a monophyletic origin of the mitochondria (REFS), raising questions such as: what are the pressures at work shaping the evolution of the mitochondrial genome at 'higher' levels? What drives the absence of introns and other non-coding spacers in metazoan mtDNA? What characteristics must an intron have to be maintained in an environment where 'extra chromosomes' are usually selected against?
Bauer, Bianca S.; Forsyth, George W.; Sandmeyer, Lynne S.; Grahn, Bruce H.
2011-01-01
Mitochondrial transcription factor A (Tfam) has been implicated in the pathogenesis of retinal dysplasia in miniature schnauzer dogs, and it has been proposed that affected dogs have altered mitochondrial numbers, size, and morphology. To test these hypotheses, the Tfam gene of normal miniature schnauzer dogs and of those affected by retinal dysplasia was sequenced, and lymphocyte mitochondria were quantified, measured, and compared morphologically in normal and affected dogs using transmission electron microscopy. For Tfam sequencing, retina, retinal pigment epithelium (RPE), and whole blood samples were collected. Total RNA was isolated from the retina and RPE and reverse transcribed to make cDNA. Genomic DNA was extracted from white blood cell pellets obtained from the whole blood samples. The Tfam coding sequence, 5′ promoter region, intron 1 and the 3′ non-coding sequence of normal and affected dogs were amplified using polymerase chain reaction (PCR), cloned and sequenced. For electron microscopy, lymphocytes from affected and normal dogs were photographed and the mitochondria within each cross-section were identified, quantified, and the mitochondrial area (μm²) per lymphocyte cross-section was calculated. Lastly, using a masked technique, mitochondrial morphology was compared between the 2 groups. Sequencing of the miniature schnauzer Tfam gene revealed no functional sequence variation between affected and normal dogs. Lymphocyte and mitochondrial area, mitochondrial quantification, and morphology assessment also revealed no significant difference between the 2 groups. Further investigation into other candidate genes or factors causing retinal dysplasia in the miniature schnauzer is warranted. PMID:21731185
Algama, Manjula; Tasker, Edward; Williams, Caitlin; Parslow, Adam C; Bryson-Richardson, Robert J; Keith, Jonathan M
2017-03-27
Computational identification of non-coding RNAs (ncRNAs) is a challenging problem. We describe a genome-wide analysis using Bayesian segmentation to identify intronic elements highly conserved between three evolutionarily distant vertebrate species: human, mouse and zebrafish. We investigate the extent to which these elements include ncRNAs (or conserved domains of ncRNAs) and regulatory sequences. We identified 655 deeply conserved intronic sequences in a genome-wide analysis. We also performed a pathway-focussed analysis on genes involved in muscle development, detecting 27 intronic elements, of which 22 were not detected in the genome-wide analysis. At least 87% of the genome-wide and 70% of the pathway-focussed elements have existing annotations indicative of conserved RNA secondary structure. The expression of 26 of the pathway-focused elements was examined using RT-PCR, providing confirmation that they include expressed ncRNAs. Consistent with previous studies, these elements are significantly over-represented in the introns of transcription factors. This study demonstrates a novel, highly effective, Bayesian approach to identifying conserved non-coding sequences. Our results complement previous findings that these sequences are enriched in transcription factors. However, in contrast to previous studies which suggest the majority of conserved sequences are regulatory factor binding sites, the majority of conserved sequences identified using our approach contain evidence of conserved RNA secondary structures, and our laboratory results suggest most are expressed. Functional roles at DNA and RNA levels are not mutually exclusive, and many of our elements possess evidence of both. Moreover, ncRNAs play roles in transcriptional and post-transcriptional regulation, and this may contribute to the over-representation of these elements in introns of transcription factors. 
We attribute the higher sensitivity of the pathway-focussed analysis compared to the genome-wide analysis to improved alignment quality, suggesting that enhanced genomic alignments may reveal many more conserved intronic sequences.
Longkumer, Toshisangba; Kamireddy, Swetha; Muthyala, Venkateswar Reddy; Akbarpasha, Shaikh; Pitchika, Gopi Krishna; Kodetham, Gopinath; Ayaluru, Murali; Siddavattam, Dayananda
2013-01-01
While analyzing plasmids of Acinetobacter sp. DS002 we have detected a circular DNA molecule pTS236, which upon further investigation is identified as the genome of a phage. The phage genome has shown sequence similarity to the recently discovered Sphinx 2.36 DNA sequence co-purified with the Transmissible Spongiform Encephalopathy (TSE) particles isolated from infected brain samples collected from diverse geographical regions. As in Sphinx 2.36, the phage genome also codes for three proteins. One of them codes for RepA and is shown to be involved in replication of pTS236 through rolling circle (RC) mode. The other two translationally coupled ORFs, orf106 and orf96, code for coat proteins of the phage. Although an orf96 homologue was not previously reported in Sphinx 2.36, a closer examination of DNA sequence of Sphinx 2.36 revealed its presence downstream of orf106 homologue. TEM images and infection assays revealed existence of phage AbDs1 in Acinetobacter sp. DS002. PMID:23867905
Gilchrist, Anthony Stuart; Shearman, Deborah C A; Frommer, Marianne; Raphael, Kathryn A; Deshpande, Nandan P; Wilkins, Marc R; Sherwin, William B; Sved, John A
2014-12-20
The tephritid fruit flies include a number of economically important pests of horticulture, with a large accumulated body of research on their biology and control. Amongst the Tephritidae, the genus Bactrocera, containing over 400 species, presents various species groups of potential utility for genetic studies of speciation, behaviour or pest control. In Australia, there exists a triad of closely-related, sympatric Bactrocera species which do not mate in the wild but which, despite distinct morphologies and behaviours, can be force-mated in the laboratory to produce fertile hybrid offspring. To exploit the opportunities offered by genomics, such as the efficient identification of genetic loci central to pest behaviour and to the earliest stages of speciation, investigators require genomic resources for future investigations. We produced a draft de novo genome assembly of Australia's major tephritid pest species, Bactrocera tryoni. The male genome (650-700 Mbp) includes approximately 150 Mb of interspersed repetitive DNA sequences and 60 Mb of satellite DNA. Assessment using conserved core eukaryotic sequences indicated 98% completeness. Over 16,000 MAKER-derived gene models showed a large degree of overlap with other Dipteran reference genomes. The sequence of the ribosomal RNA transcribed unit was also determined. Unscaffolded assemblies of B. neohumeralis and B. jarvisi were then produced; comparison with B. tryoni showed that the species are more closely related than any Drosophila species pair. The similarity of the genomes was exploited to identify 4924 potentially diagnostic indels between the species, all of which occur in non-coding regions. This first draft B. tryoni genome resembles other dipteran genomes in terms of size and putative coding sequences. 
For all three species included in this study, we have identified a comprehensive set of non-redundant repetitive sequences, including the ribosomal RNA unit, and have quantified the major satellite DNA families. These genetic resources will facilitate the further investigations of genetic mechanisms responsible for the behavioural and morphological differences between these three species and other tephritids. We have also shown how whole genome sequence data can be used to generate simple diagnostic tests between very closely-related species where only one of the species is scaffolded.
Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil
2015-02-01
The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
TIR-NBS-LRR genes are rare in monocots: evidence from diverse monocot orders
Tarr, D Ellen K; Alexander, Helen M
2009-01-01
Background Plant resistance (R) gene products recognize pathogen effector molecules. Many R genes code for proteins containing nucleotide binding site (NBS) and C-terminal leucine-rich repeat (LRR) domains. NBS-LRR proteins can be divided into two groups, TIR-NBS-LRR and non-TIR-NBS-LRR, based on the structure of the N-terminal domain. Although both classes are clearly present in gymnosperms and eudicots, only non-TIR sequences have been found consistently in monocots. Since most studies in monocots have been limited to agriculturally important grasses, it is difficult to draw conclusions. The purpose of our study was to look for evidence of these sequences in additional monocot orders. Findings Using degenerate PCR, we amplified NBS sequences from four monocot species (C. blanda, D. marginata, S. trifasciata, and Spathiphyllum sp.), a gymnosperm (C. revoluta) and a eudicot (C. canephora). We successfully amplified TIR-NBS-LRR sequences from dicot and gymnosperm DNA, but not from monocot DNA. Using databases, we obtained NBS sequences from additional monocots, magnoliids and basal angiosperms. TIR-type sequences were not present in monocot or magnoliid sequences, but were present in the basal angiosperms. Phylogenetic analysis supported a single TIR clade and multiple non-TIR clades. Conclusion We were unable to find monocot TIR-NBS-LRR sequences by PCR amplification or database searches. In contrast to previous studies, our results represent five monocot orders (Poales, Zingiberales, Arecales, Asparagales, and Alismatales). Our results establish the presence of TIR-NBS-LRR sequences in basal angiosperms and suggest that although these sequences were present in early land plants, they have been reduced significantly in monocots and magnoliids. PMID:19785756
Yang, Jian-Hua; Li, Jun-Hao; Jiang, Shan; Zhou, Hui; Qu, Liang-Hu
2013-01-01
Long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) represent two classes of important non-coding RNAs in eukaryotes. Although these non-coding RNAs have been implicated in organismal development and in various human diseases, surprisingly little is known about their transcriptional regulation. Recent advances in chromatin immunoprecipitation with next-generation DNA sequencing (ChIP-Seq) have provided methods of detecting transcription factor binding sites (TFBSs) with unprecedented sensitivity. In this study, we describe ChIPBase (http://deepbase.sysu.edu.cn/chipbase/), a novel database that we have developed to facilitate the comprehensive annotation and discovery of transcription factor binding maps and transcriptional regulatory relationships of lncRNAs and miRNAs from ChIP-Seq data. The current release of ChIPBase includes high-throughput sequencing data that were generated by 543 ChIP-Seq experiments in diverse tissues and cell lines from six organisms. By analysing millions of TFBSs, we identified tens of thousands of TF-lncRNA and TF-miRNA regulatory relationships. Furthermore, two web-based servers were developed to annotate and discover transcriptional regulatory relationships of lncRNAs and miRNAs from ChIP-Seq data. In addition, we developed two genome browsers, deepView and genomeView, to provide integrated views of multidimensional data. Moreover, our web implementation supports diverse query types and the exploration of TFs, lncRNAs, miRNAs, gene ontologies and pathways.
Duret, Laurent; Cohen, Jean; Jubin, Claire; Dessen, Philippe; Goût, Jean-François; Mousset, Sylvain; Aury, Jean-Marc; Jaillon, Olivier; Noël, Benjamin; Arnaiz, Olivier; Bétermier, Mireille; Wincker, Patrick; Meyer, Eric; Sperling, Linda
2008-01-01
Ciliates are the only unicellular eukaryotes known to separate germinal and somatic functions. Diploid but silent micronuclei transmit the genetic information to the next sexual generation. Polyploid macronuclei express the genetic information from a streamlined version of the genome but are replaced at each sexual generation. The macronuclear genome of Paramecium tetraurelia was recently sequenced by a shotgun approach, providing access to the gene repertoire. The 72-Mb assembly represents a consensus sequence for the somatic DNA, which is produced after sexual events by reproducible rearrangements of the zygotic genome involving elimination of repeated sequences, precise excision of unique-copy internal eliminated sequences (IES), and amplification of the cellular genes to high copy number. We report use of the shotgun sequencing data (>10^6 reads representing 13× coverage of a completely homozygous clone) to evaluate variability in the somatic DNA produced by these developmental genome rearrangements. Although DNA amplification appears uniform, both of the DNA elimination processes produce sequence heterogeneity. The variability that arises from IES excision allowed identification of hundreds of putative new IESs, compared to 42 that were previously known, and revealed cases of erroneous excision of segments of coding sequences. We demonstrate that IESs in coding regions are under selective pressure to introduce premature termination of translation in case of excision failure. PMID:18256234
CORALINA: a universal method for the generation of gRNA libraries for CRISPR-based screening.
Köferle, Anna; Worf, Karolina; Breunig, Christopher; Baumann, Valentin; Herrero, Javier; Wiesbeck, Maximilian; Hutter, Lukas H; Götz, Magdalena; Fuchs, Christiane; Beck, Stephan; Stricker, Stefan H
2016-11-14
The bacterial CRISPR system is fast becoming the most popular genetic and epigenetic engineering tool due to its universal applicability and adaptability. The desire to deploy CRISPR-based methods in a large variety of species and contexts has created an urgent need for the development of easy, time- and cost-effective methods enabling large-scale screening approaches. Here we describe CORALINA (comprehensive gRNA library generation through controlled nuclease activity), a method for the generation of comprehensive gRNA libraries for CRISPR-based screens. CORALINA gRNA libraries can be derived from any source of DNA without the need of complex oligonucleotide synthesis. We show the utility of CORALINA for human and mouse genomic DNA, its reproducibility in covering the most relevant genomic features including regulatory, coding and non-coding sequences and confirm the functionality of CORALINA generated gRNAs. The simplicity and cost-effectiveness make CORALINA suitable for any experimental system. The unprecedented sequence complexities obtainable with CORALINA libraries are a necessary pre-requisite for less biased large scale genomic and epigenomic screens.
NASA Astrophysics Data System (ADS)
Kraljić, K.; Strüngmann, L.; Fimmel, E.; Gumbel, M.
2018-01-01
The genetic code is degenerate, and it is assumed that this redundancy provides error detection and correction mechanisms in the translation process. However, the biological meaning of the code's structure is still an open research question. This paper presents a Genetic Code Analysis Toolkit (GCAT) which provides workflows and algorithms for the analysis of the structure of nucleotide sequences. In particular, sets or sequences of codons can be transformed and tested for circularity, comma-freeness, dichotomic partitions and others. GCAT comes with a fertile editor custom-built to work with the genetic code and a batch mode for multi-sequence processing. With the ability to read FASTA files or load sequences from GenBank, the tool can be used for the mathematical and statistical analysis of existing sequence data. GCAT is Java-based and provides a plug-in concept for extensibility. Availability: Open source. Homepage: http://www.gcat.bio/
Self-organizing approach for meta-genomes.
Zhu, Jianfeng; Zheng, Wei-Mou
2014-12-01
We extend the self-organizing approach for annotation of a bacterial genome to analyze the raw sequencing data of the human gut metagenome without sequence assembling. The original approach divides the genomic sequence of a bacterium into non-overlapping segments of equal length and assigns to each segment one of seven 'phases', among which one is for the noncoding regions, three for the direct coding regions to indicate the three possible codon positions of the segment starting site, and three for the reverse coding regions. The noncoding phase and the six coding phases are described by two frequency tables of the 64 triplet types or 'codon usages'. A set of codon usages can be used to update the phase assignment and vice versa. An iteration after an initialization leads to a convergent phase assignment to give an annotation of the genome. In the extension of the approach to a metagenome, we consider a mixture model of a number of categories described by different codon usages. The Illumina Genome Analyzer sequencing data of the total DNA from faecal samples are then examined to understand the diversity of the human gut microbiome. Copyright © 2014 Elsevier Ltd. All rights reserved.
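The phase-assignment step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the phase numbering (0 for noncoding, 1 to 3 for direct coding at frame offsets 0 to 2, 4 to 6 for the same offsets on the reverse complement), the pseudocount, the use of a mean log-probability per triplet, and the toy sequences are all assumptions made for illustration. Only one assignment pass is shown; the iterative re-estimation of the two tables from the assigned segments is omitted.

```python
import math
from collections import Counter
from itertools import product

TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def triplets(seq, offset):
    """Non-overlapping triplets of seq, reading frame starting at offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def usage_table(in_frame_seqs, pseudocount=1.0):
    """Frequency table over the 64 triplets (hypothetical smoothing choice)."""
    counts = Counter(t for s in in_frame_seqs for t in triplets(s, 0))
    total = sum(counts[t] for t in TRIPLETS) + pseudocount * len(TRIPLETS)
    return {t: (counts[t] + pseudocount) / total for t in TRIPLETS}


def mean_loglik(seq, offset, table):
    """Mean log-probability per triplet, so frames of different size compare."""
    ts = triplets(seq, offset)
    return sum(math.log(table[t]) for t in ts) / len(ts)


def assign_phase(segment, coding, noncoding):
    """Pick the best of seven phases for one segment (assumed numbering)."""
    rc = revcomp(segment)
    scores = {0: mean_loglik(segment, 0, noncoding)}
    for k in range(3):
        scores[1 + k] = mean_loglik(segment, k, coding)   # direct strand
        scores[4 + k] = mean_loglik(rc, k, coding)        # reverse strand
    return max(scores, key=scores.get)


# Toy tables: a coding table trained on an artificial repeat, uniform noncoding.
coding = usage_table(["ATGAAACCC" * 20])
noncoding = {t: 1.0 / 64 for t in TRIPLETS}

shifted = ("ATGAAACCC" * 5)[1:31]   # codons start at offset 2 in this segment
print(assign_phase(shifted, coding, noncoding))
print(assign_phase("ACGT" * 8, coding, noncoding))
```

In a full iteration, the segments assigned to each phase would be used to re-estimate the coding and noncoding tables, and the assignment would be repeated until it stops changing.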
Reinhardt, Josephine A.; Wanjiru, Betty M.; Brant, Alicia T.; Saelao, Perot; Begun, David J.; Jones, Corbin D.
2013-01-01
How non-coding DNA gives rise to new protein-coding genes (de novo genes) is not well understood. Recent work has revealed the origins and functions of a few de novo genes, but common principles governing the evolution or biological roles of these genes are unknown. To better define these principles, we performed a parallel analysis of the evolution and function of six putatively protein-coding de novo genes described in Drosophila melanogaster. Reconstruction of the transcriptional history of de novo genes shows that two de novo genes emerged from novel long non-coding RNAs that arose at least 5 MY prior to evolution of an open reading frame. In contrast, four other de novo genes evolved a translated open reading frame and transcription within the same evolutionary interval suggesting that nascent open reading frames (proto-ORFs), while not required, can contribute to the emergence of a new de novo gene. However, none of the genes arose from proto-ORFs that existed long before expression evolved. Sequence and structural evolution of de novo genes was rapid compared to nearby genes and the structural complexity of de novo genes steadily increases over evolutionary time. Despite the fact that these genes are transcribed at a higher level in males than females, and are most strongly expressed in testes, RNAi experiments show that most of these genes are essential in both sexes during metamorphosis. This lethality suggests that protein coding de novo genes in Drosophila quickly become functionally important. PMID:24146629
Superimposed Code Theoretic Analysis of DNA Codes and DNA Computing
2008-01-01
complements of one another and the DNA duplex formed is a Watson-Crick (WC) duplex. However, there are many instances when the formation of non-WC...that the user's requirements for probe selection are met based on the Watson-Crick probe locality within a target. The second type, called...AFRL-RI-RS-TR-2007-288 Final Technical Report January 2008 SUPERIMPOSED CODE THEORETIC ANALYSIS OF DNA CODES AND DNA COMPUTING
Vladimirov, N V; Likhoshvaĭ, V A; Matushkin, Iu G
2007-01-01
Gene expression is known to correlate with the degree of codon bias in many unicellular organisms. However, such correlation is absent in some organisms. Recently we demonstrated that inverted complementary repeats within a coding DNA sequence must be considered for proper estimation of translation efficiency, since they may form secondary structures that obstruct ribosome movement. We have developed a program for estimating the potential expression of coding DNA sequences in a defined unicellular organism using its genome sequence. The program computes an elongation efficiency index. Computation is based on estimation of coding DNA sequence elongation efficiency, taking into account three key factors: codon bias, average number of inverted complementary repeats, and free energy of potential stem-loop structures formed by the repeats. The influence of these factors on translation is numerically estimated. An optimal proportion of these factors is computed for each organism individually. Quantitative translational characteristics of 384 unicellular organisms (351 bacteria, 28 archaea, 5 eukaryota) have been computed using their annotated genomes from NCBI GenBank. Five potential evolutionary strategies of translational optimization have been determined among the studied organisms. A considerable difference in preferred translational strategies between Bacteria and Archaea has been revealed. Significant correlations between the elongation efficiency index and gene expression levels have been shown for two organisms (S. cerevisiae and H. pylori) using available microarray data. The proposed method allows numerical estimation of coding DNA sequence translation efficiency and optimization of the nucleotide composition of heterologous genes in unicellular organisms. http://www.mgs.bionet.nsc.ru/mgs/programs/eei-calculator/.
Kangaroo – A pattern-matching program for biological sequences
2002-01-01
Background Biologists are often interested in performing a simple database search to identify proteins or genes that contain a well-defined sequence pattern. Many databases do not provide straightforward or readily available query tools to perform simple searches, such as identifying transcription binding sites, protein motifs, or repetitive DNA sequences. However, in many cases simple pattern-matching searches can reveal a wealth of information. We present in this paper a regular expression pattern-matching tool that was used to identify short repetitive DNA sequences in human coding regions for the purpose of identifying potential mutation sites in mismatch repair deficient cells. Results Kangaroo is a web-based regular expression pattern-matching program that can search for patterns in DNA, protein, or coding region sequences in ten different organisms. The program is implemented to facilitate a wide range of queries with no restriction on the length or complexity of the query expression. The program is accessible on the web at http://bioinfo.mshri.on.ca/kangaroo/ and the source code is freely distributed at http://sourceforge.net/projects/slritools/. Conclusion A low-level simple pattern-matching application can prove to be a useful tool in many research settings. For example, Kangaroo was used to identify potential genetic targets in a human colorectal cancer variant that is characterized by a high frequency of mutations in coding regions containing mononucleotide repeats. PMID:12150718
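The kind of query described above, finding mononucleotide repeats in a coding sequence with a regular expression, can be sketched in a few lines of Python. This is only an illustration using the standard `re` module, not Kangaroo's own query syntax or code; the function name, the minimum run length of 6, and the example sequence are assumptions.

```python
import re


def find_mononucleotide_repeats(seq, min_len=6):
    """Return (start, end, base, length) for every run of a single base
    that is at least min_len long. Coordinates are 0-based, end-exclusive."""
    # ([ACGT]) captures one base; \1{n,} requires at least n more copies.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.end(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(seq.upper())
    ]


# Hypothetical coding sequence with an A8 run and a G6 run.
cds = "ATGGC" + "A" * 8 + "GTCC" + "G" * 6 + "TTAA"
hits = find_mononucleotide_repeats(cds, min_len=6)
for start, end, base, length in hits:
    print(f"{base} x {length} at [{start}, {end})")
```

A backreference keeps the pattern independent of which base repeats, so one expression covers A, C, G and T runs.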
Decoding the non-coding RNAs in Alzheimer's disease.
Schonrock, Nicole; Götz, Jürgen
2012-11-01
Non-coding RNAs (ncRNAs) are integral components of biological networks with fundamental roles in regulating gene expression. They can integrate sequence information from the DNA code, epigenetic regulation and functions of multimeric protein complexes to potentially determine the epigenetic status and transcriptional network in any given cell. Humans potentially contain more ncRNAs than any other species, especially in the brain, where they may well play a significant role in human development and cognitive ability. This review discusses their emerging role in Alzheimer's disease (AD), a human pathological condition characterized by the progressive impairment of cognitive functions. We discuss the complexity of the ncRNA world and how this is reflected in the regulation of the amyloid precursor protein and Tau, two proteins with central functions in AD. By understanding this intricate regulatory network, there is hope for a better understanding of disease mechanisms and ultimately developing diagnostic and therapeutic tools.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Leong, JoAnn Ching
The nucleotide sequence of the IHNV glycoprotein gene has been determined from a cDNA clone containing the entire coding region. The glycoprotein cDNA clone contained a leader sequence of 48 bases, a coding region of 1524 nucleotides, and 39 bases at the 3' end. The entire cDNA clone contains 1609 nucleotides and encodes a protein of 508 amino acids. The deduced amino acid sequence gave a translated molecular weight of 56,795 daltons. A hydropathicity profile of the deduced amino acid sequence indicated that there were two major hydrophobic domains: one, at the N-terminus, delineating a signal peptide of 18 amino acids and the other, at the C-terminus, delineating the transmembrane region. Five possible sites of N-linked glycosylation were identified. Although no nucleic acid homology existed between the IHNV glycoprotein gene and the glycoprotein genes of rabies and VSV, there was significant homology at the amino acid level between all three rhabdovirus glycoproteins.
Biomimetic Artificial Epigenetic Code for Targeted Acetylation of Histones.
Taniguchi, Junichi; Feng, Yihong; Pandian, Ganesh N; Hashiya, Fumitaka; Hidaka, Takuya; Hashiya, Kaori; Park, Soyoung; Bando, Toshikazu; Ito, Shinji; Sugiyama, Hiroshi
2018-06-13
While the central role of locus-specific acetylation of histone proteins in eukaryotic gene expression is well established, the availability of designer tools to regulate acetylation at particular nucleosome sites remains limited. Here, we develop a unique strategy to introduce acetylation by constructing a bifunctional molecule designated Bi-PIP. Bi-PIP has a P300/CBP-selective bromodomain inhibitor (Bi) as a P300/CBP recruiter and a pyrrole-imidazole polyamide (PIP) as a sequence-selective DNA binder. Biochemical assays verified that Bi-PIPs recruit P300 to the nucleosomes having their target DNA sequences and extensively accelerate acetylation. Bi-PIPs also activated transcription of genes that have corresponding cognate DNA sequences inside living cells. Our results demonstrate that Bi-PIPs could act as a synthetic programmable histone code of acetylation, which emulates the bromodomain-mediated natural propagation system of histone acetylation to activate gene expression in a sequence-selective manner.
Wright, Imogen A.; Travers, Simon A.
2014-01-01
The challenge presented by high-throughput sequencing necessitates the development of novel tools for accurate alignment of reads to reference sequences. Current approaches focus on using heuristics to map reads quickly to large genomes, rather than generating highly accurate alignments in coding regions. Such approaches are, thus, unsuited for applications such as amplicon-based analysis and the realignment phase of exome sequencing and RNA-seq, where accurate and biologically relevant alignment of coding regions is critical. To facilitate such analyses, we have developed a novel tool, RAMICS, that is tailored to mapping large numbers of sequence reads to short lengths (<10 000 bp) of coding DNA. RAMICS utilizes profile hidden Markov models to discover the open reading frame of each sequence and aligns to the reference sequence in a biologically relevant manner, distinguishing between genuine codon-sized indels and frameshift mutations. This approach facilitates the generation of highly accurate alignments, accounting for the error biases of the sequencing machine used to generate reads, particularly at homopolymer regions. Performance improvements are gained through the use of graphics processing units, which increase the speed of mapping through parallelization. RAMICS substantially outperforms all other mapping approaches tested in terms of alignment quality while maintaining highly competitive speed performance. PMID:24861618
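The distinction between codon-sized indels and frameshifts comes down to whether a gap length is a multiple of three. The Python sketch below shows only that check on an already gapped pairwise alignment; it does not reproduce RAMICS or its profile hidden Markov models, and the function name, the `-` gap character, and the example alignment are assumptions.

```python
import re


def classify_indels(aligned_ref, aligned_read):
    """Classify each gap run in a gapped pairwise alignment.

    A gap run in the reference row is an insertion in the read; a gap run
    in the read row is a deletion. Runs whose length is a multiple of 3
    preserve the reading frame; all others shift it.
    Returns sorted (alignment_column, kind, length, label) tuples.
    """
    if len(aligned_ref) != len(aligned_read):
        raise ValueError("aligned strings must have equal length")
    events = []
    for kind, row in (("insertion", aligned_ref), ("deletion", aligned_read)):
        for m in re.finditer(r"-+", row):
            n = m.end() - m.start()
            label = "codon-sized" if n % 3 == 0 else "frameshift"
            events.append((m.start(), kind, n, label))
    return sorted(events)


# Hypothetical alignment: a 3-base insertion and a 1-base deletion.
ref = "ATGGCT---AAAGGTTAA"
read = "ATGGCTCCCAAAG-TTAA"
events = classify_indels(ref, read)
for event in events:
    print(event)
```

A length check alone cannot tell a genuine frameshift from a homopolymer sequencing error, which is why an aligner for this task would also need an error model for the sequencing platform.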
Arnaiz, Olivier; Mathy, Nathalie; Baudry, Céline; Malinsky, Sophie; Aury, Jean-Marc; Denby Wilkes, Cyril; Garnier, Olivier; Labadie, Karine; Lauderdale, Benjamin E.; Le Mouël, Anne; Marmignon, Antoine; Nowacki, Mariusz; Poulain, Julie; Prajer, Malgorzata; Wincker, Patrick; Meyer, Eric; Duharcourt, Sandra; Duret, Laurent; Bétermier, Mireille; Sperling, Linda
2012-01-01
Insertions of parasitic DNA within coding sequences are usually deleterious and are generally counter-selected during evolution. Thanks to nuclear dimorphism, ciliates provide unique models to study the fate of such insertions. Their germline genome undergoes extensive rearrangements during development of a new somatic macronucleus from the germline micronucleus following sexual events. In Paramecium, these rearrangements include precise excision of unique-copy Internal Eliminated Sequences (IES) from the somatic DNA, requiring the activity of a domesticated piggyBac transposase, PiggyMac. We have sequenced Paramecium tetraurelia germline DNA, establishing a genome-wide catalogue of ∼45,000 IESs, in order to gain insight into their evolutionary origin and excision mechanism. We obtained direct evidence that PiggyMac is required for excision of all IESs. Homology with known P. tetraurelia Tc1/mariner transposons, described here, indicates that at least a fraction of IESs derive from these elements. Most IES insertions occurred before a recent whole-genome duplication that preceded diversification of the P. aurelia species complex, but IES invasion of the Paramecium genome appears to be an ongoing process. Once inserted, IESs decay rapidly by accumulation of deletions and point substitutions. Over 90% of the IESs are shorter than 150 bp and present a remarkable size distribution with a ∼10 bp periodicity, corresponding to the helical repeat of double-stranded DNA and suggesting DNA loop formation during assembly of a transpososome-like excision complex. IESs are equally frequent within and between coding sequences; however, excision is not 100% efficient and there is selective pressure against IES insertions, in particular within highly expressed genes. 
We discuss the possibility that ancient domestication of a piggyBac transposase favored subsequent propagation of transposons throughout the germline by allowing insertions in coding sequences, a fraction of the genome in which parasitic DNA is not usually tolerated. PMID:23071448
[Cloning and sequencing of KIR2DL1 framework gene cDNA and identification of a novel allele].
Sun, Ge; Wang, Chang; Zhen, Jianxin; Zhang, Guobin; Xu, Yunping; Deng, Zhihui
2016-10-01
To develop an assay for cDNA cloning and haplotype sequencing of the KIR2DL1 framework gene and determine the genotype of an ethnic Han individual from southern China. Total RNA was isolated from a peripheral blood sample, and complementary DNA (cDNA) was synthesized by RT-PCR. The entire coding sequence of the KIR2DL1 framework gene was amplified with a pair of KIR2DL1-specific PCR primers. The PCR products, with a length of approximately 1.2 kb, were then subjected to cloning and haplotype sequencing. A specific target fragment of the KIR2DL1 framework gene was obtained. Following allele separation, a wild-type KIR2DL1*00302 allele and a novel variant allele, KIR2DL1*031, were identified. Sequence alignment with KIR2DL1 alleles from the IPD-KIR Database showed that the novel allele KIR2DL1*031 differs from the closest allele, KIR2DL1*00302, by a non-synonymous mutation at CDS nt 188A>G (codon 42 GAG>GGG) in exon 4, which causes the amino acid change Glu42Gly. The sequence of the novel allele KIR2DL1*031 was submitted to GenBank under the accession number KP025960 and to the IPD-KIR Database under the submission number IWS40001982. The name KIR2DL1*031 has been officially assigned by the World Health Organization (WHO) Nomenclature Committee. An assay for cDNA cloning and haplotype sequencing of KIR2DL1 has been established, which has broad applications in KIR studies at the allelic level.
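The reported change can be checked with a short Python sketch. It assumes that CDS numbering is 1-based from the first base of the start codon, and it includes only the two entries of the standard genetic code that are needed here; the function names are hypothetical.

```python
# Two entries of the standard genetic code, enough for this variant.
CODON_TO_AA = {"GAG": "Glu", "GGG": "Gly"}


def position_in_codon(cds_nt):
    """1-based position (1, 2 or 3) of a 1-based CDS nucleotide in its codon."""
    return (cds_nt - 1) % 3 + 1


def apply_substitution(codon, pos_in_codon, ref, alt):
    """Return the codon after replacing ref with alt at the given position."""
    if codon[pos_in_codon - 1] != ref:
        raise ValueError("reference base does not match the codon")
    return codon[:pos_in_codon - 1] + alt + codon[pos_in_codon:]


pos = position_in_codon(188)                      # nt 188 of the CDS
mutant = apply_substitution("GAG", pos, "A", "G")
print(pos, mutant, CODON_TO_AA["GAG"], "->", CODON_TO_AA[mutant])
```

Under this numbering, nt 188 falls on the second base of its codon, which agrees with GAG>GGG. Counted from the start codon, that codon would be number 63 rather than 42; the difference presumably reflects amino acid numbering that starts at the mature protein, which is an assumption and is not stated in the abstract.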
Chao, Tianle; Wang, Guizhi; Wang, Jianmin; Liu, Zhaohua; Ji, Zhibin; Hou, Lei; Zhang, Chunlan
2016-01-01
High-throughput mRNA sequencing enables the discovery of new transcripts and additional parts of incompletely annotated transcripts. Compared with the human and cow genomes, the reference annotation level of the sheep genome is still low. An investigation of new transcripts in sheep skeletal muscle will improve our understanding of muscle development. Therefore, applying high-throughput sequencing, two cDNA libraries from the biceps brachii of small-tailed Han sheep and Dorper sheep were constructed, and whole-transcriptome analysis was performed to determine the unknown transcript catalogue of this tissue. In this study, 40,129 transcripts were finally mapped to the sheep genome. Among them, 3,467 transcripts were determined to be unannotated in the current reference sheep genome and were defined as new transcripts. Based on protein-coding capacity prediction and comparative analysis of sequence similarity, 246 transcripts were classified as portions of unannotated genes or incompletely annotated genes. Another 1,520 transcripts were predicted with high confidence to be long non-coding RNAs. Our analysis also revealed 334 new transcripts that displayed specific expression in ruminants and uncovered a number of new transcripts without intergenus homology but with specific expression in sheep skeletal muscle. The results confirmed a complex transcript pattern of coding and non-coding RNA in sheep skeletal muscle. This study provided important information concerning the sheep genome and transcriptome annotation, which could provide a basis for further study.
Enyeart, Peter J; Mohr, Georg; Ellington, Andrew D; Lambowitz, Alan M
2014-01-13
Mobile group II introns are bacterial retrotransposons that combine the activities of an autocatalytic intron RNA (a ribozyme) and an intron-encoded reverse transcriptase to insert site-specifically into DNA. They recognize DNA target sites largely by base pairing of sequences within the intron RNA and achieve high DNA target specificity by using the ribozyme active site to couple correct base pairing to RNA-catalyzed intron integration. Algorithms have been developed to program the DNA target site specificity of several mobile group II introns, allowing them to be made into 'targetrons.' Targetrons function for gene targeting in a wide variety of bacteria and typically integrate at efficiencies high enough to be screened easily by colony PCR, without the need for selectable markers. Targetrons have found wide application in microbiological research, enabling gene targeting and genetic engineering of bacteria that had been intractable to other methods. Recently, a thermostable targetron has been developed for use in bacterial thermophiles, and new methods have been developed for using targetrons to position recombinase recognition sites, enabling large-scale genome-editing operations, such as deletions, inversions, insertions, and 'cut-and-pastes' (that is, translocation of large DNA segments), in a wide range of bacteria at high efficiency. Using targetrons in eukaryotes presents challenges due to the difficulties of nuclear localization and sub-optimal magnesium concentrations, although supplementation with magnesium can increase integration efficiency, and directed evolution is being employed to overcome these barriers. Finally, spurred by new methods for expressing group II intron reverse transcriptases that yield large amounts of highly active protein, thermostable group II intron reverse transcriptases from bacterial thermophiles are being used as research tools for a variety of applications, including qRT-PCR and next-generation RNA sequencing (RNA-seq). 
The high processivity and fidelity of group II intron reverse transcriptases along with their novel template-switching activity, which can directly link RNA-seq adaptor sequences to cDNAs during reverse transcription, open new approaches for RNA-seq and the identification and profiling of non-coding RNAs, with potentially wide applications in research and biotechnology.
Tau mRNA 3'UTR-to-CDS ratio is increased in Alzheimer disease.
García-Escudero, Vega; Gargini, Ricardo; Martín-Maestro, Patricia; García, Esther; García-Escudero, Ramón; Avila, Jesús
2017-08-10
Neurons frequently show an imbalance in expression of the 3' untranslated region (3'UTR) relative to the coding DNA sequence (CDS) region of mature messenger RNAs (mRNA). The ratio varies among different cells or parts of the brain. The Map2 protein levels per cell depend on the 3'UTR-to-CDS ratio rather than the total mRNA amount, which suggests powerful regulation of protein expression by 3'UTR sequences. Here we found that MAPT (the microtubule-associated protein tau gene) 3'UTR levels are particularly high with respect to other genes; indeed, the 3'UTR-to-CDS ratio of MAPT is balanced in healthy brain in mouse and human. The tau protein accumulates in Alzheimer diseased brain. We nonetheless observed that the levels of RNA encoding MAPT/tau were diminished in these patients' brains. To explain this apparently contradictory result, we studied MAPT mRNA stoichiometry in coding and non-coding regions, and found that the 3'UTR-to-CDS ratio was higher in the hippocampus of Alzheimer disease patients, with higher tau protein but lower total mRNA levels. Our data indicate that changes in the 3'UTR-to-CDS ratio have a regulatory role in the disease. Future research should thus consider not only mRNA levels, but also the ratios between coding and non-coding regions. Copyright © 2017 Elsevier B.V. All rights reserved.
Chernicky, C L; Tan, H; Burfeind, P; Ilan, J; Ilan, J
1996-02-01
Several cell types within the placenta produce cytokines that can contribute to the regulatory mechanisms ensuring normal pregnancy. The immunological milieu at the maternofetal interface is considered to be crucial for survival of the fetus. Interleukin-2 (IL-2) is expressed by the syncytiotrophoblast, the cell layer between the mother and the fetus. IL-2 appears to be a key factor in maintenance of pregnancy. Therefore, it was important to determine the sequence of human placental interleukin-2. The coding region of human placental IL-2 cDNA was determined by direct sequencing. Subclone sequencing was carried out for the 5'- and 3'-untranslated regions (5'-UTR and 3'-UTR). The 5'-UTR of human placental IL-2 cDNA is 294 bp, which is 247 nucleotides longer than that reported for IL-2 cDNA derived from T cells. The sequence of the coding region is identical to that reported for T cell IL-2, while sequence analysis of the polymerase chain reaction (PCR) product showed that the cDNA from the 3' end was the same as that reported for cDNA from T cells. Human placental IL-2 cDNA is 1,028 base pairs (excluding the poly A tail), which is 247 bp longer at the 5' end than that reported for T cell IL-2 cDNA. Therefore, the extended 5'-UTR of the placental IL-2 cDNA may be a consequence of alternative promoter utilization in the placenta.
Hutchins, Andrew Paul; Pei, Duanqing
Transposable elements (TEs) are mobile genomic sequences of DNA capable of autonomous and non-autonomous duplication. TEs have been highly successful, and nearly half of the human genome now consists of various families of TEs. Originally thought to be non-functional, these elements have been co-opted by animal genomes to perform a variety of physiological functions ranging from TE-derived proteins acting directly in normal biological functions, to innovations in transcription factor logic and influence on epigenetic control of gene expression. During embryonic development, when the genome is epigenetically reprogrammed and DNA-demethylated, TEs are released from repression and show embryonic stage-specific expression, and in human and mouse embryos, intact TE-derived endogenous viral particles can even be detected. A similar process occurs during the reprogramming of somatic cells to pluripotent cells: When the somatic DNA is demethylated, TEs are released from repression. In embryonic stem cells (ESCs), where DNA is hypomethylated, an elaborate system of epigenetic control is employed to suppress TEs, a system that often overlaps with normal epigenetic control of ESC gene expression. Finally, many long non-coding RNAs (lncRNAs) involved in normal ESC function and those assisting or impairing reprogramming contain multiple TEs in their RNA. These TEs may act as regulatory units to recruit RNA-binding proteins and epigenetic modifiers. This review covers how TEs are interlinked with the epigenetic machinery and lncRNAs, and how these links influence each other to modulate aspects of ESCs, embryogenesis, and somatic cell reprogramming.
[The ENCODE project and functional genomics studies].
Ding, Nan; Qu, Hongzhu; Fang, Xiangdong
2014-03-01
Since the completion of the Human Genome Project, scientists have been trying to interpret the underlying genomic code for human biology. Since 2003, the National Human Genome Research Institute (NHGRI) has invested nearly $0.3 billion and gathered over 440 scientists from more than 32 institutions in the United States, China, United Kingdom, Japan, Spain and Singapore to initiate the Encyclopedia of DNA Elements (ENCODE) project, aiming to identify and analyze all regulatory elements in the human genome. Taking advantage of the development of next-generation sequencing technologies and continuous improvement of experimental methods, ENCODE has made remarkable achievements: it identified methylation and histone modification of DNA sequences and their regulatory effects on gene expression through altered chromatin structure, categorized binding sites of various transcription factors and constructed their regulatory networks, further revised and updated the databases for pseudogenes and non-coding RNA, and identified SNPs in regulatory sequences associated with diseases. These findings help to comprehensively understand the information embedded in gene and genome sequences, the function of regulatory elements and the molecular mechanism underlying transcriptional regulation by noncoding regions, and they provide an extensive data resource for the life sciences, particularly for translational medicine. We reviewed the contributions of high-throughput sequencing platform development and bioinformatics technology improvement to the ENCODE project, the association between epigenetics studies and the ENCODE project, and the major achievements of the ENCODE project. We also provided our perspective on the role of the ENCODE project in promoting the development of basic and clinical medicine.
GobyWeb: Simplified Management and Analysis of Gene Expression and DNA Methylation Sequencing Data
Dorff, Kevin C.; Chambwe, Nyasha; Zeno, Zachary; Simi, Manuele; Shaknovich, Rita; Campagne, Fabien
2013-01-01
We present GobyWeb, a web-based system that facilitates the management and analysis of high-throughput sequencing (HTS) projects. The software provides integrated support for a broad set of HTS analyses and offers a simple plugin extension mechanism. Analyses currently supported include quantification of gene expression for messenger and small RNA sequencing, estimation of DNA methylation (i.e., reduced bisulfite sequencing and whole genome methyl-seq), or the detection of pathogens in sequenced data. In contrast to previous analysis pipelines developed for analysis of HTS data, GobyWeb requires significantly less storage space, runs analyses efficiently on a parallel grid, scales gracefully to process tens or hundreds of multi-gigabyte samples, yet can be used effectively by researchers who are comfortable using a web browser. We conducted performance evaluations of the software and found it to either outperform or have similar performance to analysis programs developed for specialized analyses of HTS data. We found that most biologists who took a one-hour GobyWeb training session were readily able to analyze RNA-Seq data with state of the art analysis tools. GobyWeb can be obtained at http://gobyweb.campagnelab.org and is freely available for non-commercial use. GobyWeb plugins are distributed in source code and licensed under the open source LGPL3 license to facilitate code inspection, reuse and independent extensions http://github.com/CampagneLaboratory/gobyweb2-plugins. PMID:23936070
Molecular cloning of chitinase 33 (chit33) gene from Trichoderma atroviride
Matroudi, S.; Zamani, M.R.; Motallebi, M.
2008-01-01
In this study, Trichoderma atroviride was selected as an overproducer of chitinase among 30 different isolates of Trichoderma sp. on the basis of chitinase specific activity. From this isolate, the genomic and cDNA clones encoding chit33 were isolated and sequenced. Comparison of the genomic and cDNA sequences to define the gene structure indicates that this gene contains three short introns and an open reading frame coding for a protein of 321 amino acids. The deduced amino acid sequence includes a 19 aa putative signal peptide. Homology between this sequence and other reported Trichoderma Chit33 proteins is discussed. The coding sequence of the chit33 gene was cloned in the pEt26b(+) expression vector and expressed in E. coli. PMID:24031242
DNA sequence-dependent mechanics and protein-assisted bending in repressor-mediated loop formation
Boedicker, James Q.; Garcia, Hernan G.; Johnson, Stephanie; Phillips, Rob
2014-01-01
As the chief informational molecule of life, DNA is subject to extensive physical manipulations. The energy required to deform double-helical DNA depends on sequence, and this mechanical code of DNA influences gene regulation, such as through nucleosome positioning. Here we examine the sequence-dependent flexibility of DNA in bacterial transcription factor-mediated looping, a context for which the role of sequence remains poorly understood. Using a suite of synthetic constructs repressed by the Lac repressor and two well-known sequences that show large flexibility differences in vitro, we make precise statistical mechanical predictions as to how DNA sequence influences loop formation and test these predictions using in vivo transcription and in vitro single-molecule assays. Surprisingly, sequence-dependent flexibility does not affect in vivo gene regulation. By theoretically and experimentally quantifying the relative contributions of sequence and the DNA-bending protein HU to DNA mechanical properties, we reveal that bending by HU dominates DNA mechanics and masks intrinsic sequence-dependent flexibility. Such a quantitative understanding of how mechanical regulatory information is encoded in the genome will be a key step towards a predictive understanding of gene regulation at single-base pair resolution. PMID:24231252
Kshirsagar, Rucha; Khan, Krishnendu; Joshi, Mamata V; Hosur, Ramakrishna V; Muniyappa, K
2017-05-23
A plethora of evidence suggests that different types of DNA quadruplexes are widely present in the genome of all organisms. The existence of a growing number of proteins that selectively bind and/or process these structures underscores their biological relevance. Moreover, G-quadruplex DNA has been implicated in the alignment of four sister chromatids by forming parallel guanine quadruplexes during meiosis; however, the underlying mechanism is not well defined. Here we show that a G/C-rich motif associated with a meiosis-specific DNA double-strand break (DSB) in Saccharomyces cerevisiae folds into G-quadruplex, and the C-rich sequence complementary to the G-rich sequence forms an i-motif. The presence of G-quadruplex or i-motif structures upstream of the green fluorescent protein-coding sequence markedly reduces the levels of gfp mRNA expression in S. cerevisiae cells, with a concomitant decrease in green fluorescent protein abundance, and blocks primer extension by DNA polymerase, thereby demonstrating the functional significance of these structures. Surprisingly, although S. cerevisiae Hop1, a component of synaptonemal complex axial/lateral elements, exhibits strong affinity to G-quadruplex DNA, it displays a much weaker affinity for the i-motif structure. However, the Hop1 C-terminal but not the N-terminal domain possesses strong i-motif binding activity, implying that the C-terminal domain has a distinct substrate specificity. Additionally, we found that Hop1 promotes intermolecular pairing between G/C-rich DNA segments associated with a meiosis-specific DSB site. Our results support the idea that the G/C-rich motifs associated with meiosis-specific DSBs fold into intramolecular G-quadruplex and i-motif structures, both in vitro and in vivo, thus revealing an important link between non-B form DNA structures and Hop1 in meiotic chromosome synapsis and recombination. Copyright © 2017 Biophysical Society. Published by Elsevier Inc. All rights reserved.
2013-01-01
Background Significant efforts have been made to address the problem of identifying short genes in prokaryotic genomes. However, most known methods are not effective in detecting short genes. Because of the limited information contained in short DNA sequences, it is very difficult to accurately distinguish between protein coding and non-coding sequences in prokaryotic genomes. We have developed a new Iteratively Adaptive Sparse Partial Least Squares (IASPLS) algorithm as the classifier to improve the accuracy of the identification process. Results For testing, we chose the short coding and non-coding sequences from seven prokaryotic organisms. We used seven feature sets (including GC content, Z-curve, etc.) of short genes. In comparison with the GeneMarkS, Metagene, Orphelia, and Heuristic Approach methods, our model achieved the best prediction performance in identification of short prokaryotic genes. Even when we focused on the very short length group ([60–100 nt)), our model provided sensitivity as high as 83.44% and specificity as high as 92.8%. These values are two or three times higher than those of three of the other methods, while Metagene fails to recognize genes in this length range. The experiments also showed that IASPLS can improve the identification accuracy in comparison with other widely used classifiers, i.e. Logistic, Random Forest (RF) and K nearest neighbors (KNN). The accuracy obtained with IASPLS was 5.90% or more higher than with the other methods. In addition to the improvements in accuracy, IASPLS required ten times less computer time than KNN or RF. Conclusions We conclude that our method is preferable for application as an automated method of short gene classification. Its linearity and easily optimized parameters make it practicable for predicting short genes of newly-sequenced or under-studied species. 
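As a reading aid for the feature and performance terms used above, the following Python sketch computes GC content (one of the listed feature sets) and the standard sensitivity and specificity measures from confusion-matrix counts. It does not implement IASPLS; the function names are assumptions introduced here, and the definitions are the textbook ones, which the abstract does not spell out.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real coding sequences that were predicted as coding."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of real non-coding sequences that were predicted as non-coding."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Toy inputs chosen only to exercise the functions; not data from the study.
    print(gc_content("ATGGCGCTATAA"))
    print(sensitivity(true_pos=8, false_neg=2))
    print(specificity(true_neg=9, false_pos=1))
```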
Reviewers This article was reviewed by Alexey Kondrashov, Rajeev Azad (nominated by Dr J.Peter Gogarten) and Yuriy Fofanov (nominated by Dr Janet Siefert). PMID:24067167
Early Evolution of Conserved Regulatory Sequences Associated with Development in Vertebrates
McEwen, Gayle K.; Goode, Debbie K.; Parker, Hugo J.; Woolfe, Adam; Callaway, Heather; Elgar, Greg
2009-01-01
Comparisons between diverse vertebrate genomes have uncovered thousands of highly conserved non-coding sequences, an increasing number of which have been shown to function as enhancers during early development. Despite their extreme conservation over 500 million years from humans to cartilaginous fish, these elements appear to be largely absent in invertebrates, and, to date, there has been little understanding of their mode of action or the evolutionary processes that have modelled them. We have now exploited emerging genomic sequence data for the sea lamprey, Petromyzon marinus, to explore the depth of conservation of this type of element in the earliest diverging extant vertebrate lineage, the jawless fish (agnathans). We searched for conserved non-coding elements (CNEs) at 13 human gene loci and identified lamprey elements associated with all but two of these gene regions. Although markedly shorter and less well conserved than within jawed vertebrates, identified lamprey CNEs are able to drive specific patterns of expression in zebrafish embryos, which are almost identical to those driven by the equivalent human elements. These CNEs are therefore a unique and defining characteristic of all vertebrates. Furthermore, alignment of lamprey and other vertebrate CNEs should permit the identification of persistent sequence signatures that are responsible for common patterns of expression and contribute to the elucidation of the regulatory language in CNEs. Identifying the core regulatory code for development, common to all vertebrates, provides a foundation upon which regulatory networks can be constructed and might also illuminate how large conserved regulatory sequence blocks evolve and become fixed in genomic DNA. PMID:20011110
The Public Goods Hypothesis for the evolution of life on Earth
McInerney, James O; Pisani, Davide; Bapteste, Eric; O'Connell, Mary J
2011-08-23
It is becoming increasingly difficult to reconcile the observed extent of horizontal gene transfers with the central metaphor of a great tree uniting all evolving entities on the planet. In this manuscript we describe the Public Goods Hypothesis and show that it is appropriate in order to describe biological evolution on the planet. According to this hypothesis, nucleotide sequences (genes, promoters, exons, etc.) are simply seen as goods, passed from organism to organism through both vertical and horizontal transfer. Public goods sequences are defined by having the properties of being largely non-excludable (no organism can be effectively prevented from accessing these sequences) and non-rival (while such a sequence is being used by one organism it is also available for use by another organism). The universal nature of genetic systems ensures that such non-excludable sequences exist and non-excludability explains why we see a myriad of genes in different combinations in sequenced genomes. There are three features of the public goods hypothesis. Firstly, segments of DNA are seen as public goods, available for all organisms to integrate into their genomes. Secondly, we expect the evolution of mechanisms for DNA sharing and of defense mechanisms against DNA intrusion in genomes. Thirdly, we expect that we do not see a global tree-like pattern. Instead, we expect local tree-like patterns to emerge from the combination of a commonage of genes and vertical inheritance of genomes by cell division. Indeed, while genes are theoretically public goods, in reality, some genes are excludable, particularly, though not only, when they have variant genetic codes or behave as coalition or club goods, available for all organisms of a coalition to integrate into their genomes, and non-rival within the club. 
We view the Tree of Life hypothesis as a regionalized instance of the Public Goods hypothesis, just like classical mechanics and euclidean geometry are seen as regionalized instances of quantum mechanics and Riemannian geometry respectively. We argue for this change using an axiomatic approach that shows that the Public Goods hypothesis is a better accommodation of the observed data than the Tree of Life hypothesis. PMID:21861918
BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone.
Yang, Bite; Liu, Feng; Ren, Chao; Ouyang, Zhangyi; Xie, Ziwei; Bo, Xiaochen; Shu, Wenjie
2017-07-01
Enhancer elements are noncoding stretches of DNA that play key roles in controlling gene expression programmes. Despite major efforts to develop accurate enhancer prediction methods, identifying enhancer sequences continues to be a challenge in the annotation of mammalian genomes. One of the major issues is the lack of large, sufficiently comprehensive and experimentally validated enhancers for humans or other species. Thus, the development of computational methods based on limited experimentally validated enhancers and deciphering the transcriptional regulatory code encoded in the enhancer sequences is urgent. We present a deep-learning-based hybrid architecture, BiRen, which predicts enhancers using the DNA sequence alone. Our results demonstrate that BiRen can learn common enhancer patterns directly from the DNA sequence and exhibits superior accuracy, robustness and generalizability in enhancer prediction relative to other state-of-the-art enhancer predictors based on sequence characteristics. BiRen will enable researchers to acquire a deeper understanding of the regulatory code of enhancer sequences. The BiRen method can be freely accessed at https://github.com/wenjiegroup/BiRen. Contact: shuwj@bmi.ac.cn or boxc@bmi.ac.cn. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved.
1-deoxy-d-xylulose-5-phosphate reductoisomerases and method of use
Croteau, Rodney B.; Lange, Bernd M.
2001-01-01
The present invention relates to isolated DNA sequences which code for the expression of plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein, such as the sequence presented in SEQ ID NO:1 which encodes a 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein from peppermint (Mentha x piperita). Additionally, the present invention relates to isolated plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein. In other aspects, the present invention is directed to replicable recombinant cloning vehicles comprising a nucleic acid sequence which codes for a plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase, to modified host cells transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence of the invention.
1-deoxy-D-xylulose-5-phosphate reductoisomerases, and methods of use
Croteau, Rodney B.; Lange, Bernd M.
2002-07-16
The present invention relates to isolated DNA sequences which code for the expression of plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein, such as the sequence presented in SEQ ID NO:1 which encodes a 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein from peppermint (Mentha x piperita). Additionally, the present invention relates to isolated plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein. In other aspects, the present invention is directed to replicable recombinant cloning vehicles comprising a nucleic acid sequence which codes for a plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase, to modified host cells transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence of the invention.
Numerical classification of coding sequences
NASA Technical Reports Server (NTRS)
Collins, D. W.; Liu, C. C.; Jukes, T. H.
1992-01-01
DNA sequences coding for protein may be represented by counts of nucleotides or codons. A complete reading frame may be abbreviated by its base count, e.g. A76C158G121T74, or with the corresponding codon table, e.g. (AAA)0(AAC)1(AAG)9 ... (TTT)0. We propose that these numerical designations be used to augment current methods of sequence annotation. Because base counts and codon tables do not require revision as knowledge of function evolves, they are well-suited to act as cross-references, for example to identify redundant GenBank entries. These descriptors may be compared, in place of DNA sequences, to extract homologous genes from large databases. This approach permits rapid searching with good selectivity.
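The two descriptors proposed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `base_count_label` and `codon_table` are introduced here, the base order A, C, G, T follows the example label in the abstract, and the input is assumed to be an in-frame reading frame containing only unambiguous bases.

```python
from collections import Counter
from itertools import product


def base_count_label(seq: str) -> str:
    """Abbreviate a sequence by its base counts, e.g. 'A6C0G1T2'."""
    counts = Counter(seq.upper())
    return "".join(f"{base}{counts.get(base, 0)}" for base in "ACGT")


def codon_table(seq: str) -> dict:
    """Count every one of the 64 codons in an in-frame coding sequence."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    observed = Counter(seq[i:i + 3] for i in range(0, len(seq), 3))
    # Enumerate codons in alphabetical order, AAA first and TTT last,
    # so that absent codons are listed with a count of zero.
    all_codons = ("".join(bases) for bases in product("ACGT", repeat=3))
    return {codon: observed.get(codon, 0) for codon in all_codons}


if __name__ == "__main__":
    toy_orf = "ATGAAATAA"  # toy reading frame, not a GenBank entry
    print(base_count_label(toy_orf))
    table = codon_table(toy_orf)
    print({codon: n for codon, n in table.items() if n})
```

Because both descriptors are short and order-independent summaries, two database entries with identical labels are candidates for being redundant copies of the same reading frame, which is the cross-referencing use the abstract suggests.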
Transformable Rhodobacter strains, method for producing transformable Rhodobacter strains
Laible, Philip D.; Hanson, Deborah K.
2018-05-08
The invention provides an organism for expressing foreign DNA, the organism engineered to accept standard DNA carriers. The genome of the organism codes for intracytoplasmic membranes and features an interruption in at least one of the genes coding for restriction enzymes. Further provided is a system for producing biological materials comprising: selecting a vehicle to carry DNA which codes for the biological materials; determining sites on the vehicle's DNA sequence susceptible to restriction enzyme cleavage; choosing an organism to accept the vehicle based on that organism not acting upon at least one of said vehicle's sites; engineering said vehicle to contain said DNA; thereby creating a synthetic vector; and causing the synthetic vector to enter the organism so as cause expression of said DNA.
[Hepatitis C virus: sequence homology of a European isolate and divergence from the prototype].
Seelig, R; Seelig, H P; Renz, M
1991-08-01
The polymerase chain reaction (PCR) detected specific hepatitis C viral (HCV) RNA sequences in liver biopsies from two patients with chronic hepatitis, in the tissue of a liver transplant, in plasma from four chronic non-A, non-B hepatitis (NANBH) patients and, for the first time, in an infectious anti-D-immunoglobulin preparation. A comparison of the viral sequences coding for a region of the nonstructural NS3 protein from the liver tissues revealed only a very small degree of sequence divergence at both the cDNA and the amino acid level (between 0 and 5%). The sequence similarities of the RNA isolated from plasma of the four chronic NANBH patients and from the anti-D-immunoglobulin preparation were in part somewhat lower but still high overall (between 90 and 100%). In contrast, all eight cDNA and amino acid sequences exhibited a significantly higher degree of divergence from the HCV prototype sequence (between 29 and 32%) than among themselves (between 0 and 10%). This unexpectedly high sequence similarity among the eight European isolates and their low homology to the North American prototype sequence indicate the existence of different types of HCV. This will be important not only for epidemiological studies but also for the development of effective diagnostic procedures and vaccines. Concerning the pathogenesis of NANBH, a double infection or a helper mechanism has to be considered: in addition to the C virus, sequences of another virus particle were found in the infectious IgG preparation as well as in the liver biopsies.
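The divergence percentages quoted above compare pairs of aligned sequences. The abstract does not state how divergence was computed; the Python sketch below assumes the simplest definition, the percentage of mismatched positions between two pre-aligned sequences of equal length, skipping positions where either sequence has a gap. The function name `percent_divergence` is introduced here for illustration.

```python
def percent_divergence(seq_a: str, seq_b: str) -> float:
    """Percentage of mismatches between two aligned, equal-length sequences.

    Alignment columns containing a gap character ('-') in either
    sequence are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != "-" and b != "-"
    ]
    if not compared:
        raise ValueError("no ungapped positions to compare")
    mismatches = sum(1 for a, b in compared if a != b)
    return 100.0 * mismatches / len(compared)


if __name__ == "__main__":
    # Toy sequences, not HCV data: one mismatch in ten positions.
    print(percent_divergence("ACGTACGTAC", "ACGTACGTAT"))
```

Under this definition, percent similarity is simply 100 minus percent divergence, which matches the way the abstract moves between the two terms.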
Cloning and sequence analysis of a cDNA clone coding for the mouse GM2 activator protein.
Bellachioma, G; Stirling, J L; Orlacchio, A; Beccari, T
1993-01-01
A cDNA (1.1 kb) containing the complete coding sequence for the mouse GM2 activator protein was isolated from a mouse macrophage library using a cDNA for the human protein as a probe. There was a single ATG located 12 bp from the 5' end of the cDNA clone followed by an open reading frame of 579 bp. Northern blot analysis of mouse macrophage RNA showed that there was a single band with a mobility corresponding to a size of 2.3 kb. We deduce from this that the mouse mRNA, in common with the mRNA for the human GM2 activator protein, has a long 3' untranslated sequence of approx. 1.7 kb. Alignment of the mouse and human deduced amino acid sequences showed 68% identity overall and 75% identity for the sequence on the C-terminal side of the first 31 residues, which in the human GM2 activator protein contains the signal peptide. Hydropathicity plots showed great similarity between the mouse and human sequences even in regions of low sequence similarity. There is a single N-glycosylation site in the mouse GM2 activator protein sequence (Asn151-Phe-Thr) which differs in its location from the single site reported in the human GM2 activator protein sequence (Asn63-Val-Thr). PMID:7689829
Ishikawa, Sohta A; Inagaki, Yuji; Hashimoto, Tetsuo
2012-01-01
In phylogenetic analyses of nucleotide sequences, 'homogeneous' substitution models, which assume the stationarity of base composition across a tree, are widely used, albeit individual sequences may bear distinctive base frequencies. In the worst-case scenario, a homogeneous model-based analysis can yield an artifactual union of two distantly related sequences that achieved similar base frequencies in parallel. Such potential difficulty can be countered by two approaches, 'RY-coding' and 'non-homogeneous' models. The former approach converts four bases into purine and pyrimidine to normalize base frequencies across a tree, while the heterogeneity in base frequency is explicitly incorporated in the latter approach. The two approaches have been applied to real-world sequence data; however, their basic properties have not been fully examined by pioneering simulation studies. Here, we assessed the performances of the maximum-likelihood analyses incorporating RY-coding and a non-homogeneous model (RY-coding and non-homogeneous analyses) on simulated data with parallel convergence to similar base composition. Both RY-coding and non-homogeneous analyses showed superior performances compared with homogeneous model-based analyses. Curiously, the performance of RY-coding analysis appeared to be significantly affected by a setting of the substitution process for sequence simulation relative to that of non-homogeneous analysis. The performance of a non-homogeneous analysis was also validated by analyzing a real-world sequence data set with significant base heterogeneity.
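RY-coding, as described above, collapses the four bases into two classes: purines (A, G) become R and pyrimidines (C, T) become Y. A minimal sketch of that recoding follows; the function name `ry_code` is hypothetical, and passing gaps and ambiguity codes through unchanged is an assumption of this sketch rather than something the abstract specifies.

```python
# Purines (A, G) -> R; pyrimidines (C, T, and U for RNA) -> Y.
RY_MAP = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"})


def ry_code(seq: str) -> str:
    """Recode a nucleotide sequence into the two-state R/Y alphabet.

    Characters outside A/C/G/T/U (gaps, N, etc.) are left unchanged.
    """
    return seq.upper().translate(RY_MAP)


# Two toy sequences with very different GC content map to the same R/Y string,
# which is how the recoding evens out base-composition differences.
print(ry_code("ATATAT"))  # RYRYRY
print(ry_code("GCGCGC"))  # RYRYRY
```

The price of the recoding is that all transitions (A<->G, C<->T) become invisible, so only transversions carry signal in the recoded alignment.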
Feldhoff, A; Wetzel, T; Peters, D; Kellner, R; Krczal, G
1998-01-01
With the introduction of cutting-grown Petunia x hybrida plants on the European market, a new potyvirus which showed no serological reaction with antisera against any other potyviruses infecting petunias was discovered. Infected leaves contained flexuous rod-shaped virus particles of 750-800 nm in length and inclusion bodies (pinwheel structures) typical for potyviruses in ultrathin leaf sections. The purified coat protein with a Mr of approximately 36 kDa could be detected in Western immunoblots with a specific antibody to the coat protein of the petunia-infecting virus. The 3' end of the viral genome encompassing the 3' non-coding region, the coat protein gene, and part of the NIb gene was amplified from infected leaf material by IC/PCR using degenerate and specific primers. Sequences of PCR-generated cDNA clones were compared to other known sequences of potyviruses. Maximum homology of 56% was found in the 3' non-coding region between the petunia isolate and other potyviruses. A maximum homology of 69% was found between the amino acid sequence of the coat protein of the petunia isolate and corresponding sequences of other potyviruses. These data indicate that the petunia-infecting virus is a previously undescribed potyvirus and the name petunia flower mottle virus (PetFMV) is suggested.
BeerDeCoded: the open beer metagenome project
Sobel, Jonathan; Henry, Luc; Rotman, Nicolas; Rando, Gianpaolo
2017-01-01
Next generation sequencing has radically changed research in the life sciences, in both academic and corporate laboratories. The potential impact is tremendous, yet a majority of citizens have little or no understanding of the technological and ethical aspects of this widespread adoption. We designed BeerDeCoded as a pretext to discuss the societal issues related to genomic and metagenomic data with fellow citizens, while advancing scientific knowledge of the most popular beverage of all. In the spirit of citizen science, sample collection and DNA extraction were carried out with the participation of non-scientists in the community laboratory of Hackuarium, a not-for-profit organisation that supports unconventional research and promotes the public understanding of science. The dataset presented herein contains the targeted metagenomic profile of 39 bottled beers from 5 countries, based on internal transcribed spacer (ITS) sequencing of fungal species. A preliminary analysis reveals the presence of a large diversity of wild yeast species in commercial brews. With this project, we demonstrate that coupling simple laboratory procedures that can be carried out in a non-professional environment with state-of-the-art sequencing technologies and targeted metagenomic analyses, can lead to the detection and identification of the microbial content in bottled beer. PMID:29123645
Enzyme-free detection and quantification of double-stranded nucleic acids.
Feuillie, Cécile; Merheb, Maxime Mohamad; Gillet, Benjamin; Montagnac, Gilles; Hänni, Catherine; Daniel, Isabelle
2012-08-01
We have developed a fully enzyme-free SERRS hybridization assay for specific detection of double-stranded DNA sequences. Although all DNA detection methods ranging from PCR to high-throughput sequencing rely on enzymes, this method is unique for being totally non-enzymatic. The efficiency of enzymatic processes is affected by alterations, modifications, and/or quality of DNA. For instance, a limitation of most DNA polymerases is their inability to process DNA damaged by blocking lesions. As a result, enzymatic amplification and sequencing of degraded DNA often fail. In this study we succeeded in detecting and quantifying, within a mixture, relative amounts of closely related double-stranded DNA sequences from Rupicapra rupicapra (chamois) and Capra hircus (goat). The non-enzymatic SERRS assay presented here is the corner stone of a promising approach to overcome the failure of DNA polymerase when DNA is too degraded or when the concentration of polymerase inhibitors is too high. It is the first time double-stranded DNA has been detected with a truly non-enzymatic SERRS-based method. This non-enzymatic, inexpensive, rapid assay is therefore a breakthrough in nucleic acid detection.
Chang, Vivian Y.; Federman, Noah; Martinez-Agosto, Julian; Tatishchev, Sergei F.; Nelson, Stanley F.
2014-01-01
Background Gastric adenocarcinoma is a rare diagnosis in childhood. A 14-year old male patient presented with metastatic gastric adenocarcinoma, and a strong family history of colon cancer. Clinical sequencing of CDH1 and APC were negative. Whole exome sequencing was therefore applied to capture the majority of protein-coding regions for the identification of single-nucleotide variants, small insertion/deletions, and copy number abnormalities in the patient’s germline as well as primary tumor. Materials and Methods DNA was extracted from the patient’s blood, primary tumor, and the unaffected mother’s blood. DNA libraries were constructed and sequenced on Illumina HiSeq2000. Data were post-processed using Picard and Samtools, then analyzed with the Genome Analysis Toolkit. Variants were annotated using an in-house Ensembl-based program. Copy number was assessed using ExomeCNV. Results Each sample was sequenced to a mean depth of coverage of greater than 120×. A rare non-synonymous coding SNV in TP53 was identified in the germline. There were 10 somatic cancer protein-damaging variants that were not observed in the unaffected mother genome. ExomeCNV comparing tumor to the patient’s germline, identified abnormal copy number, spanning 6,946 genes. Conclusion We present an unusual case of Li-Fraumeni detected by whole exome sequencing. There were also likely driver somatic mutations in the gastric adenocarcinoma. These results highlight the need for more thorough and broad scale germline and cancer analyses to accurately inform patients of inherited risk to cancer and to identify somatic mutations. PMID:23015295
Recombinant pinoresinol/lariciresinol reductase, recombinant dirigent protein, and methods of use
Lewis, Norman G.; Davin, Laurence B.; Dinkova-Kostova, Albena T.; Fujita, Masayuki; Gang, David R.; Sarkanen, Simo; Ford, Joshua D.
2001-04-03
Dirigent proteins and pinoresinol/lariciresinol reductases have been isolated, together with cDNAs encoding dirigent proteins and pinoresinol/lariciresinol reductases. Accordingly, isolated DNA sequences are provided which code for the expression of dirigent proteins and pinoresinol/lariciresinol reductases. In other aspects, replicable recombinant cloning vehicles are provided which code for dirigent proteins or pinoresinol/lariciresinol reductases or for a base sequence sufficiently complementary to at least a portion of dirigent protein or pinoresinol/lariciresinol reductase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding dirigent protein or pinoresinol/lariciresinol reductase. Thus, systems and methods are provided for the recombinant expression of dirigent proteins and/or pinoresinol/lariciresinol reductases.
Genomic treasure troves: complete genome sequencing of herbarium and insect museum specimens.
Staats, Martijn; Erkens, Roy H J; van de Vossenberg, Bart; Wieringa, Jan J; Kraaijeveld, Ken; Stielow, Benjamin; Geml, József; Richardson, James E; Bakker, Freek T
2013-01-01
Unlocking the vast genomic diversity stored in natural history collections would create unprecedented opportunities for genome-scale evolutionary, phylogenetic, domestication and population genomic studies. Many researchers have been discouraged from using historical specimens in molecular studies because of both generally limited success of DNA extraction and the challenges associated with PCR-amplifying highly degraded DNA. In today's next-generation sequencing (NGS) world, opportunities and prospects for historical DNA have changed dramatically, as most NGS methods are actually designed for taking short fragmented DNA molecules as templates. Here we show that using a standard multiplex and paired-end Illumina sequencing approach, genome-scale sequence data can be generated reliably from dry-preserved plant, fungal and insect specimens collected up to 115 years ago, and with minimal destructive sampling. Using a reference-based assembly approach, we were able to produce the entire nuclear genome of a 43-year-old Arabidopsis thaliana (Brassicaceae) herbarium specimen with high and uniform sequence coverage. Nuclear genome sequences of three fungal specimens of 22-82 years of age (Agaricus bisporus, Laccaria bicolor, Pleurotus ostreatus) were generated with 81.4-97.9% exome coverage. Complete organellar genome sequences were assembled for all specimens. Using de novo assembly we retrieved between 16.2-71.0% of coding sequence regions, and hence remain somewhat cautious about prospects for de novo genome assembly from historical specimens. Non-target sequence contaminations were observed in 2 of our insect museum specimens. We anticipate that future museum genomics projects will perhaps not generate entire genome sequences in all cases (our specimens contained relatively small and low-complexity genomes), but at least generating vital comparative genomic data for testing (phylo)genetic, demographic and genetic hypotheses, that become increasingly more horizontal. Furthermore, NGS of historical DNA enables recovering crucial genetic information from old type specimens that to date have remained mostly unutilized and, thus, opens up a new frontier for taxonomic research as well.
Design Pattern Mining Using Distributed Learning Automata and DNA Sequence Alignment
Esmaeilpour, Mansour; Naderifar, Vahideh; Shukur, Zarina
2014-01-01
Context Over the last decade, design patterns have been used extensively to generate reusable solutions to frequently encountered problems in software engineering and object-oriented programming. A design pattern is a repeatable software design solution that provides a template for solving various instances of a general problem. Objective This paper describes a new method for pattern mining that isolates design patterns and the relationships between them, and a related tool, DLA-DNA, for all implemented patterns and all projects used for evaluation. Based on distributed learning automata (DLA) and deoxyribonucleic acid (DNA) sequence alignment, DLA-DNA achieves acceptable precision and recall compared with the other evaluated tools. Method The proposed method mines structural design patterns in object-oriented source code and extracts the strong and weak relationships between them, enabling analyzers and programmers to determine the dependency rate of each object, component, and other section of the code for parameter passing and modular programming. The proposed model detects design patterns, and the strengths of their relationships, better than the other available tools, namely Pinot, PTIDEJ and DPJF. Results The results demonstrate that, whether the source code is built in a standard or non-standard way with respect to the design patterns, the result of the proposed method is close to that of DPJF and better than those of Pinot and PTIDEJ. The proposed model was tested on several source codes and compared with other related models and available tools; the results show that the precision and recall of the proposed method are, on average, 20% and 9.6% higher than Pinot, 27% and 31% higher than PTIDEJ, and 3.3% and 2% higher than DPJF, respectively. Conclusion The primary idea of the proposed method is organized in two steps: in the first step, elemental design patterns are identified, while in the second step, these are composed to recognize actual design patterns. PMID:25243670
Rhipicephalus microplus strain Deutsch, 10 BAC clone sequences
USDA-ARS's Scientific Manuscript database
The cattle tick, Rhipicephalus (Boophilus) microplus, has a genome over 2.4 times the size of the human genome, and with over 70% of repetitive DNA, this genome would prove very costly to sequence at today's prices and difficult to assemble and analyze. We used labeled DNA probes from the coding reg...
Franc, M A; Cohen, N; Warner, A W; Shaw, P M; Groenen, P; Snapir, A
2011-04-01
DNA samples collected in clinical trials and stored for future research are valuable to pharmaceutical drug development. Given the perceived higher risk associated with genetic research, industry has implemented complex coding methods for DNA. Following years of experience with these methods and with addressing questions from institutional review boards (IRBs), ethics committees (ECs) and health authorities, the industry has started reexamining the extent of the added value offered by these methods. With the goal of harmonization, the Industry Pharmacogenomics Working Group (I-PWG) conducted a survey to gain an understanding of company practices for DNA coding and to solicit opinions on their effectiveness at protecting privacy. The results of the survey and the limitations of the coding methods are described. The I-PWG recommends dialogue with key stakeholders regarding coding practices such that equal standards are applied to DNA and non-DNA samples. The I-PWG believes that industry standards for privacy protection should provide adequate safeguards for DNA and non-DNA samples/data and suggests a need for more universal standards for samples stored for future research.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kimelman, Aya; Levy, Asaf; Sberro, Hila
In the process of clone-based genome sequencing, initial assemblies frequently contain cloning gaps that can be resolved using cloning-independent methods, but the reason for their occurrence is largely unknown. By analyzing 9,328,693 sequencing clones from 393 microbial genomes we systematically mapped more than 15,000 genes residing in cloning gaps and experimentally showed that their expression products are toxic to the Escherichia coli host. A subset of these toxic sequences was further evaluated through a series of functional assays exploring the mechanisms of their toxicity. Among these genes our assays revealed novel toxins and restriction enzymes, and new classes of small non-coding toxic RNAs that reproducibly inhibit E. coli growth. Further analyses also revealed abundant, short toxic DNA fragments that were predicted to suppress E. coli growth by interacting with the replication initiator dnaA. Our results show that cloning gaps, once considered the result of technical problems, actually serve as a rich source for the discovery of biotechnologically valuable functions, and suggest new modes of antimicrobial interventions.
Blochlinger, K; Diggelmann, H
1984-12-01
The DNA coding sequence for the hygromycin B phosphotransferase gene was placed under the control of the regulatory sequences of a cloned long terminal repeat of Moloney sarcoma virus. This construction allowed direct selection for hygromycin B resistance after transfection of eucaryotic cell lines not naturally resistant to this antibiotic, thus providing another dominant marker for DNA transfer in eucaryotic cells. PMID:6098829
Non-B-DNA structures on the interferon-beta promoter?
Robbe, K; Bonnefoy, E
1998-01-01
The high mobility group (HMG) I protein intervenes as an essential factor during the virus-induced expression of the interferon-beta (IFN-beta) gene. It is a non-histone chromatin-associated protein that has the dual capacity of binding to a non-B-DNA structure such as cruciform DNA as well as to AT-rich B-DNA sequences. In this work we compare the binding affinity of HMGI for a synthetic cruciform DNA to its binding affinity for the HMGI binding site present in the positive regulatory domain II (PRDII) of the IFN-beta promoter. Using gel retardation experiments, we show that HMGI protein binds with at least ten times more affinity to the synthetic cruciform DNA structure than to the PRDII B-DNA sequence. DNA hairpin sequences are present in both the human and the murine PRDII DNAs. We discuss in this work the presence of as yet putative non-B-DNA structures in the IFN-beta promoter.
Wright, Imogen A; Travers, Simon A
2014-07-01
The challenge presented by high-throughput sequencing necessitates the development of novel tools for accurate alignment of reads to reference sequences. Current approaches focus on using heuristics to map reads quickly to large genomes, rather than generating highly accurate alignments in coding regions. Such approaches are, thus, unsuited for applications such as amplicon-based analysis and the realignment phase of exome sequencing and RNA-seq, where accurate and biologically relevant alignment of coding regions is critical. To facilitate such analyses, we have developed a novel tool, RAMICS, that is tailored to mapping large numbers of sequence reads to short lengths (<10 000 bp) of coding DNA. RAMICS utilizes profile hidden Markov models to discover the open reading frame of each sequence and aligns to the reference sequence in a biologically relevant manner, distinguishing between genuine codon-sized indels and frameshift mutations. This approach facilitates the generation of highly accurate alignments, accounting for the error biases of the sequencing machine used to generate reads, particularly at homopolymer regions. Performance improvements are gained through the use of graphics processing units, which increase the speed of mapping through parallelization. RAMICS substantially outperforms all other mapping approaches tested in terms of alignment quality while maintaining highly competitive speed performance. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
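The distinction drawn above between genuine codon-sized indels and frameshift mutations comes down to whether the indel length is a multiple of three. The sketch below shows only that underlying rule; it is not RAMICS's profile hidden Markov model approach, and the function name `classify_indel` is hypothetical.

```python
def classify_indel(length: int) -> str:
    """Classify an insertion or deletion in coding DNA by its length in bases.

    A length divisible by 3 adds or removes whole codons and keeps the
    downstream reading frame; any other length shifts the frame.
    """
    if length <= 0:
        raise ValueError("indel length must be a positive number of bases")
    return "codon-sized" if length % 3 == 0 else "frameshift"


for n in (1, 3, 4, 6):
    print(n, classify_indel(n))
```

In homopolymer regions a single-base length error from the sequencer would be reported as a frameshift by this rule, which is why an aligner aimed at coding regions needs to weigh such calls against the sequencing error profile.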
Palindromic repetitive DNA elements with coding potential in Methanocaldococcus jannaschii.
Suyama, Mikita; Lathe, Warren C; Bork, Peer
2005-10-10
We have identified 141 novel palindromic repetitive elements in the genome of euryarchaeon Methanocaldococcus jannaschii. The total length of these elements is 14.3kb, which corresponds to 0.9% of the total genomic sequence and 6.3% of all extragenic regions. The elements can be divided into three groups (MJRE1-3) based on the sequence similarity. The low sequence identity within each of the groups suggests rather old origin of these elements in M. jannaschii. Three MJRE2 elements were located within the protein coding regions without disrupting the coding potential of the host genes, indicating that insertion of repeats might be a widespread mechanism to enhance sequence diversity in coding regions.
Vongvanrungruang, A; Mongkolsiriwatana, C; Boonkaew, T; Sawatdichaikul, O; Srikulnath, K; Peyachoknagul, S
2016-09-19
The fragrance gene, betaine aldehyde dehydrogenase 2 (Badh2), has been well studied in many plant species. The objectives of this study were to clone Badh2 and compare the sequences between aromatic and non-aromatic coconuts. The complete coding region was cloned from cDNA of both aromatic and non-aromatic coconuts. The nucleotide sequences were highly homologous to Badh2 genes of other plants. Badh2 consisted of a 1512-bp open reading frame encoding 503 amino acids. A single nucleotide difference between aromatic and non-aromatic coconuts resulted in the conversion of alanine (non-aromatic) to proline (aromatic) at position 442, which was the substrate binding site of BADH2. The ring side chain of proline could destabilize the structure leading to a non-functional enzyme. Badh2 genomic DNA was cloned from exon 1 to 4, and from exon 5 to 15 from the two coconut types, except for intron 4 that was very long. The intron sequences of the two coconut groups were highly homologous. No differences in Badh2 expression were found among the tissues of aromatic coconut or between aromatic and non-aromatic coconuts. The amino acid sequences of BADH2 from coconut and other plants were compared and the genetic relationship was analyzed using MEGA 7.0. The phylogenetic tree reconstructed by the Bayesian information criterion consisted of two distinct groups of monocots and dicots. Among the monocots, coconut (Cocos nucifera) and oil palm (Elaeis guineensis) were the most closely related species. A marker for coconut differentiation was developed from one-base substitution site and could be successfully used.
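The two figures reported for Badh2 (a 1512-bp open reading frame and 503 amino acids) agree if the ORF length is assumed to include the stop codon: 1512 / 3 = 504 codons, one of which terminates translation. The short check below makes that arithmetic explicit; the function name `orf_protein_length` and the `includes_stop` flag are illustrative assumptions.

```python
def orf_protein_length(orf_bp: int, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by an ORF of the given length in base pairs."""
    if orf_bp <= 0 or orf_bp % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    codons = orf_bp // 3
    # The stop codon occupies three bases but encodes no amino acid.
    return codons - 1 if includes_stop else codons


print(orf_protein_length(1512))  # 503
```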
Transcriptional mapping of the ribosomal RNA region of mouse L-cell mitochondrial DNA.
Nagley, P; Clayton, D A
1980-01-01
The map positions in mouse mitochondrial DNA of the two ribosomal RNA genes and adjacent genes coding several small transcripts have been determined precisely by application of a procedure in which DNA-RNA hybrids have been subjected to digestion by S1 nuclease under conditions of varying severity. Digestion of the DNA-RNA hybrids with S1 nuclease yielded a series of species which were shown to contain ribosomal RNA molecules together with adjacent transcripts hybridized conjointly to a continuous segment of mitochondrial DNA. There is one small transcript about 60 bases long whose gene adjoins the sequences coding the 5'-end of the small ribosomal RNA (950 bases) and which lies approximately 200 nucleotides from the D-loop origin of heavy strand mitochondrial DNA synthesis. An 80-base transcript lies between the small and large ribosomal RNA genes, and genes for two further short transcripts (each about 80 bases in length) abut the sequences coding the 3'-end of the large ribosomal RNA (approximately 1500 bases). The ability to isolate a discrete DNA-RNA hybrid species approximately 2700 base pairs in length containing all these transcripts suggests that there can be few nucleotides in this region of mouse mitochondrial DNA which are not represented as stable RNA species. PMID:6253898
Reamon-Buettner, Stella Marie; Borlak, Jürgen
2007-07-01
'Epigenetics' is a heritable phenomenon without change in primary DNA sequence. In recent years, this field has attracted much attention as more epigenetic controls of gene activities are being discovered. Such epigenetic controls ensue from an interplay of DNA methylation, histone modifications, and RNA-mediated pathways from non-coding RNAs, notably silencing RNA (siRNA) and microRNA (miRNA). Although epigenetic regulation is inherent to normal development and differentiation, this can be misdirected leading to a number of diseases including cancer. All the same, many of the processes can be reversed offering a hope for epigenetic therapies such as inhibitors of enzymes controlling epigenetic modifications, specifically DNA methyltransferases, histone deacetylases, and RNAi therapeutics. 'In utero' or early life exposures to dietary and environmental exposures can have a profound effect on our epigenetic code, the so-called 'epigenome', resulting in birth defects and diseases developed later in life. Indeed, examples are accumulating in which environmental exposures can be attributed to epigenetic causes, an encouraging edge towards greater understanding of the contribution of epigenetic influences of environmental exposures. Routine analysis of epigenetic modifications as part of the mechanisms of action of environmental contaminants is in order. There is, however, an explosion of research in the field of epigenetics and to keep abreast of these developments could be a challenge. In this paper, we provide an overview of epigenetic mechanisms focusing on recent reviews and studies to serve as an entry point into the realm of 'environmental epigenetics'.
T cells are influenced by a long non-coding RNA in the autoimmune associated PTPN2 locus.
Houtman, Miranda; Shchetynsky, Klementy; Chemin, Karine; Hensvold, Aase Haj; Ramsköld, Daniel; Tandre, Karolina; Eloranta, Maija-Leena; Rönnblom, Lars; Uebe, Steffen; Catrina, Anca Irinel; Malmström, Vivianne; Padyukov, Leonid
2018-06-01
Non-coding SNPs in the protein tyrosine phosphatase non-receptor type 2 (PTPN2) locus have been linked with several autoimmune diseases, including rheumatoid arthritis, type I diabetes, and inflammatory bowel disease. However, the functional consequences of these SNPs are poorly characterized. Herein, we show in blood cells that SNPs in the PTPN2 locus are highly correlated with DNA methylation levels at four CpG sites downstream of PTPN2 and expression levels of the long non-coding RNA (lncRNA) LINC01882 downstream of these CpG sites. We observed that LINC01882 is mainly expressed in T cells and that anti-CD3/CD28 activated naïve CD4+ T cells downregulate the expression of LINC01882. RNA sequencing analysis of LINC01882 knockdown in Jurkat T cells, using a combination of antisense oligonucleotides and RNA interference, revealed the upregulation of the transcription factor ZEB1 and kinase MAP2K4, both involved in IL-2 regulation. Overall, our data suggest the involvement of LINC01882 in T cell activation and hint towards an auxiliary role of these non-coding SNPs in autoimmunity associated with the PTPN2 locus. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.
Nguyen, Thong T; Suryamohan, Kushal; Kuriakose, Boney; Janakiraman, Vasantharajan; Reichelt, Mike; Chaudhuri, Subhra; Guillory, Joseph; Divakaran, Neethu; Rabins, P E; Goel, Ridhi; Deka, Bhabesh; Sarkar, Suman; Ekka, Preety; Tsai, Yu-Chih; Vargas, Derek; Santhosh, Sam; Mohan, Sangeetha; Chin, Chen-Shan; Korlach, Jonas; Thomas, George; Babu, Azariah; Seshagiri, Somasekar
2018-06-12
We sequenced the Hyposidra talaca NPV (HytaNPV) double-stranded circular DNA genome using PacBio single molecule sequencing technology. We found that the HytaNPV genome is 139,089 bp long with a GC content of 39.6%. It encodes 141 open reading frames (ORFs) including the 37 baculovirus core genes, 25 genes conserved among lepidopteran baculoviruses, 72 genes known in baculovirus, and 7 genes unique to the HytaNPV genome. It is a group II alphabaculovirus that codes for the F protein and lacks the gp64 gene found in group I alphabaculoviruses. Using RNA-seq, we confirmed the expression of the ORFs identified in the HytaNPV genome. Phylogenetic analysis showed HytaNPV to be closest to BusuNPV, SujuNPV and EcobNPV that infect other tea pests, Buzura suppressaria, Sucra jujuba, and Ectropis obliqua, respectively. We identified repeat elements and a conserved non-coding baculovirus element in the genome. Analysis of the putative promoter sequences identified motifs consistent with the temporal expression of the genes observed in the RNA-seq data.
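GC content, reported above as 39.6% for a 139,089-bp genome, is the percentage of bases that are G or C. The sketch below computes it for an arbitrary sequence string; the function name `gc_content` and the toy input are assumptions for illustration, and ambiguous bases such as N are excluded from the denominator as a design choice of this sketch.

```python
def gc_content(seq: str) -> float:
    """Percentage of unambiguous bases (A, C, G, T) that are G or C."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Toy sequence, not taken from the HytaNPV genome.
print(gc_content("ATGCATGCAT"))  # 40.0
```

For scale, 39.6% of 139,089 bp corresponds to roughly 55,000 G or C bases.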
DNA copy number changes define spatial patterns of heterogeneity in colorectal cancer
Mamlouk, Soulafa; Childs, Liam Harold; Aust, Daniela; Heim, Daniel; Melching, Friederike; Oliveira, Cristiano; Wolf, Thomas; Durek, Pawel; Schumacher, Dirk; Bläker, Hendrik; von Winterfeld, Moritz; Gastl, Bastian; Möhr, Kerstin; Menne, Andrea; Zeugner, Silke; Redmer, Torben; Lenze, Dido; Tierling, Sascha; Möbs, Markus; Weichert, Wilko; Folprecht, Gunnar; Blanc, Eric; Beule, Dieter; Schäfer, Reinhold; Morkel, Markus; Klauschen, Frederick; Leser, Ulf; Sers, Christine
2017-01-01
Genetic heterogeneity between and within tumours is a major factor determining cancer progression and therapy response. Here we examined DNA sequence and DNA copy-number heterogeneity in colorectal cancer (CRC) by targeted high-depth sequencing of 100 most frequently altered genes. In 97 samples, with primary tumours and matched metastases from 27 patients, we observe inter-tumour concordance for coding mutations; in contrast, gene copy numbers are highly discordant between primary tumours and metastases as validated by fluorescent in situ hybridization. To further investigate intra-tumour heterogeneity, we dissected a single tumour into 68 spatially defined samples and sequenced them separately. We identify evenly distributed coding mutations in APC and TP53 in all tumour areas, yet highly variable gene copy numbers in numerous genes. 3D morpho-molecular reconstruction reveals two clusters with divergent copy number aberrations along the proximal–distal axis indicating that DNA copy number variations are a major source of tumour heterogeneity in CRC. PMID:28120820
Saavedra-Lira, E; Pérez-Montfort, R
1994-05-16
We isolated three overlapping clones from a DNA genomic library of Entamoeba histolytica strain HM1:IMSS, whose translated nucleotide (nt) sequence shows similarities of 51, 48 and 47% with the amino acid (aa) sequences reported for the pyruvate phosphate dikinases from Bacteroides symbiosus, maize and Flaveria trinervia, respectively. The reading frame determined codes for a protein of 886 aa.
Phylogeographic Differentiation of Mitochondrial DNA in Han Chinese
Yao, Yong-Gang; Kong, Qing-Peng; Bandelt, Hans-Jürgen; Kivisild, Toomas; Zhang, Ya-Ping
2002-01-01
To characterize the mitochondrial DNA (mtDNA) variation in Han Chinese from several provinces of China, we have sequenced the two hypervariable segments of the control region and the segment spanning nucleotide positions 10171–10659 of the coding region, and we have identified a number of specific coding-region mutations by direct sequencing or restriction-fragment–length–polymorphism tests. This allows us to define new haplogroups (clades of the mtDNA phylogeny) and to dissect the Han mtDNA pool on a phylogenetic basis, which is a prerequisite for any fine-grained phylogeographic analysis, the interpretation of ancient mtDNA, or future complete mtDNA sequencing efforts. Some of the haplogroups under study differ considerably in frequencies across different provinces. The southernmost provinces show more pronounced contrasts in their regional Han mtDNA pools than the central and northern provinces. These and other features of the geographical distribution of the mtDNA haplogroups observed in the Han Chinese make an initial Paleolithic colonization from south to north plausible but would suggest subsequent migration events in China that mainly proceeded from north to south and east to west. Lumping together all regional Han mtDNA pools into one fictive general mtDNA pool or choosing one or two regional Han populations to represent all Han Chinese is inappropriate for prehistoric considerations as well as for forensic purposes or medical disease studies. PMID:11836649
Cocho, Germinal; Miramontes, Pedro; Mansilla, Ricardo; Li, Wentian
2014-12-01
We examine the relationship between exponential correlation functions and Markov models in a bacterial genome in detail. Despite the well known fact that Markov models generate sequences with correlation function that decays exponentially, simply constructed Markov models based on nearest-neighbor dimer (first-order), trimer (second-order), up to hexamer (fifth-order), and treating the DNA sequence as being homogeneous all fail to predict the value of exponential decay rate. Even reading-frame-specific Markov models (both first- and fifth-order) could not explain the fact that the exponential decay is very slow. Starting with the in-phase coding-DNA-sequence (CDS), we investigated correlation within a fixed-codon-position subsequence, and in artificially constructed sequences by packing CDSs with out-of-phase spacers, as well as altering CDS length distribution by imposing an upper limit. From these targeted analyses, we conclude that the correlation in the bacterial genomic sequence is mainly due to a mixing of heterogeneous statistics at different codon positions, and the decay of correlation is due to the possible out-of-phase between neighboring CDSs. There are also small contributions to the correlation from bases at the same codon position, as well as by non-coding sequences. These show that the seemingly simple exponential correlation functions in bacterial genome hide a complexity in correlation structure which is not suitable for a modeling by Markov chain in a homogeneous sequence. Other results include: use of the (absolute value) second largest eigenvalue to represent the 16 correlation functions and the prediction of a 10-11 base periodicity from the hexamer frequencies.
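The link between a first-order Markov model and an exponential decay rate can be sketched as follows. This is a deliberately simplified illustration, not the authors' procedure: it collapses the four bases to a two-state purine/pyrimidine alphabet (an assumption made here so that the second eigenvalue has a closed form), estimates the transition probabilities from dimer counts, and converts the second eigenvalue into a characteristic decay length. The function names are hypothetical.

```python
import math

def ry_states(seq):
    """Map a DNA string to purine (R) / pyrimidine (Y) states."""
    states = []
    for base in seq.upper():
        if base in "AG":
            states.append("R")
        elif base in "CT":
            states.append("Y")
        else:
            raise ValueError(f"unexpected symbol {base!r}")
    return states

def transition_probs(seq):
    """Estimate first-order transition probabilities from neighbouring pairs."""
    states = ry_states(seq)
    counts = {"R": {"R": 0, "Y": 0}, "Y": {"R": 0, "Y": 0}}
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = row["R"] + row["Y"]
        if total == 0:
            raise ValueError(f"state {a} is never followed by another base")
        probs[a] = {b: n / total for b, n in row.items()}
    return probs

def second_eigenvalue(probs):
    """A 2x2 row-stochastic matrix has eigenvalues 1 and (trace - 1)."""
    return probs["R"]["R"] + probs["Y"]["Y"] - 1.0

def decay_length(lam2):
    """Correlations of the chain decay as |lam2|**k, i.e. exp(-k / length)."""
    mag = abs(lam2)
    if mag == 0.0:
        return 0.0
    if mag >= 1.0:
        return math.inf
    return -1.0 / math.log(mag)

if __name__ == "__main__":
    lam2 = second_eigenvalue(transition_probs("AAAACCCC" * 10))
    print(lam2, decay_length(lam2))
```

Comparing a decay length predicted this way with the decay actually measured in a genome is the kind of check the abstract reports as failing for homogeneous Markov models; the full four-letter case would need a 4x4 matrix and a numerical eigenvalue solver.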
Evolutional dynamics of 45S and 5S ribosomal DNA in ancient allohexaploid Atropa belladonna.
Volkov, Roman A; Panchuk, Irina I; Borisjuk, Nikolai V; Hosiawa-Baranska, Marta; Maluszynska, Jolanta; Hemleben, Vera
2017-01-23
Polyploid hybrids represent a rich natural resource to study molecular evolution of plant genes and genomes. Here, we applied a combination of karyological and molecular methods to investigate chromosomal structure, molecular organization and evolution of ribosomal DNA (rDNA) in nightshade, Atropa belladonna (fam. Solanaceae), one of the oldest known allohexaploids among flowering plants. Because of their abundance and specific molecular organization (evolutionarily conserved coding regions linked to variable intergenic spacers, IGS), 45S and 5S rDNA are widely used in plant taxonomic and evolutionary studies. Molecular cloning and nucleotide sequencing of A. belladonna 45S rDNA repeats revealed a general structure characteristic of other Solanaceae species, and a very high sequence similarity of two length variants, with the only difference in number of short IGS subrepeats. These results combined with the detection of three pairs of 45S rDNA loci on separate chromosomes, presumably inherited from both tetraploid and diploid ancestor species, exemplify intensive sequence homogenization that led to substitution/elimination of rDNA repeats of one parent. Chromosome silver-staining revealed that only four out of six 45S rDNA sites are frequently transcriptionally active, demonstrating nucleolar dominance. For 5S rDNA, three size variants of repeats were detected, with the major class represented by repeats containing all functional IGS elements required for transcription, the intermediate size repeats containing partially deleted IGS sequences, and the short 5S repeats containing severe defects both in the IGS and coding sequences. While shorter variants demonstrate increased rate of base substitution, probably in their transition into pseudogenes, the functional 5S rDNA variants are nearly identical at the sequence level, pointing to their origin from a single parental species. Localization of the 5S rDNA genes on two chromosome pairs further supports uniparental inheritance from the tetraploid progenitor. The obtained molecular, cytogenetic and phylogenetic data demonstrate complex evolutionary dynamics of rDNA loci in allohexaploid species of Atropa belladonna. The high level of sequence unification revealed in 45S and 5S rDNA loci of this ancient hybrid species has seemingly been achieved by different molecular mechanisms.
Sequence of a cDNA encoding pancreatic preprosomatostatin-22.
Magazin, M; Minth, C D; Funckes, C L; Deschenes, R; Tavianini, M A; Dixon, J E
1982-01-01
We report the nucleotide sequence of a precursor to somatostatin that upon proteolytic processing may give rise to a hormone of 22 amino acids. The nucleotide sequence of a cDNA from the channel catfish (Ictalurus punctatus) encodes a precursor to somatostatin that is 105 amino acids (Mr, 11,500). The cDNA coding for somatostatin-22 consists of 36 nucleotides in the 5' untranslated region, 315 nucleotides that code for the precursor to somatostatin-22, 269 nucleotides at the 3' untranslated region, and a variable length of poly(A). The putative preprohormone contains a sequence of hydrophobic amino acids at the amino terminus that has the properties of a "signal" peptide. A connecting sequence of approximately 57 amino acids is followed by a single Arg-Arg sequence, which immediately precedes the hormone. Somatostatin-22 is homologous to somatostatin-14 in 7 of the 14 amino acids, including the Phe-Trp-Lys sequence. Hybridization selection of mRNA, followed by its translation in a wheat germ cell-free system, resulted in the synthesis of a single polypeptide having a molecular weight of approximately 10,000 as estimated on Na-DodSO4/polyacrylamide gels. PMID:6127673
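The length bookkeeping in this record can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the abstract; the helper name and constant names are hypothetical, and the per-residue mass is simply the quotient of the two reported figures.

```python
# Figures quoted in the abstract (poly(A) tail excluded, its length is variable).
UTR5_NT = 36          # 5' untranslated region, nucleotides
CDS_NT = 315          # nucleotides coding for the precursor
UTR3_NT = 269         # 3' untranslated region, nucleotides
PRECURSOR_AA = 105    # reported precursor length, amino acids
PRECURSOR_MR = 11500  # reported relative molecular mass

def codon_count(cds_nt):
    """Number of codons in a coding stretch; the length must be a multiple of 3."""
    if cds_nt % 3 != 0:
        raise ValueError("coding length is not a whole number of codons")
    return cds_nt // 3

if __name__ == "__main__":
    # 315 nt / 3 = 105 codons, matching the 105-residue precursor.
    print("codons:", codon_count(CDS_NT))
    # Transcript length without the poly(A) tail.
    print("cDNA length (nt):", UTR5_NT + CDS_NT + UTR3_NT)
    # Mean mass per residue implied by the reported Mr.
    print("Da per residue:", round(PRECURSOR_MR / PRECURSOR_AA, 1))
```

The codon count equals the stated 105 residues, so the 315-nucleotide figure evidently does not include a stop codon.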
Bäumlein, H; Wobus, U; Pustell, J; Kafatos, F C
1986-01-01
The field bean, Vicia faba L. var. minor, possesses two sub-families of 11 S legumin genes named A and B. We isolated from a genomic library a B-type gene (LeB4) and determined its primary DNA sequence. Gene LeB4 codes for a 484 amino acid residue prepropolypeptide, encompassing a signal peptide of 22 amino acid residues, an acidic, very hydrophilic alpha-chain of 281 residues and a basic, somewhat hydrophobic beta-chain of 181 residues. The latter two coding regions are immediately contiguous, but each is interrupted by a short intron. Type A legumin genes from soybean and pea are known to have introns in the same two positions, in addition to an extra intron (within the alpha-coding sequence). Sequence comparisons of legumin genes from these three plants revealed a highly conserved sequence element of at least 28 bp, centered at approximately 100 bp upstream of each cap site. The element is absent from the equivalent position of all non-legumin and other plant and fungal genes examined. We tentatively name this element "legumin box" and suggest that it may have a function in the regulation of legumin gene expression. PMID:3960730
Holland, M J; Holland, J P; Thill, G P; Jackson, K A
1981-02-10
Segments of yeast genomic DNA containing two enolase structural genes have been isolated by subculture cloning procedures using a cDNA hybridization probe synthesized from purified yeast enolase mRNA. Based on restriction endonuclease and transcriptional maps of these two segments of yeast DNA, each hybrid plasmid contains a region of extensive nucleotide sequence homology which forms hybrids with the cDNA probe. The DNA sequences which flank this homologous region in the two hybrid plasmids are nonhomologous indicating that these sequences are nontandemly repeated in the yeast genome. The complete nucleotide sequence of the coding as well as the flanking noncoding regions of these genes has been determined. The amino acid sequence predicted from one reading frame of both structural genes is extremely similar to that determined for yeast enolase (Chin, C. C. Q., Brewer, J. M., Eckard, E., and Wold, F. (1981) J. Biol. Chem. 256, 1370-1376), confirming that these isolated structural genes encode yeast enolase. The nucleotide sequences of the coding regions of the genes are approximately 95% homologous, and neither gene contains an intervening sequence. Codon utilization in the enolase genes follows the same biased pattern previously described for two yeast glyceraldehyde-3-phosphate dehydrogenase structural genes (Holland, J. P., and Holland, M. J. (1980) J. Biol. Chem. 255, 2596-2605). DNA blotting analysis confirmed that the isolated segments of yeast DNA are colinear with yeast genomic DNA and that there are two nontandemly repeated enolase genes per haploid yeast genome. The noncoding portions of the two enolase genes adjacent to the initiation and termination codons are approximately 70% homologous and contain sequences thought to be involved in the synthesis and processing of messenger RNA. Finally, there are regions of extensive homology between the two enolase structural genes and two yeast glyceraldehyde-3-phosphate dehydrogenase structural genes within the 5' noncoding portions of these glycolytic genes.
Kurtz, David T.; Feigelson, Philip
1977-01-01
A procedure is presented for the preparation of a 3H-labeled complementary DNA (cDNA) specific for the mRNA coding for α2u-globulin, a male rat liver protein under multihormonal control that represents approximately 1% of hepatic protein synthesis. Rat liver polysomes are incubated with monospecific rabbit antiserum to α2u-globulin, which binds to the nascent α2u-globulin chains on the polysomes. These antibody-polysome complexes are then adsorbed to goat antiserum to rabbit IgG that is covalently linked to p-aminobenzylcellulose. mRNA preparations are thus obtained that contain 30-40% α2u-globulin mRNA. A labeled cDNA is made to this α2u-globulin-enriched mRNA preparation by using RNA-dependent DNA polymerase (reverse transcriptase). To remove the non-α2u-globulin sequences, this cDNA preparation is hybridized to an RNA concentration × incubation time (R0t) of 1000 mol of ribonucleotide per liter × sec with female rat liver mRNA, which, though it shares the vast majority of mRNA sequences with male liver, contains no α2u-globulin mRNA sequences. The cDNA remaining single-stranded is isolated by hydroxylapatite chromatography and is shown to be specific for α2u-globulin mRNA by several criteria. Good correlation was found in all endocrine states studied between the hepatic level of α2u-globulin, the level of functional α2u-globulin mRNA as assayed in a wheat germ cell-free translational system, and the level of α2u-globulin mRNA sequences as measured by hybridization to the α2u-globulin cDNA. Thus, the hormonal control of hepatic α2u-globulin synthesis by sex steroids and thyroid hormone occurs through modulation of the cellular level of α2u-globulin mRNA sequences, presumably by hormonal control of transcriptive synthesis. PMID:73184
Recombinant Pinoresinol-Lariciresinol Reductase, Recombinant Dirigent Protein And Methods Of Use
Lewis, Norman G.; Davin, Laurence B.; Dinkova-Kostova, Albena T.; Fujita, Masayuki; Gang, David R.; Sarkanen, Simo; Ford, Joshua D.
2003-10-21
Dirigent proteins and pinoresinol/lariciresinol reductases have been isolated, together with cDNAs encoding dirigent proteins and pinoresinol/lariciresinol reductases. Accordingly, isolated DNA sequences are provided from source species Forsythia intermedia, Thuja plicata, Tsuga heterophylla, Eucommia ulmoides, Linum usitatissimum, and Schisandra chinensis, which code for the expression of dirigent proteins and pinoresinol/lariciresinol reductases. In other aspects, replicable recombinant cloning vehicles are provided which code for dirigent proteins or pinoresinol/lariciresinol reductases or for a base sequence sufficiently complementary to at least a portion of dirigent protein or pinoresinol/lariciresinol reductase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding dirigent protein or pinoresinol/lariciresinol reductase. Thus, systems and methods are provided for the recombinant expression of dirigent proteins and/or pinoresinol/lariciresinol reductases.
Deyashiki, Y; Ogasawara, A; Nakayama, T; Nakanishi, M; Miyabe, Y; Sato, K; Hara, A
1994-01-01
Human liver contains two dihydrodiol dehydrogenases, DD2 and DD4, associated with 3 alpha-hydroxysteroid dehydrogenase activity. We have raised polyclonal antibodies that cross-reacted with the two enzymes and isolated two 1.2 kb cDNA clones (C9 and C11) for the two enzymes from a human liver cDNA library using the antibodies. The clones of C9 and C11 contained coding sequences corresponding to 306 and 321 amino acid residues respectively, but lacked 5'-coding regions around the initiation codon. Sequence analyses of several peptides obtained by enzymic and chemical cleavages of the two purified enzymes verified that the C9 and C11 clones encoded DD2 and DD4 respectively, and further indicated that the sequence of DD2 had at least 16 additional residues upstream of the N-terminal sequence deduced from the cDNA. There was 82% amino acid sequence identity between the two enzymes, indicating that the enzymes are genetic isoenzymes. A computer-based comparison of the cDNAs of the isoenzymes with the DNA sequence database revealed that the nucleotide and amino acid sequences of DD2 and DD4 are virtually identical with those of human bile-acid binder and human chlordecone reductase cDNAs respectively. PMID:8172617
Systematic analysis and evolution of 5S ribosomal DNA in metazoans.
Vierna, J; Wehner, S; Höner zu Siederdissen, C; Martínez-Lage, A; Marz, M
2013-11-01
Several studies on 5S ribosomal DNA (5S rDNA) have been focused on a subset of the following features in mostly one organism: number of copies, pseudogenes, secondary structure, promoter and terminator characteristics, genomic arrangements, types of non-transcribed spacers and evolution. In this work, we systematically analyzed 5S rDNA sequence diversity in available metazoan genomes, and showed organism-specific and evolutionary-conserved features. Putatively functional sequences (12,766) from 97 organisms allowed us to identify general features of this multigene family in animals. Interestingly, we show that each mammal species has a highly conserved (housekeeping) 5S rRNA type and many variable ones. The genomic organization of 5S rDNA is still under debate. Here, we report the occurrence of several paralog 5S rRNA sequences in 58 of the examined species, and a flexible genome organization of 5S rDNA in animals. We found heterogeneous 5S rDNA clusters in several species, supporting the hypothesis of an exchange of 5S rDNA from one locus to another. A rather high degree of variation of upstream, internal and downstream putative regulatory regions appears to characterize metazoan 5S rDNA. We systematically studied the internal promoters and described three different types of termination signals, as well as variable distances between the coding region and the typical termination signal. Finally, we present a statistical method for detection of linkage among noncoding RNA (ncRNA) gene families. This method showed no evolutionary-conserved linkage among 5S rDNAs and any other ncRNA genes within Metazoa, even though we found 5S rDNA to be linked to various ncRNAs in several clades. PMID:23838690
Liu, Betty R.; Huang, Yue-Wern; Aronstam, Robert S.; Lee, Han-Jung
2016-01-01
Cell-penetrating peptides (CPPs) have been shown to deliver cargos, including protein, DNA, RNA, and nanomaterials, in fully active forms into live cells. Most of the CPP sequences in use today are based on non-native proteins that may be immunogenic. Here we demonstrate that the L5a CPP (RRWQW) from bovine lactoferricin (LFcin), stably and noncovalently complexed with plasmid DNA and prepared at an optimal nitrogen/phosphate ratio of 12, is able to efficiently enter into human lung cancer A549 cells. The L5a CPP delivered a plasmid containing the enhanced green fluorescent protein (EGFP) coding sequence that was subsequently expressed in cells, as revealed by real-time PCR and fluorescent microscopy at the mRNA and protein levels, respectively. Treatment with calcium chloride increased the level of gene expression, without affecting CPP-mediated transfection efficiency. Zeta-potential analysis revealed that positive electrostatic interactions of CPP/DNA complexes correlated with CPP-mediated transport. The L5a and L5a/DNA complexes were not cytotoxic. This biomimetic LFcin L5a represents one of the shortest effective CPPs and could be a promising lead peptide with lower immunogenicity for DNA delivery in gene therapy. PMID:26942714
Campo, Daniel; García-Vázquez, Eva
2012-01-01
The 5S rDNA is organized in the genome as tandemly repeated copies of a structural unit composed of a coding sequence plus a nontranscribed spacer (NTS). The coding region is highly conserved in evolution, whereas the NTS varies in both length and sequence. It has been proposed that 5S rRNA genes are members of a gene family that has arisen through concerted evolution. In this study, we describe the molecular organization and evolution of the 5S rDNA in the genera Lepidorhombus and Scophthalmus (Scophthalmidae) and compare it with the already known 5S rDNA of the very different genera Merluccius (Merluccidae) and Salmo (Salmoninae), to identify common structural elements or patterns for understanding 5S rDNA evolution in fish. High intra- and interspecific diversity within the 5S rDNA family in all the genera can be explained by a combination of duplications, deletions, and transposition events. Sequence blocks with high similarity in all the 5S rDNA members across species were identified for the four studied genera, with evidence of intense gene conversion within noncoding regions. We propose a model to explain the evolution of the 5S rDNA, in which the evolutionary units are blocks of nucleotides rather than the entire sequences or single nucleotides. This model implies a "two-speed" evolution: slow within blocks (homogenized by recombination) and fast within the gene family (diversified by duplications and deletions).
Szabóová, Dana; Bielik, Peter; Poláková, Silvia; Šoltys, Katarína; Jatzová, Katarína; Szemes, Tomáš
2017-01-01
Yeasts of the genus Saccharomyces are widely used to test ecological and evolutionary hypotheses. A large number of nuclear genomic DNA sequences are available, but mitochondrial genomic data are insufficient. We completed mitochondrial DNA (mtDNA) sequencing from Illumina MiSeq reads for all Saccharomyces species. All are circularly mapped molecules decreasing in size with phylogenetic distance from Saccharomyces cerevisiae but with similar gene content including regulatory and selfish elements like origins of replication, introns, free-standing open reading frames or GC clusters. Their most profound feature is species-specific alteration in gene order. The genetic code differs slightly from the well-established yeast mitochondrial code, as GUG is used rarely as the translation start and CGA and CGC code for arginine. The multilocus phylogeny, inferred from mtDNA, does not correlate with the trees derived from nuclear genes. mtDNA data demonstrate that Saccharomyces cariocanus should be assigned as a separate species and Saccharomyces bayanus CBS 380T should not be considered as a distinct species due to mtDNA nearly identical to Saccharomyces uvarum mtDNA. Apparently, comparison of mtDNAs should not be neglected in genomic studies as it is an important tool to understand the origin and evolutionary history of some yeast species. PMID:28992063
Mitochondrial DNA of Vitis vinifera and the issue of rampant horizontal gene transfer.
Goremykin, Vadim V; Salamini, Francesco; Velasco, Riccardo; Viola, Roberto
2009-01-01
The mitochondrial genome of grape (Vitis vinifera), the largest organelle genome sequenced so far, is presented. The genome is 773,279 nt long and has the highest coding capacity among known angiosperm mitochondrial DNAs (mtDNAs). The proportion of promiscuous DNA of plastid origin in the genome is also the largest ever reported for an angiosperm mtDNA, both in absolute and relative terms. In all, 42.4% of chloroplast genome of Vitis has been incorporated into its mitochondrial genome. In order to test if horizontal gene transfer (HGT) has also contributed to the gene content of the grape mtDNA, we built phylogenetic trees with the coding sequences of mitochondrial genes of grape and their homologs from plant mitochondrial genomes. Many incongruent gene tree topologies were obtained. However, the extent of incongruence between these gene trees is not significantly greater than that observed among optimal trees for chloroplast genes, the common ancestry of which has never been in doubt. In both cases, we attribute this incongruence to artifacts of tree reconstruction, insufficient numbers of characters, and gene paralogy. This finding leads us to question the recent phylogenetic interpretation of Bergthorsson et al. (2003, 2004) and Richardson and Palmer (2007) that rampant HGT into the mtDNA of Amborella best explains phylogenetic incongruence between mitochondrial gene trees for angiosperms. The only evidence for HGT into the Vitis mtDNA found involves fragments of two coding sequences stemming from two closteroviruses that cause the leaf roll disease of this plant. We also report that analysis of sequences shared by both chloroplast and mitochondrial genomes provides evidence for a previously unknown gene transfer route from the mitochondrion to the chloroplast.
Balintová, Jana; Plucnara, Medard; Vidláková, Pavlína; Pohl, Radek; Havran, Luděk; Fojta, Miroslav; Hocek, Michal
2013-09-16
Benzofurazane has been attached to nucleosides and dNTPs, either directly or through an acetylene linker, as a new redox label for electrochemical analysis of nucleotide sequences. Primer extension incorporation of the benzofurazane-modified dNTPs by polymerases has been developed for the construction of labeled oligonucleotide probes. In combination with nitrophenyl and aminophenyl labels, we have successfully developed a three-potential coding of DNA bases and have explored the relevant electrochemical potentials. The combination of benzofurazane and nitrophenyl reducible labels has proved to be excellent for ratiometric analysis of nucleotide sequences and is suitable for bioanalytical applications.
Machine Learned Replacement of N-Labels for Basecalled Sequences in DNA Barcoding.
Ma, Eddie Y T; Ratnasingham, Sujeevan; Kremer, Stefan C
2018-01-01
This study presents a machine learning method that increases the number of identified bases in Sanger Sequencing. The system post-processes a KB basecalled chromatogram. It selects a recoverable subset of N-labels in the KB-called chromatogram to replace with basecalls (A,C,G,T). An N-label correction is defined given an additional read of the same sequence, and a human finished sequence. Corrections are added to the dataset when an alignment determines the additional read and human agree on the identity of the N-label. KB must also rate the replacement with quality value of in the additional read. Corrections are only available during system training. Developing the system, nearly 850,000 N-labels are obtained from Barcode of Life Datasystems, the premier database of genetic markers called DNA Barcodes. Increasing the number of correct bases improves reference sequence reliability, increases sequence identification accuracy, and assures analysis correctness. Keeping with barcoding standards, our system maintains an error rate of percent. Our system only applies corrections when it estimates low rate of error. Tested on this data, our automation selects and recovers: 79 percent of N-labels from COI (animal barcode); 80 percent from matK and rbcL (plant barcodes); and 58 percent from non-protein-coding sequences (across eukaryotes).
Harper, J R; Prince, J T; Healy, P A; Stuart, J K; Nauman, S J; Stallcup, W B
1991-03-01
We have isolated cDNA clones coding for the human homologue of the neuronal cell adhesion molecule L1. The nucleotide sequence of the cDNA clones and the deduced primary amino acid sequence of the carboxy terminal portion of the human L1 are homologous to the corresponding sequences of mouse L1 and rat NILE glycoprotein, with an especially high sequence identity in the cytoplasmic regions of the proteins. There is also protein sequence homology with the cytoplasmic region of the Drosophila cell adhesion molecule, neuroglian. The conservation of the cytoplasmic domain argues for an important functional role for this portion of the molecule.
Gomes, S L; Gober, J W; Shapiro, L
1990-01-01
Caulobacter crescentus has a single dnaK gene that is highly homologous to the hsp70 family of heat shock genes. Analysis of the cloned and sequenced dnaK gene has shown that the deduced amino acid sequence could encode a protein of 67.6 kilodaltons that is 68% identical to the DnaK protein of Escherichia coli and 49% identical to the Drosophila and human hsp70 protein family. A partial open reading frame 165 base pairs 3' to the end of dnaK encodes a peptide of 190 amino acids that is 59% identical to DnaJ of E. coli. Northern blot analysis revealed a single 4.0-kilobase mRNA homologous to the cloned fragment. Since the dnaK coding region is 1.89 kilobases, dnaK and dnaJ may be transcribed as a polycistronic message. S1 mapping and primer extension experiments showed that transcription initiated at two sites 5' to the dnaK coding sequence. A single start site of transcription was identified during heat shock at 42 degrees C, and the predicted promoter sequence conformed to the consensus heat shock promoters of E. coli. At normal growth temperature (30 degrees C), a different start site was identified 3' to the heat shock start site that conformed to the E. coli sigma 70 promoter consensus sequence. S1 protection assays and analysis of expression of the dnaK gene fused to the lux transcription reporter gene showed that expression of dnaK is temporally controlled under normal physiological conditions and that transcription occurs just before the initiation of DNA replication. Thus, in both human cells (I. K. L. Milarski and R. I. Morimoto, Proc. Natl. Acad. Sci. USA 83:9517-9521, 1986) and in a simple bacterium, the transcription of a hsp70 gene is temporally controlled as a function of the cell cycle under normal growth conditions. PMID:2345134
Croteau, Rodney Bruce; Crock, John E.
2005-01-25
A cDNA encoding (E)-.beta.-farnesene synthase from peppermint (Mentha piperita) has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID NO:1) is provided which codes for the expression of (E)-.beta.-farnesene synthase (SEQ ID NO:2), from peppermint (Mentha piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for (E)-.beta.-farnesene synthase, or for a base sequence sufficiently complementary to at least a portion of (E)-.beta.-farnesene synthase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding (E)-.beta.-farnesene synthase. Thus, systems and methods are provided for the recombinant expression of the aforementioned recombinant (E)-.beta.-farnesene synthase that may be used to facilitate its production, isolation and purification in significant amounts. Recombinant (E)-.beta.-farnesene synthase may be used to obtain expression or enhanced expression of (E)-.beta.-farnesene synthase in plants in order to enhance the production of (E)-.beta.-farnesene, or may be otherwise employed for the regulation or expression of (E)-.beta.-farnesene synthase, or the production of its product.
Kapil, Aditi; Rai, Piyush Kant; Shanker, Asheesh
2014-01-01
Simple sequence repeats (SSRs) are regions in DNA sequence that contain repeating motifs of length 1–6 nucleotides. These repeats are ubiquitously present and are found in both coding and non-coding regions of the genome. A total of 534 complete chloroplast genome sequences (as of 18 September 2014) of Viridiplantae are available at the NCBI organelle genome resource. This provides an opportunity to mine these genomes for the detection of SSRs and store them in the form of a database. In an attempt to properly manage and retrieve chloroplastic SSRs, we designed ChloroSSRdb, which is a relational database developed using SQL Server 2008 and accessed through ASP.NET. It provides information on all three types (perfect, imperfect and compound) of SSRs. At present, ChloroSSRdb contains 124 430 mined SSRs, with the majority lying in non-coding regions. Out of these, PCR primers were designed for 118 249 SSRs. Tetranucleotide repeats (47 079) were found to be the most frequent repeat type, whereas hexanucleotide repeats (6414) were the least abundant. Additionally, in each species statistical analyses were performed to calculate relative frequency, correlation coefficient and chi-square statistics of perfect and imperfect SSRs. In accordance with the growing interest in SSR studies, ChloroSSRdb will prove to be a useful resource in developing genetic markers, phylogenetic analysis, genetic mapping, etc. Moreover, it will serve as a ready reference for mined SSRs in available chloroplast genomes of green plants. Database URL: www.compubio.in/chlorossrdb/ PMID:25380781
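As an illustration of what mining perfect SSRs with motifs of 1 to 6 nucleotides can look like, here is a minimal Python sketch. It is not the ChloroSSRdb pipeline: the function name and the minimum copy numbers per motif length are hypothetical choices made for this example, and imperfect and compound SSRs are not handled.

```python
import re

def find_perfect_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, copies) tuples for perfect SSRs, motif length 1-6."""
    # Hypothetical thresholds (motif length -> minimum copies); not taken from the abstract.
    if min_repeats is None:
        min_repeats = {1: 10, 2: 6, 3: 5, 4: 4, 5: 3, 6: 3}
    seq = seq.upper()
    hits = []
    for k, n in min_repeats.items():
        # A motif of length k followed by at least n-1 further exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

example = "GG" + "AT" * 6 + "CC" + "A" * 10 + "G"
print(find_perfect_ssrs(example))
```

Start positions are 0-based. A real miner would also merge neighbouring repeats into compound SSRs and tolerate mismatches for imperfect ones.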
Tackett, Alan J.; Corey, David R.; Raney, Kevin D.
2002-01-01
Peptide nucleic acid (PNA) is a DNA mimic in which the nucleobases are linked by an N-(2-aminoethyl) glycine backbone. Here we report that PNA can interact with single-stranded DNA (ssDNA) in a non-sequence-specific fashion. We observed that a 15mer PNA inhibited the ssDNA-stimulated ATPase activity of a bacteriophage T4 helicase, Dda. Surprisingly, when a fluorescein-labeled 15mer PNA was used in binding studies no interaction was observed between PNA and Dda. However, fluorescence polarization did reveal non-sequence-specific interactions between PNA and ssDNA. Thus, the inhibition of ATPase activity of Dda appears to result from depletion of the available ssDNA due to non-Watson–Crick binding of PNA to ssDNA. Inhibition of the ssDNA-stimulated ATPase activity was observed for several PNAs of varying length and sequence. To study the basis for this phenomenon, we examined self-aggregation by PNAs. The 15mer PNA readily self-aggregates to the point of precipitation. Since PNAs are hydrophobic, they aggregate more than DNA or RNA, making the study of this phenomenon essential for understanding the properties of PNA. Non-sequence-specific interactions between PNA and ssDNA were observed at moderate concentrations of PNA, suggesting that such interactions should be considered for antisense and antigene applications. PMID:11842106
The cDNA-derived amino acid sequence of hemoglobin II from Lucina pectinata.
Torres-Mercado, Elineth; Renta, Jessicca Y; Rodríguez, Yolanda; López-Garriga, Juan; Cadilla, Carmen L
2003-11-01
Hemoglobin II from the clam Lucina pectinata is an oxygen-reactive protein with a unique structural organization in the heme pocket involving residues Gln65 (E7), Tyr30 (B10), Phe44 (CD1), and Phe69 (E11). We employed the reverse transcriptase-polymerase chain reaction (RT-PCR) and methods to synthesize various cDNA(HbII). An initial 300-bp cDNA clone was amplified from total RNA by RT-PCR using degenerate oligonucleotides. Gene-specific primers derived from the HbII-partial cDNA sequence were used to obtain the 5' and 3' ends of the cDNA by RACE. The length of the HbII cDNA, estimated from overlapping clones, was approximately 2114 bases. Northern blot analysis revealed that the mRNA size of HbII agrees with the estimated size using cDNA data. The coding region of the full-length HbII cDNA codes for 151 amino acids. The calculated molecular weight of HbII, including the heme group and acetylated N-terminal residue, is 17,654.07 Da.
The emerging role of epigenetics in rheumatic diseases.
Gay, Steffen; Wilson, Anthony G
2014-03-01
Epigenetics is a key mechanism regulating the expression of genes. There are three main and interrelated mechanisms: DNA methylation, post-translational modification of histone proteins and non-coding RNA. Gene activation is generally associated with lower levels of DNA methylation in promoters and with distinct histone marks such as acetylation of amino acids in histones. Unlike the genetic code, the epigenome is altered by endogenous (e.g. hormonal) and environmental (e.g. diet, exercise) factors and changes with age. Recent evidence implicates epigenetic mechanisms in the pathogenesis of common rheumatic disease, including RA, OA, SLE and scleroderma. Epigenetic drift has been implicated in age-related changes in the immune system that result in the development of a pro-inflammatory status termed inflammageing, potentially increasing the risk of age-related conditions such as polymyalgia rheumatica. Therapeutic targeting of the epigenome has shown promise in animal models of rheumatic diseases. Rapid advances in computational biology and DNA sequencing technology will lead to a more comprehensive understanding of the roles of epigenetics in the pathogenesis of common rheumatic diseases.
Living Organisms Author Their Read-Write Genomes in Evolution
2017-01-01
Evolutionary variations generating phenotypic adaptations and novel taxa resulted from complex cellular activities altering genome content and expression: (i) Symbiogenetic cell mergers producing the mitochondrion-bearing ancestor of eukaryotes and chloroplast-bearing ancestors of photosynthetic eukaryotes; (ii) interspecific hybridizations and genome doublings generating new species and adaptive radiations of higher plants and animals; and, (iii) interspecific horizontal DNA transfer encoding virtually all of the cellular functions between organisms and their viruses in all domains of life. Consequently, assuming that evolutionary processes occur in isolated genomes of individual species has become an unrealistic abstraction. Adaptive variations also involved natural genetic engineering of mobile DNA elements to rewire regulatory networks. In the most highly evolved organisms, biological complexity scales with “non-coding” DNA content more closely than with protein-coding capacity. Coincidentally, we have learned how so-called “non-coding” RNAs that are rich in repetitive mobile DNA sequences are key regulators of complex phenotypes. Both biotic and abiotic ecological challenges serve as triggers for episodes of elevated genome change. The intersections of cell activities, biosphere interactions, horizontal DNA transfers, and non-random Read-Write genome modifications by natural genetic engineering provide a rich molecular and biological foundation for understanding how ecological disruptions can stimulate productive, often abrupt, evolutionary transformations. PMID:29211049
Lindsay, Cameron; Seikaly, Hadi; Biron, Vincent L
2017-01-31
Epigenetic modifications are heritable changes in gene expression that do not directly alter DNA sequence. These modifications include DNA methylation, histone post-translational modifications, small and non-coding RNAs. Alterations in epigenetic profiles cause deregulation of fundamental gene expression pathways associated with carcinogenesis. The role of epigenetics in oropharyngeal squamous cell carcinoma (OPSCC) has recently been recognized, with implications for novel biomarkers, molecular diagnostics and chemotherapeutics. In this review, important epigenetic pathways in human papillomavirus (HPV) positive and negative OPSCC are summarized, as well as the potential clinical utility of this knowledge.
Cloning of human prourokinase cDNA without the signal peptide and expression in Escherichia coli.
Hu, B; Li, J; Yu, W; Fang, J
1993-01-01
Human prourokinase (pro-UK) cDNA without the signal peptide was obtained using synthetic oligonucleotide and DNA recombination techniques and was successfully expressed in E. coli. The plasmid pMMUK which contained pro-UK cDNA (including both the entire coding sequence and the sequence for signal peptide) was digested with Hind III and PstI, so that the N-terminal 371-bp fragment could be recovered. A 304-bp fragment was collected from the 371-bp fragment after partial digestion with Fnu4HI in order to remove the signal peptide sequence. An intermediate plasmid was formed after this 304-bp fragment and the synthetic oligonucleotide was ligated with pUC18. Correctness of the ligation was confirmed by enzyme digestion and sequencing. By joining the PstI-PstI fragment of pro-UK to the plasmid we obtained the final plasmid which contained the entire coding sequence of pro-UK without the signal peptide. The coding sequence with correct orientation was inserted into pBV220 under the control of the temperature-induced promoter PRPL, and mature pro-UK was expressed in E. coli at 42 degrees C. Both sonicated supernatant and inclusion bodies of the bacterial host JM101 showed positive results by ELISA and FAPA assays. After renaturation, the biological activity of the expressed product was increased from 500-1000IU/L to about 60,000IU/L. The bacterial pro-UK showed a molecular weight of about 47,000 daltons by Western blot analysis. It can be completely inhibited by UK antiserum but not by t-PA antiserum nor by normal rabbit serum.
Effects of Replication and Transcription on DNA Structure-Related Genetic Instability.
Wang, Guliang; Vasquez, Karen M
2017-01-05
Many repetitive sequences in the human genome can adopt conformations that differ from the canonical B-DNA double helix (i.e., non-B DNA), and can impact important biological processes such as DNA replication, transcription, recombination, telomere maintenance, viral integration, transposome activation, DNA damage and repair. Thus, non-B DNA-forming sequences have been implicated in genetic instability and disease development. In this article, we discuss the interactions of non-B DNA with the replication and/or transcription machinery, particularly in disease states (e.g., tumors) that can lead to an abnormal cellular environment, and how such interactions may alter DNA replication and transcription, leading to potential conflicts at non-B DNA regions, and eventually result in genetic instability and human disease. PMID:28067787
Organizational heterogeneity of vertebrate genomes.
Frenkel, Svetlana; Kirzhner, Valery; Korol, Abraham
2012-01-01
Genomes of higher eukaryotes are mosaics of segments with various structural, functional, and evolutionary properties. The availability of whole-genome sequences allows the investigation of their structure as "texts" using different statistical and computational methods. One such method, referred to as Compositional Spectra (CS) analysis, is based on scoring the occurrences of fixed-length oligonucleotides (k-mers) in the target DNA sequence. CS analysis allows generating species- or region-specific characteristics of the genome, regardless of their length and the presence of coding DNA. In this study, we consider the heterogeneity of vertebrate genomes as a joint effect of regional variation in sequence organization superimposed on the differences in nucleotide composition. We estimated compositional and organizational heterogeneity of genome and chromosome sequences separately and found that both heterogeneity types vary widely among genomes as well as among chromosomes in all investigated taxonomic groups. The high correspondence of heterogeneity scores obtained on three genome fractions, coding, repetitive, and the remaining part of the noncoding DNA (the genome dark matter--GDM) allows the assumption that CS-heterogeneity may have functional relevance to genome regulation. Of special interest for such interpretation is the fact that natural GDM sequences display the highest deviation from the corresponding reshuffled sequences.
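To make the idea of scoring fixed-length oligonucleotide (k-mer) occurrences concrete, here is a minimal Python sketch. It is a simplification under stated assumptions: it counts exact matches over all 4^k words and compares two spectra with a Manhattan distance, whereas the published CS method may differ in details such as the word set and the matching rule. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

def kmer_spectrum(seq, k=3):
    """Relative frequency of every k-mer over A, C, G, T, using overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored in the normalisation.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]

def spectrum_distance(a, b):
    """Manhattan distance between two spectra of equal length (an illustrative choice)."""
    return sum(abs(x - y) for x, y in zip(a, b))

s1 = kmer_spectrum("ACGTACGTACGT", k=2)
s2 = kmer_spectrum("AAAAAAAAAAAA", k=2)
print(spectrum_distance(s1, s2))
```

Computing such spectra for successive segments of a chromosome and comparing them pairwise gives one simple view of within-genome heterogeneity.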
Guo, Y C; Wang, H; Wu, H P; Zhang, M Q
2015-12-21
To address the large mean square error (MSE) and slow convergence of the constant modulus algorithm (CMA) when equalizing multi-modulus signals, a multi-modulus algorithm (MMA) based on global artificial fish swarm (GAFS) intelligent optimization of DNA encoding sequences (GAFS-DNA-MMA) was proposed. To improve the convergence rate and reduce the MSE, the proposed algorithm adopted an encoding method based on DNA nucleotide chains to represent candidate solutions. Furthermore, the GAFS algorithm, with its fast convergence and global search ability, was used to find the best sequence. The real and imaginary parts of the initial optimal weight vector of the MMA were obtained by decoding the best sequence. The simulation results show that the proposed algorithm has a faster convergence speed and smaller MSE in comparison with the CMA, the MMA, and the AFS-DNA-MMA.
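For readers unfamiliar with the two baseline equalizers, the following Python sketch shows the textbook error terms of CMA and MMA and a stochastic-gradient weight update. It does not implement the GAFS or DNA-encoding initialisation described in the abstract; the function names, the convention y = sum(w_i * x_i), and the dispersion constants passed as parameters are assumptions made for this example.

```python
def equalizer_output(w, x):
    """Linear equalizer output y = sum(w_i * x_i) for complex taps w and samples x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def cma_error(y, r2):
    """CMA error: penalises deviation of |y|^2 from a single modulus constant r2."""
    return y * (abs(y) ** 2 - r2)

def mma_error(y, r_real, r_imag):
    """MMA error: treats the real and imaginary parts of y separately."""
    return complex(y.real * (y.real ** 2 - r_real),
                   y.imag * (y.imag ** 2 - r_imag))

def update_weights(w, x, e, mu):
    """One stochastic-gradient step: w <- w - mu * e * conj(x)."""
    return [wi - mu * e * xi.conjugate() for wi, xi in zip(w, x)]

# One-tap toy example: the output 2+0j is too large for modulus 1, so the tap shrinks.
w = [2 + 0j]
x = [1 + 0j]
y = equalizer_output(w, x)
w = update_weights(w, x, cma_error(y, 1.0), mu=0.01)
print(w)
```

Because both updates are gradient descents on non-convex costs, the starting weight vector matters, which is the part the abstract says is chosen by the swarm search.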
King, Brian R; Aburdene, Maurice; Thompson, Alex; Warres, Zach
2014-01-01
Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success for detection of coding regions in a gene. Recently, these methods are being used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence that is used to generate relative measures of similarity among DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.
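The abstract does not specify the ICD transformation, so the sketch below illustrates only the classic DSP feature it alludes to for coding-region detection: the spectral power at period 3 of the four binary base-indicator sequences. The function name is hypothetical, and the discrete Fourier sum is evaluated directly at frequency 1/3 rather than through an FFT.

```python
import cmath

def period3_power(seq):
    """Sum over A, C, G, T of |DFT at frequency 1/3|^2 of each base's indicator sequence."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # Indicator sequence is 1 where seq[n] == base; only those terms contribute.
        acc = sum(cmath.exp(-2j * cmath.pi * n / 3)
                  for n, ch in enumerate(seq) if ch == base)
        total += abs(acc) ** 2
    return total

print(period3_power("ATG" * 10))   # strongly codon-periodic toy sequence
print(period3_power("ACGT" * 6))   # period-4 toy sequence, no period-3 component
```

In practice the statistic is computed in a sliding window along a gene, and windows with high values are taken as candidate coding regions.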
Tang, Songsong; Gu, Yuan; Lu, Huiting; Dong, Haifeng; Zhang, Kai; Dai, Wenhao; Meng, Xiangdan; Yang, Fan; Zhang, Xueji
2018-04-03
Herein, a highly-sensitive microRNA (miRNA) detection strategy was developed by combining bio-bar-code assay (BBA) with catalytic hairpin assembly (CHA). In the proposed system, two nanoprobes of magnetic nanoparticles functionalized with DNA probes (MNPs-DNA) and gold nanoparticles with numerous barcode DNA (AuNPs-DNA) were designed. In the presence of target miRNA, the MNP-DNA and AuNP-DNA hybridized with target miRNA to form a "sandwich" structure. After "sandwich" structures were separated from the solution by the magnetic field and dehybridized by high temperature, the barcode DNA sequences were released by dissolving AuNPs. The released barcode DNA sequences triggered the toehold strand displacement assembly of two hairpin probes, leading to recycling of barcode DNA sequences and producing numerous fluorescent CHA products for miRNA detection. Under the optimal experimental conditions, the proposed two-stage amplification system could sensitively detect target miRNA ranging from 10 pM to 10 aM with a limit of detection (LOD) down to 97.9 zM. It displayed good capability to discriminate single-base and three-base mismatches due to the unique sandwich structure. Notably, it presented good feasibility for selective multiplexed detection of various combinations of synthetic miRNA sequences and miRNAs extracted from different cell lysates, which were in agreement with the traditional polymerase chain reaction analysis. The two-stage amplification strategy may have significant implications for biological detection and clinical diagnosis. Copyright © 2017 Elsevier B.V. All rights reserved.
Nucleic acid molecules encoding isopentenyl monophosphate kinase, and methods of use
Croteau, Rodney B.; Lange, Bernd M.
2001-01-01
A cDNA encoding isopentenyl monophosphate kinase (IPK) from peppermint (Mentha x piperita) has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID NO:1) is provided which codes for the expression of isopentenyl monophosphate kinase (SEQ ID NO:2), from peppermint (Mentha x piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for isopentenyl monophosphate kinase, or for a base sequence sufficiently complementary to at least a portion of isopentenyl monophosphate kinase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding isopentenyl monophosphate kinase. Thus, systems and methods are provided for the recombinant expression of the aforementioned recombinant isopentenyl monophosphate kinase that may be used to facilitate its production, isolation and purification in significant amounts. Recombinant isopentenyl monophosphate kinase may be used to obtain expression or enhanced expression of isopentenyl monophosphate kinase in plants in order to enhance the production of isopentenyl monophosphate kinase, or isoprenoids derived therefrom, or may be otherwise employed for the regulation or expression of isopentenyl monophosphate kinase, or the production of its products.
Chen, Zhi-Teng; Du, Yu-Zhou
2015-03-01
The complete mitochondrial genome of the stonefly, Sweltsa longistyla Wu (Plecoptera: Chloroperlidae), was sequenced in this study. The mitogenome of S. longistyla is 16,151bp and contains 37 genes including 13 protein-coding genes (PCGs), 22 tRNA genes, two rRNA genes, and a large non-coding region. S. longistyla, Pteronarcys princeps Banks, Kamimuria wangi Du and Cryptoperla stilifera Sivec belong to the Plecoptera, and the gene order and orientation of their mitogenomes were similar. The overall AT content for the four stoneflies was below 72%, and the AT content of tRNA genes was above 69%. The four genomes were compact and contained only 65-127bp of non-coding intergenic DNAs. Overlapping nucleotides existed in all four genomes and ranged from 24 (P. princeps) to 178bp (K. wangi). There was a 7-bp motif ('ATGATAA') of overlapping DNA and an 8-bp motif (AAGCCTTA) conserved in three stonefly species (P. princeps, K. wangi and C. stilifera). The control regions of four stoneflies contained a stem-loop structure. Four conserved sequence blocks (CSBs) were present in the A+T-rich regions of all four stoneflies. Copyright © 2014 Elsevier B.V. All rights reserved.
Silicene nanoribbon as a new DNA sequencing device
NASA Astrophysics Data System (ADS)
Alesheikh, Sara; Shahtahmassebi, Nasser; Roknabadi, Mahmood Rezaee; Pilevar Shahri, Raheleh
2018-02-01
The importance of DNA sequencing in different fields motivates the search for fast and cheap methods. Nanotechnology supports this development by introducing nanostructures that can be used for DNA sequencing. In this work we study the interaction between zigzag silicene nanoribbons and DNA nucleobases using DFT and the non-equilibrium Green's function approach, to investigate the possibility of using zigzag silicene nanoribbons as a biosensor for DNA sequencing.
Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones
Imanishi, Tadashi; Itoh, Takeshi; Suzuki, Yutaka; O'Donovan, Claire; Fukuchi, Satoshi; Koyanagi, Kanako O; Barrero, Roberto A; Tamura, Takuro; Yamaguchi-Kabata, Yumi; Tanino, Motohiko; Yura, Kei; Miyazaki, Satoru; Ikeo, Kazuho; Homma, Keiichi; Kasprzyk, Arek; Nishikawa, Tetsuo; Hirakawa, Mika; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Ashurst, Jennifer; Jia, Libin; Nakao, Mitsuteru; Thomas, Michael A; Mulder, Nicola; Karavidopoulou, Youla; Jin, Lihua; Kim, Sangsoo; Yasuda, Tomohiro; Lenhard, Boris; Eveno, Eric; Suzuki, Yoshiyuki; Yamasaki, Chisato; Takeda, Jun-ichi; Gough, Craig; Hilton, Phillip; Fujii, Yasuyuki; Sakai, Hiroaki; Tanaka, Susumu; Amid, Clara; Bellgard, Matthew; Bonaldo, Maria de Fatima; Bono, Hidemasa; Bromberg, Susan K; Brookes, Anthony J; Bruford, Elspeth; Carninci, Piero; Chelala, Claude; Couillault, Christine; de Souza, Sandro J.; Debily, Marie-Anne; Devignes, Marie-Dominique; Dubchak, Inna; Endo, Toshinori; Estreicher, Anne; Eyras, Eduardo; Fukami-Kobayashi, Kaoru; R. 
Gopinath, Gopal; Graudens, Esther; Hahn, Yoonsoo; Han, Michael; Han, Ze-Guang; Hanada, Kousuke; Hanaoka, Hideki; Harada, Erimi; Hashimoto, Katsuyuki; Hinz, Ursula; Hirai, Momoki; Hishiki, Teruyoshi; Hopkinson, Ian; Imbeaud, Sandrine; Inoko, Hidetoshi; Kanapin, Alexander; Kaneko, Yayoi; Kasukawa, Takeya; Kelso, Janet; Kersey, Paul; Kikuno, Reiko; Kimura, Kouichi; Korn, Bernhard; Kuryshev, Vladimir; Makalowska, Izabela; Makino, Takashi; Mano, Shuhei; Mariage-Samson, Regine; Mashima, Jun; Matsuda, Hideo; Mewes, Hans-Werner; Minoshima, Shinsei; Nagai, Keiichi; Nagasaki, Hideki; Nagata, Naoki; Nigam, Rajni; Ogasawara, Osamu; Ohara, Osamu; Ohtsubo, Masafumi; Okada, Norihiro; Okido, Toshihisa; Oota, Satoshi; Ota, Motonori; Ota, Toshio; Otsuki, Tetsuji; Piatier-Tonneau, Dominique; Poustka, Annemarie; Ren, Shuang-Xi; Saitou, Naruya; Sakai, Katsunaga; Sakamoto, Shigetaka; Sakate, Ryuichi; Schupp, Ingo; Servant, Florence; Sherry, Stephen; Shiba, Rie; Shimizu, Nobuyoshi; Shimoyama, Mary; Simpson, Andrew J; Soares, Bento; Steward, Charles; Suwa, Makiko; Suzuki, Mami; Takahashi, Aiko; Tamiya, Gen; Tanaka, Hiroshi; Taylor, Todd; Terwilliger, Joseph D; Unneberg, Per; Veeramachaneni, Vamsi; Watanabe, Shinya; Wilming, Laurens; Yasuda, Norikazu; Yoo, Hyang-Sook; Stodolsky, Marvin; Makalowski, Wojciech; Go, Mitiko; Nakai, Kenta; Takagi, Toshihisa; Kanehisa, Minoru; Sakaki, Yoshiyuki; Quackenbush, John; Okazaki, Yasushi; Hayashizaki, Yoshihide; Hide, Winston; Chakraborty, Ranajit; Nishikawa, Ken; Sugawara, Hideaki; Tateno, Yoshio; Chen, Zhu; Oishi, Michio; Tonellato, Peter; Apweiler, Rolf; Okubo, Kousaku; Wagner, Lukas; Wiemann, Stefan; Strausberg, Robert L; Isogai, Takao; Auffray, Charles; Nomura, Nobuo; Sugano, Sumio
2004-01-01
The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology. PMID:15103394
USDA-ARS's Scientific Manuscript database
Low-frequency coding DNA sequence variants in the proprotein convertase subtilisin/kexin type 9 gene (PCSK9) lower plasma low-density lipoprotein cholesterol (LDL-C), protect against risk of coronary heart disease (CHD), and have prompted the development of a new class of therapeutics. It is uncerta...
Xu, Chang; Nezami Ranjbar, Mohammad R; Wu, Zhong; DiCarlo, John; Wang, Yexun
2017-01-03
Detection of DNA mutations at very low allele fractions with high accuracy will significantly improve the effectiveness of precision medicine for cancer patients. To achieve this goal through next generation sequencing, researchers need a detection method that 1) captures rare mutation-containing DNA fragments efficiently in the mix of abundant wild-type DNA; 2) sequences the DNA library extensively to deep coverage; and 3) distinguishes low level true variants from amplification and sequencing errors with high accuracy. Targeted enrichment using PCR primers provides researchers with a convenient way to achieve deep sequencing for a small, yet most relevant region using benchtop sequencers. Molecular barcoding (or indexing) provides a unique solution for reducing sequencing artifacts analytically. Although different molecular barcoding schemes have been reported in recent literature, most variant calling has been done on limited targets, using simple custom scripts. The analytical performance of barcode-aware variant calling can be significantly improved by incorporating advanced statistical models. We present here a highly efficient, simple and scalable enrichment protocol that integrates molecular barcodes in multiplex PCR amplification. In addition, we developed smCounter, an open source, generic, barcode-aware variant caller based on a Bayesian probabilistic model. smCounter was optimized and benchmarked on two independent read sets with SNVs and indels at 5 and 1% allele fractions. Variants were called with very good sensitivity and specificity within coding regions. We demonstrated that we can accurately detect somatic mutations with allele fractions as low as 1% in coding regions using our enrichment protocol and variant caller.
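The role of molecular barcodes in suppressing amplification and sequencing errors can be sketched with a simple majority vote per barcode family. This is not smCounter's Bayesian model, which the abstract only names; the function names and the two thresholds are hypothetical, and the input is assumed to be the bases observed at a single genomic position, each tagged with its read's barcode.

```python
from collections import Counter, defaultdict

def consensus_by_barcode(observations, min_family_size=2, min_agreement=0.8):
    """Collapse (barcode, base) observations to one consensus base per barcode family."""
    families = defaultdict(list)
    for barcode, base in observations:
        families[barcode].append(base)
    consensus = {}
    for barcode, bases in families.items():
        if len(bases) < min_family_size:
            continue  # too few reads to out-vote an error
        base, count = Counter(bases).most_common(1)[0]
        if count / len(bases) >= min_agreement:
            consensus[barcode] = base
    return consensus

def allele_fraction(consensus, alt):
    """Fraction of barcode families whose consensus base equals alt."""
    if not consensus:
        return 0.0
    return sum(1 for b in consensus.values() if b == alt) / len(consensus)

obs = ([("u1", "A")] * 5 + [("u2", "A")] * 4 + [("u2", "T")]
       + [("u3", "T")] * 3 + [("u4", "T")])
families = consensus_by_barcode(obs)
print(families, allele_fraction(families, "T"))
```

Counting families rather than reads means a single polymerase or sequencer error, such as the lone T in family u2, does not inflate the apparent allele fraction.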
DOE Office of Scientific and Technical Information (OSTI.GOV)
MacArthur, Stewart; Li, Xiao-Yong; Li, Jingyi
2009-05-15
BACKGROUND: We previously established that six sequence-specific transcription factors that initiate anterior/posterior patterning in Drosophila bind to overlapping sets of thousands of genomic regions in blastoderm embryos. While regions bound at high levels include known and probable functional targets, more poorly bound regions are preferentially associated with housekeeping genes and/or genes not transcribed in the blastoderm, and are frequently found in protein coding sequences or in less conserved non-coding DNA, suggesting that many are likely non-functional. RESULTS: Here we show that an additional 15 transcription factors that regulate other aspects of embryo patterning show a similar quantitative continuum of function and binding to thousands of genomic regions in vivo. Collectively, the 21 regulators show a surprisingly high overlap in the regions they bind given that they belong to 11 DNA binding domain families, specify distinct developmental fates, and can act via different cis-regulatory modules. We demonstrate, however, that quantitative differences in relative levels of binding to shared targets correlate with the known biological and transcriptional regulatory specificities of these factors. CONCLUSIONS: It is likely that the overlap in binding of biochemically and functionally unrelated transcription factors arises from the high concentrations of these proteins in nuclei, which, coupled with their broad DNA binding specificities, directs them to regions of open chromatin. We suggest that most animal transcription factors will be found to show a similar broad overlapping pattern of binding in vivo, with specificity achieved by modulating the amount, rather than the identity, of bound factor.
[Structural organization of 5S ribosomal DNA of Rosa rugosa].
Tynkevych, Iu O; Volkov, R A
2014-01-01
To clarify the molecular organization of the genomic region encoding 5S rRNA in the diploid species Rosa rugosa, several 5S rDNA repeated units were cloned and sequenced. Analysis of the obtained sequences revealed that only one length variant of 5S rDNA repeated units is present in the genome; it contains intact promoter elements in the intergenic spacer region (IGS) and appears to be transcriptionally active. Additionally, a limited number of 5S rDNA pseudogenes lacking a portion of the coding sequence and the complete IGS was detected. A high level of sequence similarity (from 93.7 to 97.5%) was found between the IGS of the major 5S rDNA variants of East Asian R. rugosa and North American R. nitida, indicating comparatively recent divergence of these species.
Transcriptome Analysis of Scorpion Species Belonging to the Vaejovis Genus
Quintero-Hernández, Verónica; Ramírez-Carreto, Santos; Romero-Gutiérrez, María Teresa; Valdez-Velázquez, Laura L.; Becerril, Baltazar; Possani, Lourival D.; Ortiz, Ernesto
2015-01-01
Scorpions belonging to the Buthidae family have traditionally drawn much of the biochemist’s attention due to the strong toxicity of their venoms. Scorpions not toxic to mammals, however, also have complex venoms. They have been shown to be an important source of bioactive peptides, some of them identified as potential drug candidates for the treatment of several emerging diseases and conditions. It is therefore important to characterize the large diversity of components found in the non-Buthidae venoms. As a contribution to this goal, this manuscript reports the construction and characterization of cDNA libraries from four scorpion species belonging to the Vaejovis genus of the Vaejovidae family: Vaejovis mexicanus, V. intrepidus, V. subcristatus and V. punctatus. Some sequences coding for channel-acting toxins were found, as expected, but the main transcribed genes in the glands actively producing venom were those coding for non-disulfide-bridged peptides. The ESTs coding for putative channel-acting toxins corresponded to sodium channel β toxins, to members of the potassium channel-acting α or κ families, and to calcium channel-acting toxins of the calcin family. Transcripts for scorpine-like peptides of two different lengths were found, with some of the species coding for both kinds. One sequence coding for La1-like peptides, of as yet unknown function, was found for each species. Finally, the most abundant transcripts corresponded to peptides belonging to the long chain multifunctional NDBP-2 family and to the short antimicrobials of the NDBP-4 family. This apparent venom composition agrees with the data obtained to date for other non-Buthidae species. Our study constitutes the first approach to the characterization of the venom gland transcriptome for scorpion species belonging to the Vaejovidae family. PMID:25659089
Coyne, Robert S; Thiagarajan, Mathangi; Jones, Kristie M; Wortman, Jennifer R; Tallon, Luke J; Haas, Brian J; Cassidy-Hanley, Donna M; Wiley, Emily A; Smith, Joshua J; Collins, Kathleen; Lee, Suzanne R; Couvillion, Mary T; Liu, Yifan; Garg, Jyoti; Pearlman, Ronald E; Hamilton, Eileen P; Orias, Eduardo; Eisen, Jonathan A; Methé, Barbara A
2008-01-01
Background Tetrahymena thermophila, a widely studied model for cellular and molecular biology, is a binucleated single-celled organism with a germline micronucleus (MIC) and somatic macronucleus (MAC). The recent draft MAC genome assembly revealed low sequence repetitiveness, a result of the epigenetic removal of invasive DNA elements found only in the MIC genome. Such low repetitiveness makes complete closure of the MAC genome a feasible goal, which to achieve would require standard closure methods as well as removal of minor MIC contamination of the MAC genome assembly. Highly accurate preliminary annotation of Tetrahymena's coding potential was hindered by the lack of both comparative genomic sequence information from close relatives and significant amounts of cDNA evidence, thus limiting the value of the genomic information and also leaving unanswered certain questions, such as the frequency of alternative splicing. Results We addressed the problem of MIC contamination using comparative genomic hybridization with purified MIC and MAC DNA probes against a whole genome oligonucleotide microarray, allowing the identification of 763 genome scaffolds likely to contain MIC-limited DNA sequences. We also employed standard genome closure methods to essentially finish over 60% of the MAC genome. For the improvement of annotation, we have sequenced and analyzed over 60,000 verified EST reads from a variety of cellular growth and development conditions. Using this EST evidence, a combination of automated and manual reannotation efforts led to updates that affect 16% of the current protein-coding gene models. By comparing EST abundance, many genes showing apparent differential expression between these conditions were identified. Rare instances of alternative splicing and uses of the non-standard amino acid selenocysteine were also identified. Conclusion We report here significant progress in genome closure and reannotation of Tetrahymena thermophila. Our experience to date suggests that complete closure of the MAC genome is attainable. Using the new EST evidence, automated and manual curation has resulted in substantial improvements to the over 24,000 gene models, which will be valuable to researchers studying this model organism as well as for comparative genomics purposes. PMID:19036158
New t-gap insertion-deletion-like metrics for DNA hybridization thermodynamic modeling.
D'yachkov, Arkadii G; Macula, Anthony J; Pogozelski, Wendy K; Renz, Thomas E; Rykov, Vyacheslav V; Torney, David C
2006-05-01
We discuss the concept of t-gap block isomorphic subsequences and use it to describe new abstract string metrics that are similar to the Levenshtein insertion-deletion metric. Some of the metrics that we define can be used to model a thermodynamic distance function on single-stranded DNA sequences. Our model captures a key aspect of the nearest neighbor thermodynamic model for hybridized DNA duplexes. One version of our metric gives the maximum number of stacked pairs of hydrogen bonded nucleotide base pairs that can be present in any secondary structure in a hybridized DNA duplex without pseudoknots. Thermodynamic distance functions are important components in the construction of DNA codes, and DNA codes are important components in biomolecular computing, nanotechnology, and other biotechnical applications that employ DNA hybridization assays. We show how our new distances can be calculated by using a dynamic programming method, and we derive a Varshamov-Gilbert-like lower bound on the size of some of codes using these distance functions as constraints. We also discuss software implementation of our DNA code design methods.
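As a point of reference for the abstract above, the following is a minimal sketch of the classical Levenshtein insertion-deletion metric that the authors say their new metrics resemble. It is not the t-gap block metric itself, whose definition the abstract does not give. It relies on the standard identity that the insertion-deletion distance equals len(x) + len(y) minus twice the length of a longest common subsequence, computed by dynamic programming. The function names are illustrative assumptions.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)          # DP row for the empty prefix of x
    for a in x:
        curr = [0]                     # LCS with the empty prefix of y is 0
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(x: str, y: str) -> int:
    """Minimum number of single-symbol insertions and deletions turning x into y."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


if __name__ == "__main__":
    print(indel_distance("ACGT", "AGT"))   # one deletion suffices
```

Only two rows of the table are kept in memory at a time, so the sketch uses space proportional to the shorter dimension while running in time proportional to len(x) * len(y).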
Epigenetic regulatory mechanisms in vertebrate eye development and disease
Cvekl, A; Mitton, KP
2014-01-01
Eukaryotic DNA is organized as a nucleoprotein polymer termed chromatin with nucleosomes serving as its repetitive architectural units. Cellular differentiation is a dynamic process driven by activation and repression of specific sets of genes, partitioning the genome into transcriptionally active and inactive chromatin domains. Chromatin architecture at individual genes/loci may remain stable through cell divisions, from a single mother cell to its progeny during mitosis, and represents an example of epigenetic phenomena. Epigenetics refers to heritable changes caused by mechanisms distinct from the primary DNA sequence. Recent studies have shown a number of links between chromatin structure, gene expression, extracellular signaling, and cellular differentiation during eye development. This review summarizes recent advances in this field, and the relationship between sequence-specific DNA-binding transcription factors and their roles in recruitment of chromatin remodeling enzymes. In addition, lens and retinal differentiation is accompanied by specific changes in the nucleolar organization, expression of non-coding RNAs, and DNA methylation. Epigenetic regulatory mechanisms in ocular tissues represent exciting areas of research that have opened new avenues for understanding normal eye development, inherited eye diseases and eye diseases related to aging and the environment. PMID:20179734
Low-Bandwidth and Non-Compute Intensive Remote Identification of Microbes from Raw Sequencing Reads
Gautier, Laurent; Lund, Ole
2013-01-01
Cheap DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples. We propose a novel general approach to the analysis of sequencing data where a reference genome does not have to be specified. Using a distributed architecture we are able to query a remote server for hints about what the reference might be, transferring a relatively small amount of data. Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment. To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients: one running in a web browser, and one as a python script. Both can handle a large number of sequencing reads, run on portable devices (the browser-based client was run on a tablet), complete their task within seconds, and consume an amount of bandwidth compatible with mobile broadband networks. Such client-server approaches could develop in the future, allowing fully automated processing of sequencing data and routine instant quality checks of sequencing runs from desktop sequencers. Web access is available at http://tapir.cbs.dtu.dk. The source code for a python command-line client, a server, and supplementary data are available at http://bit.ly/1aURxkc. PMID:24391826
Seligmann, Hervé
2013-03-01
Usual DNA→RNA transcription exchanges T→U. Assuming different systematic symmetric nucleotide exchanges during translation, some GenBank RNAs match exactly human mitochondrial sequences (exchange rules listed in decreasing transcript frequencies): C↔U, A↔U, A↔U+C↔G (two nucleotide pairs exchanged), G↔U, A↔G, C↔G, none for A↔C, A↔G+C↔U, and A↔C+G↔U. Most unusual transcripts involve exchanging uracil. Independent measures of rates of rare replicational enzymatic DNA nucleotide misinsertions predict frequencies of RNA transcripts systematically exchanging the corresponding misinserted nucleotides. Exchange transcripts self-hybridize less than other gene regions, self-hybridization increases with length, suggesting endoribonuclease-limited elongation. Blast detects stop codon depleted putative protein coding overlapping genes within exchange-transcribed mitochondrial genes. These align with existing GenBank proteins (mainly metazoan origins, prokaryotic and viral origins underrepresented). These GenBank proteins frequently interact with RNA/DNA, are membrane transporters, or are typical of mitochondrial metabolism. Nucleotide exchange transcript frequencies increase with overlapping gene densities and stop densities, indicating finely tuned counterbalancing regulation of expression of systematic symmetric nucleotide exchange-encrypted proteins. Such expression necessitates combined activities of suppressor tRNAs matching stops, and nucleotide exchange transcription. Two independent properties confirm predicted exchanged overlap coding genes: discrepancy of third codon nucleotide contents from replicational deamination gradients, and codon usage according to circular code predictions. Predictions from both properties converge, especially for frequent nucleotide exchange types. Nucleotide exchanging transcription apparently increases coding densities of protein coding genes without lengthening genomes, revealing unsuspected functional DNA coding potential. 
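The "systematic symmetric nucleotide exchange" described in the abstract above can be pictured as a fixed swap of one or two nucleotide pairs applied at every position of a sequence. The sketch below is only an illustration of that string operation, assuming an RNA alphabet of A, C, G and U. The function name and interface are hypothetical and do not come from the paper.

```python
def exchange_nucleotides(seq: str, pairs: list[tuple[str, str]]) -> str:
    """Apply symmetric nucleotide exchanges (e.g. C<->U) at every position of seq."""
    table = {}
    for a, b in pairs:
        # Each pair is swapped in both directions, so the rule is symmetric.
        table[ord(a)] = b
        table[ord(b)] = a
    return seq.translate(table)


if __name__ == "__main__":
    rna = "ACGU"
    print(exchange_nucleotides(rna, [("C", "U")]))               # single pair C<->U
    print(exchange_nucleotides(rna, [("A", "U"), ("C", "G")]))   # two pairs exchanged
```

Because each rule is symmetric, applying the same exchange twice returns the original sequence.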
LncRNA Structural Characteristics in Epigenetic Regulation
Wang, Chenguang; Wang, Lianzong; Ding, Yu; Lu, Xiaoyan; Zhang, Guosi; Yang, Jiaxin; Zheng, Hewei; Wang, Hong; Jiang, Yongshuai; Xu, Liangde
2017-01-01
The rapid development of new generation sequencing technology has deepened the understanding of genomes and functional products. RNA-sequencing studies in mammals show that approximately 85% of the DNA sequences have RNA products; those longer than 200 nucleotides (nt) are called long non-coding RNAs (lncRNAs). LncRNAs have now been shown to play important epigenetic regulatory roles in key molecular processes, such as gene expression, genetic imprinting, histone modification, chromatin dynamics, and other activities, by forming specific structures and interacting with many kinds of molecules. This paper mainly discusses the correlation between the structure and function of lncRNAs in light of recent progress in epigenetic regulation, which is important for understanding the mechanism of lncRNAs in physiological and pathological processes. PMID:29292750
Acquisition of New DNA Sequences After Infection of Chicken Cells with Avian Myeloblastosis Virus
Shoyab, M.; Baluda, M. A.; Evans, R.
1974-01-01
DNA-RNA hybridization studies between 70S RNA from avian myeloblastosis virus (AMV) and an excess of DNA from (i) AMV-induced leukemic chicken myeloblasts or (ii) a mixture of normal and of congenitally infected K-137 chicken embryos producing avian leukosis viruses revealed the presence of fast- and slow-hybridizing virus-specific DNA sequences. However, the leukemic cells contained twice the level of AMV-specific DNA sequences observed in normal chicken embryonic cells. The fast-reacting sequences were two to three times more numerous in leukemic DNA than in DNA from the mixed embryos. The slow-reacting sequences had a reiteration frequency of approximately 9 and 6, in the two respective systems. Both the fast- and the slow-reacting DNA sequences in leukemic cells exhibited a higher Tm (2 C) than the respective DNA sequences in normal cells. In normal and leukemic cells the slow hybrid sequences appeared to have a Tm which was 2 C higher than that of the fast hybrid sequences. Individual non-virus-producing chicken embryos, either group-specific antigen positive or negative, contained 40 to 100 copies of the fast sequences and 2 to 6 copies of the slowly hybridizing sequences per cell genome. Normal rat cells did not contain DNA that hybridized with AMV RNA, whereas non-virus-producing rat cells transformed by B-77 avian sarcoma virus contained only the slowly reacting sequences. The results demonstrate that leukemic cells transformed by AMV contain new AMV-specific DNA sequences which were not present before infection. PMID:16789139
Taylor, Jared F.; Khattab, Omar S.; Chen, Yu-Han; Chen, Yumay; Jacobsen, Steven E.; Wang, Ping H.
2015-01-01
Deciphering the multitude of epigenomic and genomic factors that influence the mutation rate is an area of great interest in modern biology. Recently, chromatin has been shown to play a part in this process. To elucidate this relationship further, we integrated our own ultra-deep sequenced human nucleosomal DNA data set with a host of published human genomic and cancer genomic data sets. Our results revealed that differences in nucleosome occupancy are associated with changes in base-specific mutation rates. Increasing nucleosome occupancy is associated with an increasing transition to transversion ratio and an increased germline mutation rate within the human genome. Additionally, cancer single nucleotide variants and microindels are enriched within nucleosomes, and both the coding and non-coding cancer mutation rates increase with increasing nucleosome occupancy. There is an enrichment of cancer indels at the theoretical start (74 bp) and end (115 bp) of linker DNA between two nucleosomes. We then hypothesized that increasing nucleosome occupancy decreases access to DNA by DNA repair machinery and could account for the increasing mutation rate. Such a relationship should not exist in DNA repair knockouts, and we thus repeated our analysis in DNA repair machinery knockouts to test our hypothesis. Indeed, our results revealed no correlation between increasing nucleosome occupancy and increasing mutation rate in DNA repair knockouts. Our findings emphasize the linkage of the genome and epigenome through the nucleosome, whose properties can affect genome evolution and genetic aberrations such as cancer. PMID:26308346
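The transition to transversion ratio mentioned in the abstract above can be made concrete with a short sketch. It uses the standard definitions: a transition is a substitution within the purines (A, G) or within the pyrimidines (C, T), and any other substitution is a transversion. The function names and the input format (a list of reference/alternate base pairs) are assumptions for illustration and are not taken from the paper.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv_ratio(snvs: list[tuple[str, str]]) -> float:
    """Number of transitions divided by number of transversions."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("ratio undefined: no transversions observed")
    return transitions / transversions


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
    print(ti_tv_ratio(calls))   # two transitions and two transversions
```

To relate the ratio to nucleosome occupancy as the abstract describes, one would compute it separately for variants binned by occupancy; that binning step is not shown here.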
2012-01-01
Background Tandemly arranged nuclear ribosomal DNA (rDNA), encoding 18S, 5.8S and 26S ribosomal RNA (rRNA), exhibits concerted evolution, a pattern thought to result from the homogenisation of rDNA arrays. However, rDNA homogeneity at the single nucleotide polymorphism (SNP) level has not been detailed in organisms with more than a few hundred copies of the rDNA unit. Here we study rDNA complexity in species with arrays consisting of thousands of units. Methods We examined homogeneity of genic (18S) and non-coding internally transcribed spacer (ITS1) regions of rDNA using Roche 454 and/or Illumina platforms in four angiosperm species, Nicotiana sylvestris, N. tomentosiformis, N. otophora and N. kawakamii. We compared the data with Southern blot hybridisation revealing the structure of intergenic spacer (IGS) sequences and with the number and distribution of rDNA loci. Results and Conclusions In all four species the intragenomic homogeneity of the 18S gene was high; a single ribotype makes up over 90% of the genes. However, greater variation was observed in the ITS1 region, particularly in species with two or more rDNA loci, where >55% of rDNA units were a single ribotype, with the second most abundant variant accounting for >18% of units. IGS heterogeneity was high in all species. The increased number of ribotypes in ITS1 compared with 18S sequences may reflect rounds of incomplete homogenisation with strong selection for functional genic regions and relaxed selection on ITS1 variants. The relationship between the number of ITS1 ribotypes and the number of rDNA loci leads us to propose that rDNA evolution and complexity are influenced by locus number and/or amplification of orphaned rDNA units at new chromosomal locations. PMID:23259460
The sequence of sequencers: The history of sequencing DNA
Heather, James M.; Chain, Benjamin
2016-01-01
Determining the order of nucleic acid residues in biological samples is an integral component of a wide variety of research applications. Over the last fifty years large numbers of researchers have applied themselves to the production of techniques and technologies to facilitate this feat, sequencing DNA and RNA molecules. This time-scale has witnessed tremendous changes, moving from sequencing short oligonucleotides to millions of bases, from struggling towards the deduction of the coding sequence of a single gene to rapid and widely available whole genome sequencing. This article traverses those years, iterating through the different generations of sequencing technology, highlighting some of the key discoveries, researchers, and sequences along the way. PMID:26554401
Lee, Hwan Young; Yoo, Ji-Eun; Park, Myung Jin; Chung, Ukhee; Kim, Chong-Youl; Shin, Kyoung-Jin
2006-11-01
The present study analyzed 21 coding region SNP markers and one deletion motif for the determination of East Asian mitochondrial DNA (mtDNA) haplogroups by designing three multiplex systems which apply single base extension methods. Using two multiplex systems, all 593 Korean mtDNAs were allocated into 15 haplogroups: M, D, D4, D5, G, M7, M8, M9, M10, M11, R, R9, B, A, and N9. As the D4 haplotypes occurred most frequently in Koreans, the third multiplex system was used to further define D4 subhaplogroups: D4a, D4b, D4e, D4g, D4h, and D4j. This method allowed the complementation of coding region information with control region mutation motifs and the resultant findings also suggest reliable control region mutation motifs for the assignment of East Asian mtDNA haplogroups. These three multiplex systems produce good results in degraded samples as they contain small PCR products (101-154 bp) for single base extension reactions. SNP scoring was performed in 101 old skeletal remains using these three systems to prove their utility in degraded samples. The sequence analysis of mtDNA control region with high incidence of haplogroup-specific mutations and the selective scoring of highly informative coding region SNPs using the three multiplex systems are useful tools for most applications involving East Asian mtDNA haplogroup determination and haplogroup-directed stringent quality control.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wilkins, T.A.
1993-06-01
This study investigates the molecular events of vacuole ontogeny in rapidly elongated cotton plant cells. Within the DNA coding region, the cotton and carrot cDNA clones exhibit 82.2% nucleotide sequence homology; at the amino acid level cotton and carrot catalytic subunits exhibited 95.7% identity and 2.1% amino acid similarity. When aligned with the analogous sequences from yeast, the cotton protein shared only 60.5% amino acid identity and 12.7% similarity. 10 refs., 1 tab.
Wildman, Derek E.; Uddin, Monica; Liu, Guozhen; Grossman, Lawrence I.; Goodman, Morris
2003-01-01
What do functionally important DNA sites, those scrutinized and shaped by natural selection, tell us about the place of humans in evolution? Here we compare ≈90 kb of coding DNA nucleotide sequence from 97 human genes to their sequenced chimpanzee counterparts and to available sequenced gorilla, orangutan, and Old World monkey counterparts, and, on a more limited basis, to mouse. The nonsynonymous changes (functionally important), like synonymous changes (functionally much less important), show chimpanzees and humans to be most closely related, sharing 99.4% identity at nonsynonymous sites and 98.4% at synonymous sites. On a time scale, the coding DNA divergencies separate the human–chimpanzee clade from the gorilla clade at between 6 and 7 million years ago and place the most recent common ancestor of humans and chimpanzees at between 5 and 6 million years ago. The evolutionary rate of coding DNA in the catarrhine clade (Old World monkey and ape, including human) is much slower than in the lineage to mouse. Among the genes examined, 30 show evidence of positive selection during descent of catarrhines. Nonsynonymous substitutions by themselves, in this subset of positively selected genes, group humans and chimpanzees closest to each other and have chimpanzees diverge about as much from the common human–chimpanzee ancestor as humans do. This functional DNA evidence supports two previously offered taxonomic proposals: family Hominidae should include all extant apes; and genus Homo should include three extant species and two subgenera, Homo (Homo) sapiens (humankind), Homo (Pan) troglodytes (common chimpanzee), and Homo (Pan) paniscus (bonobo chimpanzee). PMID:12766228
Wildman, Derek E; Uddin, Monica; Liu, Guozhen; Grossman, Lawrence I; Goodman, Morris
2003-06-10
What do functionally important DNA sites, those scrutinized and shaped by natural selection, tell us about the place of humans in evolution? Here we compare approximately 90 kb of coding DNA nucleotide sequence from 97 human genes to their sequenced chimpanzee counterparts and to available sequenced gorilla, orangutan, and Old World monkey counterparts, and, on a more limited basis, to mouse. The nonsynonymous changes (functionally important), like synonymous changes (functionally much less important), show chimpanzees and humans to be most closely related, sharing 99.4% identity at nonsynonymous sites and 98.4% at synonymous sites. On a time scale, the coding DNA divergencies separate the human-chimpanzee clade from the gorilla clade at between 6 and 7 million years ago and place the most recent common ancestor of humans and chimpanzees at between 5 and 6 million years ago. The evolutionary rate of coding DNA in the catarrhine clade (Old World monkey and ape, including human) is much slower than in the lineage to mouse. Among the genes examined, 30 show evidence of positive selection during descent of catarrhines. Nonsynonymous substitutions by themselves, in this subset of positively selected genes, group humans and chimpanzees closest to each other and have chimpanzees diverge about as much from the common human-chimpanzee ancestor as humans do. This functional DNA evidence supports two previously offered taxonomic proposals: family Hominidae should include all extant apes; and genus Homo should include three extant species and two subgenera, Homo (Homo) sapiens (humankind), Homo (Pan) troglodytes (common chimpanzee), and Homo (Pan) paniscus (bonobo chimpanzee).
Links, Matthew G; Chaban, Bonnie; Hemmingsen, Sean M; Muirhead, Kevin; Hill, Janet E
2013-08-15
Formation of operational taxonomic units (OTU) is a common approach to data aggregation in microbial ecology studies based on amplification and sequencing of individual gene targets. The de novo assembly of OTU sequences has been recently demonstrated as an alternative to widely used clustering methods, providing robust information from experimental data alone, without any reliance on an external reference database. Here we introduce mPUMA (microbial Profiling Using Metagenomic Assembly, http://mpuma.sourceforge.net), a software package for identification and analysis of protein-coding barcode sequence data. It was developed originally for Cpn60 universal target sequences (also known as GroEL or Hsp60). Using an unattended process that is independent of external reference sequences, mPUMA forms OTUs by DNA sequence assembly and is capable of tracking OTU abundance. mPUMA processes microbial profiles both in terms of the direct DNA sequence as well as in the translated amino acid sequence for protein coding barcodes. By forming OTUs and calculating abundance through an assembly approach, mPUMA is capable of generating inputs for several popular microbiota analysis tools. Using SFF data from sequencing of a synthetic community of Cpn60 sequences derived from the human vaginal microbiome, we demonstrate that mPUMA can faithfully reconstruct all expected OTU sequences and produce compositional profiles consistent with actual community structure. mPUMA enables analysis of microbial communities while empowering the discovery of novel organisms through OTU assembly.
Simon, J W; Slabas, A R
1998-09-18
The GenBank database was searched, using the E. coli malonyl CoA:ACP transacylase (MCAT) sequence, for plant protein/cDNA sequences corresponding to MCAT, a component of plant fatty acid synthetase (FAS) for which the plant cDNA has not been isolated. A 272-bp Zea mays EST sequence (GenBank accession number: AA030706) was identified which has strong homology to the E. coli MCAT. A PCR-derived cDNA probe from Zea mays was used to screen a Brassica napus (rape) cDNA library. This resulted in the isolation of a 1200-bp cDNA clone which encodes an open reading frame corresponding to a protein of 351 amino acids. The protein shows 47% homology to the E. coli MCAT amino acid sequence in the coding region for the mature protein. Expression of a plasmid (pMCATrap2) containing the plant cDNA sequence in Fab D89, an E. coli mutant deficient in MCAT activity, restores growth, demonstrating functional complementation and direct function of the cloned cDNA. This is the first functional evidence supporting the identification of a plant cDNA for MCAT.
Geranyl diphosphate synthase from mint
Croteau, Rodney Bruce; Wildung, Mark Raymond; Burke, Charles Cullen; Gershenzon, Jonathan
1999-01-01
A cDNA encoding geranyl diphosphate synthase from peppermint has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID No:1) is provided which codes for the expression of geranyl diphosphate synthase (SEQ ID No:2) from peppermint (Mentha piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for geranyl diphosphate synthase or for a base sequence sufficiently complementary to at least a portion of the geranyl diphosphate synthase DNA or RNA to enable hybridization therewith (e.g., antisense geranyl diphosphate synthase RNA or fragments of complementary geranyl diphosphate synthase DNA which are useful as polymerase chain reaction primers or as probes for geranyl diphosphate synthase or related genes). In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding geranyl diphosphate synthase. Thus, systems and methods are provided for the recombinant expression of geranyl diphosphate synthase that may be used to facilitate the production, isolation and purification of significant quantities of recombinant geranyl diphosphate synthase for subsequent use, to obtain expression or enhanced expression of geranyl diphosphate synthase in plants in order to enhance the production of monoterpenoids, to produce geranyl diphosphate in cancerous cells as a precursor to monoterpenoids having anti-cancer properties or may be otherwise employed for the regulation or expression of geranyl diphosphate synthase or the production of geranyl diphosphate.
Geranyl diphosphate synthase from mint
Croteau, R.B.; Wildung, M.R.; Burke, C.C.; Gershenzon, J.
1999-03-02
A cDNA encoding geranyl diphosphate synthase from peppermint has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID No:1) is provided which codes for the expression of geranyl diphosphate synthase (SEQ ID No:2) from peppermint (Mentha piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for geranyl diphosphate synthase or for a base sequence sufficiently complementary to at least a portion of the geranyl diphosphate synthase DNA or RNA to enable hybridization therewith (e.g., antisense geranyl diphosphate synthase RNA or fragments of complementary geranyl diphosphate synthase DNA which are useful as polymerase chain reaction primers or as probes for geranyl diphosphate synthase or related genes). In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding geranyl diphosphate synthase. Thus, systems and methods are provided for the recombinant expression of geranyl diphosphate synthase that may be used to facilitate the production, isolation and purification of significant quantities of recombinant geranyl diphosphate synthase for subsequent use, to obtain expression or enhanced expression of geranyl diphosphate synthase in plants in order to enhance the production of monoterpenoids, to produce geranyl diphosphate in cancerous cells as a precursor to monoterpenoids having anti-cancer properties or may be otherwise employed for the regulation or expression of geranyl diphosphate synthase or the production of geranyl diphosphate. 5 figs.
Model annotation for synthetic biology: automating model to nucleotide sequence conversion
Misirli, Goksel; Hallinan, Jennifer S.; Yu, Tommy; Lawson, James R.; Wimalaratne, Sarala M.; Cooling, Michael T.; Wipat, Anil
2011-01-01
Motivation: The need for the automated computational design of genetic circuits is becoming increasingly apparent with the advent of ever more complex and ambitious synthetic biology projects. Currently, most circuits are designed through the assembly of models of individual parts such as promoters, ribosome binding sites and coding sequences. These low level models are combined to produce a dynamic model of a larger device that exhibits a desired behaviour. The larger model then acts as a blueprint for physical implementation at the DNA level. However, the conversion of models of complex genetic circuits into DNA sequences is a non-trivial undertaking due to the complexity of mapping the model parts to their physical manifestation. Automating this process is further hampered by the lack of computationally tractable information in most models. Results: We describe a method for automatically generating DNA sequences from dynamic models implemented in CellML and Systems Biology Markup Language (SBML). We also identify the metadata needed to annotate models to facilitate automated conversion, and propose and demonstrate a method for the markup of these models using RDF. Our algorithm has been implemented in a software tool called MoSeC. Availability: The software is available from the authors' web site http://research.ncl.ac.uk/synthetic_biology/downloads.html. Contact: anil.wipat@ncl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21296753
... exons, the parts of DNA that code for proteins in the body. Researchers like this method because it is faster and cheaper. More still needs to be done before whole genome sequencing becomes a routine part of medical care. Many ...
Transcription and DNA Damage: Holding Hands or Crossing Swords?
D'Alessandro, Giuseppina; d'Adda di Fagagna, Fabrizio
2017-10-27
Transcription has classically been considered a potential threat to genome integrity. Collision between transcription and DNA replication machinery, and retention of DNA:RNA hybrids, may result in genome instability. On the other hand, it has been proposed that active genes repair faster and preferentially via homologous recombination. Moreover, while canonical transcription is inhibited in the proximity of DNA double-strand breaks, a growing body of evidence supports active non-canonical transcription at DNA damage sites. Small non-coding RNAs accumulate at DNA double-strand break sites in mammals and other organisms, and are involved in DNA damage signaling and repair. Furthermore, RNA binding proteins are recruited to DNA damage sites and participate in the DNA damage response. Here, we discuss the impact of transcription on genome stability, the role of RNA binding proteins at DNA damage sites, and the function of small non-coding RNAs generated upon damage in the signaling and repair of DNA lesions. Copyright © 2016 Elsevier Ltd. All rights reserved.
Assessing the utility of the Oxford Nanopore MinION for snake venom gland cDNA sequencing.
Hargreaves, Adam D; Mulley, John F
2015-01-01
Portable DNA sequencers such as the Oxford Nanopore MinION device have the potential to be truly disruptive technologies, facilitating new approaches and analyses and, in some cases, taking sequencing out of the lab and into the field. However, the capabilities of these technologies are still being revealed. Here we show that single-molecule cDNA sequencing using the MinION accurately characterises venom toxin-encoding genes in the painted saw-scaled viper, Echis coloratus. We find the raw sequencing error rate to be around 12%, improved to 0–2% with hybrid error correction and 3% with de novo error correction. Our corrected data provides full coding sequences and 5′ and 3′ UTRs for 29 of 33 candidate venom toxins detected, far superior to Illumina data (13/40 complete) and Sanger-based ESTs (15/29). We suggest that, should the current pace of improvement continue, the MinION will become the default approach for cDNA sequencing in a variety of species. PMID:26623194
Research on Image Encryption Based on DNA Sequence and Chaos Theory
NASA Astrophysics Data System (ADS)
Tian Zhang, Tian; Yan, Shan Jun; Gu, Cheng Yan; Ren, Ran; Liao, Kai Xin
2018-04-01
Encryption is now a common technique for protecting image data from unauthorized access. In recent years, many researchers have proposed encryption algorithms based on DNA sequences, offering a new direction for the design of image encryption algorithms. This paper proposes a new image encryption method based on DNA computing, in which the original image is encrypted by DNA coding and a 1-D logistic chaotic map. First, the algorithm uses two modules as the encryption key: the first uses a real DNA sequence, and the second is generated by a one-dimensional logistic chaotic map. Second, the algorithm uses DNA complementary rules to encode the original image, then uses the key and DNA computing operations to compute each pixel value of the original image, thereby encrypting the whole image. Simulation results show that the algorithm has good encryption effect and security.
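The two ingredients named in this abstract, DNA coding of pixel values and a keystream from a 1-D logistic map, can be illustrated with a minimal Python sketch. The encoding rule (00→A, 01→C, 10→G, 11→T), the base-wise XOR step, and the key parameters x0 and r are all assumptions chosen for illustration; the abstract does not specify them, and the real-DNA-sequence key module is omitted here.

```python
# Illustrative sketch only: the encoding rule, the XOR step and the key
# parameters (x0, r) are assumptions, not the scheme from the paper.

BASES = "ACGT"  # assumed rule: 00->A, 01->C, 10->G, 11->T


def logistic_keystream(x0, r, n):
    """Generate n key bytes from the 1-D logistic map x -> r*x*(1-x)."""
    x = x0
    out = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        out.append(int(x * 256) % 256)
    return out


def byte_to_dna(value):
    """Encode one byte as four nucleotides, two bits per base."""
    return "".join(BASES[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


def dna_to_byte(dna):
    """Decode four nucleotides back into one byte."""
    value = 0
    for base in dna:
        value = (value << 2) | BASES.index(base)
    return value


def dna_xor(a, b):
    """XOR two equal-length DNA strings base by base on their 2-bit codes."""
    return "".join(BASES[BASES.index(x) ^ BASES.index(y)] for x, y in zip(a, b))


def encrypt(pixels, x0=0.3141, r=3.99):
    """Encrypt a list of 0..255 pixel values into DNA-coded cipher words."""
    key = logistic_keystream(x0, r, len(pixels))
    return [dna_xor(byte_to_dna(p), byte_to_dna(k)) for p, k in zip(pixels, key)]


def decrypt(cipher, x0=0.3141, r=3.99):
    """Invert encrypt() by regenerating the same keystream."""
    key = logistic_keystream(x0, r, len(cipher))
    return [dna_to_byte(dna_xor(c, byte_to_dna(k))) for c, k in zip(cipher, key)]
```

Because XOR is its own inverse, decrypt(encrypt(pixels)) returns the original pixel list as long as both sides use the same x0 and r. A toy construction like this makes no security claim.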
SeqCompress: an algorithm for biological sequence compression.
Sardaraz, Muhammad; Tahir, Muhammad; Ikram, Ataul Aziz; Bajwa, Hassan
2014-10-01
The growth of Next Generation Sequencing technologies presents significant research challenges, specifically the design of bioinformatics tools that handle massive amounts of data efficiently. The cost of storing biological sequence data has become a noticeable proportion of the total cost of data generation and analysis. In particular, the rate of DNA sequencing is increasing significantly faster than disk storage capacity, so sequence data may outgrow the available storage. It is essential to develop algorithms that handle large data sets via better memory management. This article presents a DNA sequence compression algorithm, SeqCompress, that copes with the space complexity of biological sequences. The algorithm is based on lossless data compression and uses a statistical model as well as arithmetic coding to compress DNA sequences. The proposed algorithm is compared with recent specialized compression tools for biological sequences. Experimental results show that the proposed algorithm has better compression gain than other existing algorithms. Copyright © 2014 Elsevier Inc. All rights reserved.
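To make the pairing of a statistical model with arithmetic coding concrete, the following Python sketch estimates how many bits an ideal arithmetic coder would spend on a DNA string under a simple adaptive context model. The order-k context model and the add-one smoothing are assumptions made for illustration; this is not SeqCompress, whose actual model the abstract does not describe.

```python
import math
from collections import defaultdict


def ideal_code_length_bits(seq, order=2):
    """Bits an ideal arithmetic coder would need for seq under an adaptive
    order-`order` context model with add-one (Laplace) smoothing.

    Assumes seq contains only the letters A, C, G and T.
    """
    alphabet = "ACGT"
    # Each context starts with a count of 1 per symbol (the Laplace prior).
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 1))
    bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        table = counts[context]
        p = table[base] / sum(table.values())
        bits += -math.log2(p)  # ideal arithmetic-coding cost of this symbol
        table[base] += 1       # update after coding, as a decoder could
    return bits
```

Dividing the result by len(seq) gives bits per base, which can be compared against the 2 bits per base of plain fixed-width packing: repetitive sequences fall well below 2, while sequences the model cannot predict stay near or slightly above it.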
Escribano, Julio; Coca-Prados, Miguel
2002-08-28
The ciliary body is largely known for its major roles in the regulation of aqueous humor secretion, intraocular pressure, and accommodation of the lens. In this review article we applied bioinformatics to re-examine hundreds of expressed sequence tags (ESTs) previously isolated by subtractive hybridization from a human ciliary body library [1]. The DNA sequences of these clones have recently been added to the NEIBank web site. DNA sequence comparisons of subtracted ESTs were performed against all entries in the latest available release of the non-redundant database containing GenBank, EMBL, DDBJ and PDB sequences, using the BlastN program accessed through NCBI's BLAST services on the internet (NCBI). Sequences were also compared and mapped using the Blast search program provided through the Internet by the Human Genome Project (UCSC). A total of 284 independent ESTs were classified into 17 functional groups. Analysis of their relationships allowed us to define the expression of five major groups of known genes: (i) protein synthesis, folding, secretion and degradation (20%); (ii) energy supply and biosynthesis (12%); (iii) contractility and cytoskeleton structure (6%); (iv) cellular signaling and cell cycle regulation (7%); and (v) nerve cell related tasks (2%), including neuropeptide processing and putative non-visual phototransduction and circadian rhythm control. The largest group contains unidentified sequences, a total of 105 sequences, accounting for 37% of ESTs. The unidentified sequences show similarity to genomic non-coding regions or to genes of unknown function. The most highly represented EST corresponds to myocilin, a gene involved in glaucoma. The data also confirm the secretory functions of the ciliary epithelium and its high metabolism, as well as the presence of a neuroendocrine peptidergic system presumably involved in the regulation of intraocular pressure and/or aqueous humor secretion. Additional genes may be related to a non-visual phototransduction cascade and/or to circadian rhythms. Overall, this initial group of subtracted ESTs can help uncover novel physiological functions of the ciliary body in normal and disease states, as well as novel candidate genes for ocular diseases.
Secco, David; Wang, Chuang; Shou, Huixia; Schultz, Matthew D; Chiarenza, Serge; Nussaume, Laurent; Ecker, Joseph R; Whelan, James; Lister, Ryan
2015-07-21
Cytosine DNA methylation (mC) is a genome modification that can regulate the expression of coding and non-coding genetic elements. However, little is known about the involvement of mC in response to environmental cues. Using whole genome bisulfite sequencing to assess the spatio-temporal dynamics of mC in rice grown under phosphate starvation and recovery conditions, we identified widespread phosphate starvation-induced changes in mC, preferentially localized in transposable elements (TEs) close to highly induced genes. These changes in mC occurred after changes in nearby gene transcription, were mostly DCL3a-independent, and could partially be propagated through mitosis; however, no evidence of meiotic transmission was observed. Similar analyses performed in Arabidopsis revealed a very limited effect of phosphate starvation on mC, suggesting a species-specific mechanism. Overall, this suggests that TEs in proximity to environmentally induced genes are silenced via hypermethylation, and establishes the temporal hierarchy of transcriptional and epigenomic changes in response to stress.
The Evolution of Dark Matter in the Mitogenome of Seed Beetles
Sayadi, Ahmed; Immonen, Elina; Tellgren-Roth, Christian
2017-01-01
Animal mitogenomes are generally thought of as being economic and optimized for rapid replication and transcription. We use long-read sequencing technology to assemble the remarkable mitogenomes of four species of seed beetles. These are the largest circular mitogenomes ever assembled in insects, ranging from 24,496 to 26,613 bp in total length, and are exceptional in that some 40% consists of non-coding DNA. The size expansion is due to two very long intergenic spacers (LIGSs), rich in tandem repeats. The two LIGSs are present in all species but vary greatly in length (114–10,408 bp), show very low sequence similarity, divergent tandem repeat motifs, a very high AT content and concerted length evolution. The LIGSs have been retained for at least some 45 my but must have undergone repeated reductions and expansions, despite strong purifying selection on protein coding mtDNA genes. The LIGSs are located in two intergenic sites where a few recent studies of insects have also reported shorter LIGSs (>200 bp). These sites may represent spaces that tolerate neutral repeat array expansions or, alternatively, the LIGSs may function to allow a more economic translational machinery. Mitochondrial respiration in adult seed beetles is based almost exclusively on fatty acids, which reduces the need for building complex I of the oxidative phosphorylation pathway (NADH dehydrogenase). One possibility is thus that the LIGSs may allow depressed transcription of NAD genes. RNA sequencing showed that LIGSs are partly transcribed and transcriptional profiling suggested that all seven mtDNA NAD genes indeed show low levels of transcription and co-regulation of transcription across sexes and tissues. PMID:29048527
Early detection of non-native fishes using next-generation DNA sequencing of fish larvae
Our objective was to evaluate the use of fish larvae for early detection of non-native fishes, comparing traditional and molecular taxonomy based on next-generation DNA sequencing to investigate potential efficiencies. Our approach was to intensively sample a Great Lakes non-nati...
Analysis of protein-coding genetic variation in 60,706 humans.
Lek, Monkol; Karczewski, Konrad J; Minikel, Eric V; Samocha, Kaitlin E; Banks, Eric; Fennell, Timothy; O'Donnell-Luria, Anne H; Ware, James S; Hill, Andrew J; Cummings, Beryl B; Tukiainen, Taru; Birnbaum, Daniel P; Kosmicki, Jack A; Duncan, Laramie E; Estrada, Karol; Zhao, Fengmei; Zou, James; Pierce-Hoffman, Emma; Berghout, Joanne; Cooper, David N; Deflaux, Nicole; DePristo, Mark; Do, Ron; Flannick, Jason; Fromer, Menachem; Gauthier, Laura; Goldstein, Jackie; Gupta, Namrata; Howrigan, Daniel; Kiezun, Adam; Kurki, Mitja I; Moonshine, Ami Levy; Natarajan, Pradeep; Orozco, Lorena; Peloso, Gina M; Poplin, Ryan; Rivas, Manuel A; Ruano-Rubio, Valentin; Rose, Samuel A; Ruderfer, Douglas M; Shakir, Khalid; Stenson, Peter D; Stevens, Christine; Thomas, Brett P; Tiao, Grace; Tusie-Luna, Maria T; Weisburd, Ben; Won, Hong-Hee; Yu, Dongmei; Altshuler, David M; Ardissino, Diego; Boehnke, Michael; Danesh, John; Donnelly, Stacey; Elosua, Roberto; Florez, Jose C; Gabriel, Stacey B; Getz, Gad; Glatt, Stephen J; Hultman, Christina M; Kathiresan, Sekar; Laakso, Markku; McCarroll, Steven; McCarthy, Mark I; McGovern, Dermot; McPherson, Ruth; Neale, Benjamin M; Palotie, Aarno; Purcell, Shaun M; Saleheen, Danish; Scharf, Jeremiah M; Sklar, Pamela; Sullivan, Patrick F; Tuomilehto, Jaakko; Tsuang, Ming T; Watkins, Hugh C; Wilson, James G; Daly, Mark J; MacArthur, Daniel G
2016-08-18
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.
Whitaker, Weston R; Lee, Hanson; Arkin, Adam P; Dueber, John E
2015-03-20
Genetic sequences ported into non-native hosts for synthetic biology applications can gain unexpected properties. In this study, we explored sequences functioning as ribosome binding sites (RBSs) within protein coding DNA sequences (CDSs) that cause internal translation, resulting in truncated proteins. Genome-wide prediction of bacterial RBSs, based on biophysical calculations employed by the RBS calculator, suggests a selection against internal RBSs within CDSs in Escherichia coli, but not those in Saccharomyces cerevisiae. Based on these calculations, silent mutations aimed at removing internal RBSs can effectively reduce truncation products from internal translation. However, a solution for complete elimination of internal translation initiation is not always feasible due to constraints of available coding sequences. Fluorescence assays and Western blot analysis showed that in genes with internal RBSs, increasing the strength of the intended upstream RBS had little influence on the internal translation strength. Another strategy to minimize truncated products from an internal RBS is to increase the relative strength of the upstream RBS with a concomitant reduction in promoter strength to achieve the same protein expression level. Unfortunately, lower transcription levels result in increased noise at the single cell level due to stochasticity in gene expression. At the low expression regimes desired for many synthetic biology applications, this problem becomes particularly pronounced. We found that balancing promoter strengths and upstream RBS strengths to intermediate levels can achieve the target protein concentration while avoiding both excessive noise and truncated protein.
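The idea of an internal RBS inside a coding sequence can be illustrated with a crude motif scan in Python. This is a sketch under stated assumptions: it looks only for the Shine-Dalgarno core GGAGG followed 5 to 10 nt downstream by an in-frame ATG, and the motif, the spacing window and the function name are illustrative choices. It is not the biophysical model of the RBS calculator mentioned in the abstract, which scores binding energetics instead of matching a fixed motif.

```python
import re

# Assumed Shine-Dalgarno core motif; the lookahead finds overlapping hits too.
SD_CORE = re.compile(r"(?=GGAGG)")
SD_LEN = 5


def internal_rbs_candidates(cds, min_gap=5, max_gap=10):
    """Return (sd_pos, atg_pos) pairs where an SD-like motif is followed,
    min_gap..max_gap nt after its end, by an ATG in frame with the CDS.

    Positions are 0-based offsets from the first base of the start codon.
    """
    cds = cds.upper()
    hits = []
    for match in SD_CORE.finditer(cds):
        sd_start = match.start()
        sd_end = sd_start + SD_LEN
        for gap in range(min_gap, max_gap + 1):
            atg = sd_end + gap
            # In frame means the ATG offset is a multiple of three.
            if atg % 3 == 0 and cds[atg:atg + 3] == "ATG":
                hits.append((sd_start, atg))
    return hits
```

A hit flags a position where a silent mutation in the SD-like motif might be worth considering; whether translation actually initiates there would still need the kind of experimental check (fluorescence assay, Western blot) the abstract describes.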
Roux-Rouquie, M; Marilley, M
2000-09-15
We have modeled local DNA sequence parameters to search for DNA architectural motifs involved in transcription regulation and promotion within the Xenopus laevis ribosomal gene promoter and the intergenic spacer (IGS) sequences. The IGS was found to be shaped into distinct topological domains. First, intrinsic bends split the IGS into domains of common but different helical features. Local parameters at inter-domain junctions exhibit a high variability with respect to intrinsic curvature, bendability and thermal stability. Secondly, the repeated sequence blocks of the IGS exhibit right-handed supercoiled structures which could be related to their enhancer properties. Thirdly, the gene promoter presents both inherent curvature and minor groove narrowing which may be viewed as motifs of a structural code for protein recognition and binding. Such pre-existing deformations could simply be remodeled during the binding of the transcription complex. Alternatively, these deformations could pre-shape the promoter in such a way that further remodeling is facilitated. Mutations shown to abolish promoter curvature as well as intrinsic minor groove narrowing, in a variant which maintained full transcriptional activity, bring circumstantial evidence for structurally-preorganized motifs in relation to transcription regulation and promotion. Using well documented X. laevis rDNA regulatory sequences we showed that computer modeling may be of invaluable assistance in assessing encrypted architectural motifs. The evidence of these DNA topological motifs with respect to the concept of structural code is discussed.
Forrest, Megan E; Saiakhova, Alina; Beard, Lydia; Buchner, David A; Scacheri, Peter C; LaFramboise, Thomas; Markowitz, Sanford; Khalil, Ahmad M
2018-05-09
Long non-coding RNAs (lncRNAs) are frequently dysregulated in many human cancers. We sought to identify candidate oncogenic lncRNAs in human colon tumors by utilizing RNA sequencing data from 22 colon tumors and 22 adjacent normal colon samples from The Cancer Genome Atlas (TCGA). The analysis led to the identification of ~200 differentially expressed lncRNAs. Validation in an independent cohort of normal colon and patient-derived colon cancer cell lines identified a novel lncRNA, lincDUSP, as a potential candidate oncogene. Knockdown of lincDUSP in patient-derived colon tumor cell lines resulted in significantly decreased cell proliferation and clonogenic potential, and increased susceptibility to apoptosis. The knockdown of lincDUSP affects the expression of ~800 genes, and NCI pathway analysis showed enrichment of DNA damage response and cell cycle control pathways. Further, identification of lincDUSP chromatin occupancy sites by ChIRP-Seq demonstrated association with genes involved in the replication-associated DNA damage response and cell cycle control. Consistent with these findings, lincDUSP knockdown in colon tumor cell lines increased both the accumulation of cells in early S-phase and γH2AX foci formation, indicating increased DNA damage response induction. Taken together, these results demonstrate a key role of lincDUSP in the regulation of important pathways in colon cancer.
High-resolution characterization of sequence signatures due to non-random cleavage of cell-free DNA.
Chandrananda, Dineika; Thorne, Natalie P; Bahlo, Melanie
2015-06-17
High-throughput sequencing of cell-free DNA fragments found in human plasma has been used to non-invasively detect fetal aneuploidy, monitor organ transplants and investigate tumor DNA. However, many biological properties of this extracellular genetic material remain unknown. Research that further characterizes circulating DNA could substantially increase its diagnostic value by allowing the application of more sophisticated bioinformatics tools that lead to an improved signal to noise ratio in the sequencing data. In this study, we investigate various features of cell-free DNA in plasma using deep-sequencing data from two pregnant women (>70X, >50X) and compare them with matched cellular DNA. We utilize a descriptive approach to examine how the biological cleavage of cell-free DNA affects different sequence signatures such as fragment lengths, sequence motifs at fragment ends and the distribution of cleavage sites along the genome. We show that the size distributions of these cell-free DNA molecules are dependent on their autosomal and mitochondrial origin as well as the genomic location within chromosomes. DNA mapping to particular microsatellites and alpha repeat elements displays unique size signatures. By correlating the mapping locations of these fragments with ENCODE annotation of chromatin organization, we show that cell-free fragments occur in clusters along the genome, localize to nucleosomal arrays and are preferentially cleaved at linker regions. Our work further demonstrates that cell-free autosomal DNA cleavage is sequence dependent. The region spanning up to 10 positions on either side of the DNA cleavage site shows a consistent pattern of preference for specific nucleotides. This sequence motif is present in cleavage sites localized to nucleosomal cores and linker regions but is absent in nucleosome-free mitochondrial DNA. These background signals in cell-free DNA sequencing data stem from the non-random biological cleavage of these fragments.
This sequence structure can be harnessed to improve bioinformatics algorithms, in particular for CNV and structural variant detection. Descriptive measures for cell-free DNA features developed here could also be used in biomarker analysis to monitor the changes that occur during different pathological conditions.
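The positional nucleotide preference around cleavage sites described in this abstract can be tabulated with a short Python sketch. The function name, the coordinate convention (a cut at c lies between reference[c-1] and reference[c]) and the in-memory string reference are assumptions for illustration; the study itself worked from mapped deep-sequencing reads, and its exact procedure is not given in the abstract.

```python
from collections import Counter


def cleavage_site_profile(reference, cut_sites, flank=10):
    """Count nucleotides at offsets -flank..flank-1 around each cut site.

    A cut site c means the break lies between reference[c-1] and
    reference[c]. Sites closer than `flank` to either end are skipped.
    """
    profile = {off: Counter() for off in range(-flank, flank)}
    for c in cut_sites:
        if c - flank < 0 or c + flank > len(reference):
            continue  # not enough sequence context on one side
        for off in range(-flank, flank):
            profile[off][reference[c + off]] += 1
    return profile
```

Comparing each offset's counts with the background base composition of the reference would show which positions deviate; with no sequence preference, every offset should resemble the background.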
Beaudet, Denis; Nadimi, Maryam; Iffis, Bachir; Hijri, Mohamed
2013-01-01
Arbuscular mycorrhizal fungi (AMF) are common and important plant symbionts. They have coenocytic hyphae and form multinucleated spores. The nuclear genome of AMF is polymorphic and its organization is not well understood, which makes the development of reliable molecular markers challenging. In stark contrast, their mitochondrial genome (mtDNA) is homogeneous. To assess the intra- and inter-specific mitochondrial variability in closely related Glomus species, we performed 454 sequencing on total genomic DNA of Glomus sp. isolate DAOM-229456 and we compared its mtDNA with two G. irregulare isolates. We found that the mtDNA of Glomus sp. is homogeneous, identical in gene order and, with respect to the sequences of coding regions, almost identical to G. irregulare. However, certain genomic regions vary substantially, due to insertions/deletions of elements such as introns, mitochondrial plasmid-like DNA polymerase genes and mobile open reading frames. We found no evidence of mitochondrial or cytoplasmic plasmids in Glomus species, and mobile ORFs in Glomus are responsible for the formation of four gene hybrids in atp6, atp9, cox2, and nad3, which are most probably the result of horizontal gene transfer and are expressed at the mRNA level. We found evidence for substantial sequence variation in defined regions of mtDNA, even among closely related isolates with otherwise identical coding gene sequences. This variation makes it possible to design reliable intra- and inter-specific markers. PMID:23637766
Farah, Azman H.; Lee, Shiou Yih; Gao, Zhihui; Yao, Tze Leong; Madon, Maria; Mohamed, Rozi
2018-01-01
The tribe Aquilarieae of the family Thymelaeaceae consists of two genera, Aquilaria and Gyrinops, with a total of 30 species, distributed from northeast India, through southeast Asia and the south of China, to Papua New Guinea. They are an important botanical resource for fragrant agarwood, a prized product derived from injured or infected stems of these species. The aim of this study was to estimate the genome size of selected Aquilaria species and comprehend the evolutionary history of Aquilarieae speciation through molecular phylogeny. Five non-coding chloroplast DNA regions and a nuclear region were sequenced from 12 Aquilaria and three Gyrinops species. Phylogenetic trees constructed using combined chloroplast DNA sequences revealed relationships of the studied 15 members in Aquilarieae, while nuclear ribosomal DNA internal transcribed spacer (ITS) sequences showed a paraphyletic relationship between Aquilaria species from Indochina and Malesian. We exposed, for the first time, the estimated divergence time for Aquilarieae speciation, which was speculated to happen during the Miocene Epoch. The ancestral split and biogeographic pattern of studied species were discussed. Results showed no large variation in the 2C-values for the five Aquilaria species (1.35–2.23 pg). Further investigation into the genome size may provide additional information regarding ancestral traits and its evolution history. PMID:29896211
Complete mitochondrial genome sequence of Urechis caupo, a representative of the phylum Echiura
Boore, Jeffrey L
2004-01-01
Background Mitochondria contain small genomes that are physically separate from those of nuclei. Their comparison serves as a model system for understanding the processes of genome evolution. Although hundreds of these genome sequences have been reported, the taxonomic sampling is highly biased toward vertebrates and arthropods, with many whole phyla remaining unstudied. This is the first description of a complete mitochondrial genome sequence of a representative of the phylum Echiura, that of the fat innkeeper worm, Urechis caupo. Results This mtDNA is 15,113 nts in length and 62% A+T. It contains the 37 genes that are typical for animal mtDNAs in an arrangement somewhat similar to that of annelid worms. All genes are encoded by the same DNA strand which is rich in A and C relative to the opposite strand. Codons ending with the dinucleotide GG are more frequent than would be expected from apparent mutational biases. The largest non-coding region is only 282 nts long, is 71% A+T, and has potential for secondary structures. Conclusions Urechis caupo mtDNA shares many features with those of the few studied annelids, including the common usage of ATG start codons, unusual among animal mtDNAs, as well as gene arrangements, tRNA structures, and codon usage biases. PMID:15369601
Cooper, David N.; Bacolla, Albino; Férec, Claude; Vasquez, Karen M.; Kehrer-Sawatzki, Hildegard; Chen, Jian-Min
2011-01-01
Different types of human gene mutation may vary in size, from structural variants (SVs) to single base-pair substitutions, but what they all have in common is that their nature, size and location are often determined either by specific characteristics of the local DNA sequence environment or by higher-order features of the genomic architecture. The human genome is now recognized to contain ‘pervasive architectural flaws’ in that certain DNA sequences are inherently mutation-prone by virtue of their base composition, sequence repetitivity and/or epigenetic modification. Here we explore how the nature, location and frequency of different types of mutation causing inherited disease are shaped in large part, and often in remarkably predictable ways, by the local DNA sequence environment. The mutability of a given gene or genomic region may also be influenced indirectly by a variety of non-canonical (non-B) secondary structures whose formation is facilitated by the underlying DNA sequence. Since these non-B DNA structures can interfere with subsequent DNA replication and repair, and may serve to increase mutation frequencies in generalized fashion (i.e. both in the context of subtle mutations and SVs), they have the potential to serve as a unifying concept in studies of mutational mechanisms underlying human inherited disease. PMID:21853507
Extracting DNA words based on the sequence features: non-uniform distribution and integrity.
Li, Zhi; Cao, Hongyan; Cui, Yuehua; Zhang, Yanbo
2016-01-25
A DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms, such as motif discovery algorithms, depend on the quality of background information about sequences, it is necessary to develop an ab initio algorithm for extracting the "words" based only on the DNA sequences. We considered non-uniform distribution and integrity to be two important features of a word, and on this basis we developed an ab initio algorithm to extract "DNA words" that have potential functional meaning. A Kolmogorov-Smirnov test was used to test whether DNA sequences are consistent with a uniform distribution, and integrity was judged by sequence and position alignment. Two random base sequences were adopted as negative controls, and an English book was used as a positive control to verify our algorithm. We applied our algorithm to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli to show the utility of the method. The results provide strong evidence that the algorithm is a promising tool for building a DNA dictionary ab initio. Our method provides a fast way to screen for important DNA elements on a large scale and offers potential insights into the understanding of a genome.
Nucleic Acid Chaperone Activity of the ORF1 Protein from the Mouse LINE-1 Retrotransposon
Martin, Sandra L.; Bushman, Frederic D.
2001-01-01
Non-LTR retrotransposons such as L1 elements are major components of the mammalian genome, but their mechanism of replication is incompletely understood. Like retroviruses and LTR-containing retrotransposons, non-LTR retrotransposons replicate by reverse transcription of an RNA intermediate. The details of cDNA priming and integration, however, differ between these two classes. In retroviruses, the nucleocapsid (NC) protein has been shown to assist reverse transcription by acting as a “nucleic acid chaperone,” promoting the formation of the most stable duplexes between nucleic acid molecules. A protein-coding region with an NC-like sequence is present in most non-LTR retrotransposons, but no such sequence is evident in mammalian L1 elements or other members of its class. Here we investigated the ORF1 protein from mouse L1 and found that it does in fact display nucleic acid chaperone activities in vitro. L1 ORF1p (i) promoted annealing of complementary DNA strands, (ii) facilitated strand exchange to form the most stable hybrids in competitive displacement assays, and (iii) facilitated melting of an imperfect duplex but stabilized perfect duplexes. These findings suggest a role for L1 ORF1p in mediating nucleic acid strand transfer steps during L1 reverse transcription. PMID:11134335
Lijavetzky, Diego; Cabezas, José Antonio; Ibáñez, Ana; Rodríguez, Virginia; Martínez-Zapater, José M
2007-01-01
Background Single-nucleotide polymorphisms (SNPs) are the most abundant type of DNA sequence polymorphisms. Their higher availability and stability when compared to simple sequence repeats (SSRs) provide enhanced possibilities for genetic and breeding applications such as cultivar identification, construction of genetic maps, the assessment of genetic diversity, the detection of genotype/phenotype associations, or marker-assisted breeding. In addition, the efficiency of these activities can be improved thanks to the ease with which SNP genotyping can be automated. Expressed sequence tag (EST) sequencing projects in grapevine are allowing for the in silico detection of multiple putative sequence polymorphisms within and among a reduced number of cultivars. In parallel, the sequence of the grapevine cultivar Pinot Noir is also providing thousands of polymorphisms present in this highly heterozygous genome. Still, the general application of those SNPs requires further validation since their use could be restricted to those specific genotypes. Results In order to develop a large SNP set of wide application in grapevine we followed a systematic re-sequencing approach in a group of 11 grape genotypes corresponding to ancient unrelated cultivars as well as wild plants. Using this approach, we have sequenced 230 gene fragments, which represents the analysis of over 1 Mb of grape DNA sequence. This analysis has allowed the discovery of 1573 SNPs with an average of one SNP every 64 bp (one SNP every 47 bp in non-coding regions and every 69 bp in coding regions). Nucleotide diversity in grape (π = 0.0051) was found to be similar to values observed in highly polymorphic plant species such as maize. The average number of haplotypes per gene sequence was estimated as six, with three haplotypes representing over 83% of the analyzed sequences. 
Short-range linkage disequilibrium (LD) studies within the analyzed sequences indicate a rapid decay of LD within the selected grapevine genotypes. To validate the use of the detected polymorphisms in genetic mapping, cultivar identification and genetic diversity studies we have used the SNPlex™ genotyping technology in a sample of grapevine genotypes and segregating progenies. Conclusion These results provide accurate values for nucleotide diversity in coding sequences and a first estimate of short-range LD in grapevine. Using SNPlex™ genotyping we have shown the application of a set of discovered SNPs as molecular markers for cultivar identification, linkage mapping and genetic diversity studies. Thus, the combination of a highly efficient re-sequencing approach and the SNPlex™ high-throughput genotyping technology provides a powerful tool for grapevine genetic analysis. PMID:18021442
Human somatostatin I: sequence of the cDNA.
Shen, L P; Pictet, R L; Rutter, W J
1982-01-01
RNA has been isolated from a human pancreatic somatostatinoma and used to prepare a cDNA library. After prescreening, clones containing somatostatin I sequences were identified by hybridization with an anglerfish somatostatin I-cloned cDNA probe. From the nucleotide sequence of two of these clones, we have deduced an essentially full-length mRNA sequence, including the preprosomatostatin coding region, 105 nucleotides from the 5' untranslated region and the complete 150-nucleotide 3' untranslated region. The coding region predicts a 116-amino acid precursor protein (Mr, 12.727) that contains somatostatin-14 and -28 at its COOH terminus. The predicted amino acid sequence of human somatostatin-28 is identical to that of somatostatin-28 isolated from the porcine and ovine species. A comparison of the amino acid sequences of human and anglerfish preprosomatostatin I indicated that the COOH-terminal region encoding somatostatin-14 and the adjacent 6 amino acids are highly conserved, whereas the remainder of the molecule, including the signal peptide region, is more divergent. However, many of the amino acid differences found in the pro region of the human and anglerfish proteins are conservative changes. This suggests that the propeptides have a similar secondary structure, which in turn may imply a biological function for this region of the molecule. PMID:6126875
Croteau, Rodney Bruce; Wildung, Mark Raymond; Crock, John E.
1999-01-01
A cDNA encoding (E)-.beta.-farnesene synthase from peppermint (Mentha piperita) has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID NO:1) is provided which codes for the expression of (E)-.beta.-farnesene synthase (SEQ ID NO:2), from peppermint (Mentha piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for (E)-.beta.-farnesene synthase, or for a base sequence sufficiently complementary to at least a portion of (E)-.beta.-farnesene synthase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding (E)-.beta.-farnesene synthase. Thus, systems and methods are provided for the recombinant expression of the aforementioned recombinant (E)-.beta.-farnesene synthase that may be used to facilitate its production, isolation and purification in significant amounts. Recombinant (E)-.beta.-farnesene synthase may be used to obtain expression or enhanced expression of (E)-.beta.-farnesene synthase in plants in order to enhance the production of (E)-.beta.-farnesene, or may be otherwise employed for the regulation or expression of (E)-.beta.-farnesene synthase, or the production of its product.
The complete mitochondrial genome of the bagarius yarrelli from honghe river
NASA Astrophysics Data System (ADS)
Du, M.; Zhou, C. J.; Niu, B. Z.; Liu, Y. H.; Li, N.; Ai, J. L.; Xu, G. L.
2016-08-01
The complete mitochondrial DNA sequence of Bagarius yarrelli from the Honghe River of China is determined in this paper. The circular molecule is 16,524 base pairs long and shows a gene order similar to that of other bony fishes, comprising a non-coding control region, an origin of replication, two ribosomal RNA (rRNA) genes, 22 transfer RNA (tRNA) genes and 13 protein-coding genes. Its overall base composition is 31.4% A, 26.9% C, 15.7% G and 26.0% T, with an A+T content of 57.4%. These mitochondrial data should contribute to further studies of the molecular evolution and population genetics of this species.
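The base composition figures in this record can be checked with a few lines of Python. The sketch below is only illustrative: the function names and the toy sequence are hypothetical, and the real 16,524-bp sequence is not reproduced here. It computes per-base percentages for any sequence and confirms that the reported A and T percentages sum to the stated 57.4%.

```python
from collections import Counter

def base_composition(seq):
    """Return the percentage of A, C, G and T among the unambiguous bases."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return {b: 100.0 * counts[b] / total for b in "ACGT"}

def at_content(composition):
    """A+T percentage from a composition dictionary."""
    return composition["A"] + composition["T"]

# Hypothetical toy sequence, used only to exercise the function.
toy_composition = base_composition("ATGCATTAAC")

# Percentages as reported in the abstract above.
reported = {"A": 31.4, "C": 26.9, "G": 15.7, "T": 26.0}
reported_at = round(at_content(reported), 1)
```

For a real analysis the sequence would be read from a FASTA file and ambiguous symbols such as N would need an explicit policy; here they are simply ignored.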
Beccari, T; Hoade, J; Orlacchio, A; Stirling, J L
1992-01-01
cDNAs encoding the mouse beta-N-acetylhexosaminidase alpha-subunit were isolated from a mouse testis library. The longest of these (1.7 kb) was sequenced and showed 83% similarity with the human alpha-subunit cDNA sequence. The 5' end of the coding sequence was obtained from a genomic DNA clone. Alignment of the human and mouse sequences showed that all three putative N-glycosylation sites are conserved, but that the mouse alpha-subunit has an additional site towards the C-terminus. All eight cysteines in the human sequence are conserved in the mouse. There are an additional two cysteines in the mouse alpha-subunit signal peptide. All amino acids affected in Tay-Sachs-disease mutations are conserved in the mouse. PMID:1379046
NASA Astrophysics Data System (ADS)
Sun, S. M.; Slightom, J. L.; Hall, T. C.
1981-01-01
A plant gene coding for the major storage protein (phaseolin, G1-globulin) of the French bean was isolated from a genomic library constructed in the phage vector Charon 24A. Comparison of the nucleotide sequence of part of the gene with that of the cloned messenger RNA (cDNA) revealed the presence of three intervening sequences, all beginning with GT and ending with AG. The 5' and 3' boundaries of intervening sequences IVS-A (88 base pairs) and IVS-B (124 base pairs) are similar to those described for animal and viral genes, but the 3' boundary of IVS-C (129 base pairs) shows some differences. A sequence of 185 amino acids deduced from the cloned DNAs represents about 40% of a phaseolin polypeptide.
Fractal landscape analysis of DNA walks
NASA Technical Reports Server (NTRS)
Peng, C. K.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Sciortino, F.; Simons, M.; Stanley, H. E.
1992-01-01
By mapping nucleotide sequences onto a "DNA walk", we uncovered remarkably long-range power law correlations [Nature 356 (1992) 168] that imply a new scale invariant property of DNA. We found such long-range correlations in intron-containing genes and in non-transcribed regulatory DNA sequences, but not in cDNA sequences or intron-less genes. In this paper, we present more explicit evidence to support our findings.
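To make the "DNA walk" mapping concrete, here is a minimal Python sketch. The abstract does not state the step rule, so the one used here (step up for a pyrimidine, step down for a purine) is an assumption based on the convention usually attributed to the cited Nature paper; the function name is hypothetical.

```python
def dna_walk(seq):
    """Cumulative displacement y(l) = u(1) + ... + u(l) of a one-dimensional walk.

    Assumed step rule: u = +1 for a pyrimidine (C, T), u = -1 for a purine (A, G).
    Ambiguous symbols are skipped and contribute no step.
    """
    position = 0
    walk = []
    for base in seq.upper():
        if base in "CT":
            position += 1
        elif base in "AG":
            position -= 1
        else:
            continue
        walk.append(position)
    return walk

# Three pyrimidines followed by three purines return the walker to the origin.
example_walk = dna_walk("CCTAGG")
```

Long-range correlation is then assessed from how the fluctuations of this displacement grow with window length, which is the quantity that the fluctuation analyses in these records examine.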
Spliced DNA Sequences in the Paramecium Germline: Their Properties and Evolutionary Potential
Catania, Francesco; McGrath, Casey L.; Doak, Thomas G.; Lynch, Michael
2013-01-01
Despite playing a crucial role in germline-soma differentiation, the evolutionary significance of developmentally regulated genome rearrangements (DRGRs) has received scant attention. An example of DRGR is DNA splicing, a process that removes segments of DNA interrupting genic and/or intergenic sequences. Perhaps, best known for shaping immune-system genes in vertebrates, DNA splicing plays a central role in the life of ciliated protozoa, where thousands of germline DNA segments are eliminated after sexual reproduction to regenerate a functional somatic genome. Here, we identify and chronicle the properties of 5,286 sequences that putatively undergo DNA splicing (i.e., internal eliminated sequences [IESs]) across the genomes of three closely related species of the ciliate Paramecium (P. tetraurelia, P. biaurelia, and P. sexaurelia). The study reveals that these putative IESs share several physical characteristics. Although our results are consistent with excision events being largely conserved between species, episodes of differential IES retention/excision occur, may have a recent origin, and frequently involve coding regions. Our findings indicate interconversion between somatic—often coding—DNA sequences and noncoding IESs, and provide insights into the role of DNA splicing in creating potentially functional genetic innovation. PMID:23737328
Complete mitochondrial DNA sequence of the Eastern keelback mullet Liza affinis.
Gong, Xiaoling; Zhu, Wenjia; Bao, Baolong
2016-05-01
Eastern keelback mullet (Liza affinis) inhabits inlet waters and estuaries of rivers. In this paper, we initially determined the complete mitochondrial genome of Liza affinis. The entire mtDNA sequence is 16,831 bp in length, including 2 rRNA genes, 22 tRNA genes, 13 protein-coding genes and 1 putative control region. Its order and numbers of genes are similar to most bony fishes.
Gubser, Caroline; Smith, Geoffrey L
2002-04-01
Camelpox virus (CMPV) and variola virus (VAR) are orthopoxviruses (OPVs) that share several biological features and cause high mortality and morbidity in their single host species. The sequence of a virulent CMPV strain was determined; it is 202182 bp long, with inverted terminal repeats (ITRs) of 6045 bp and has 206 predicted open reading frames (ORFs). As for other poxviruses, the genes are tightly packed with little non-coding sequence. Most genes within 25 kb of each terminus are transcribed outwards towards the terminus, whereas genes within the centre of the genome are transcribed from either DNA strand. The central region of the genome contains genes that are highly conserved in other OPVs and 87 of these are conserved in all sequenced chordopoxviruses. In contrast, genes towards either terminus are more variable and encode proteins involved in host range, virulence or immunomodulation. In some cases, these are broken versions of genes found in other OPVs. The relationship of CMPV to other OPVs was analysed by comparisons of DNA and predicted protein sequences, repeats within the ITRs and arrangement of ORFs within the terminal regions. Each comparison gave the same conclusion: CMPV is the closest known virus to variola virus, the cause of smallpox.
The sequence of sequencers: The history of sequencing DNA.
Heather, James M; Chain, Benjamin
2016-01-01
Determining the order of nucleic acid residues in biological samples is an integral component of a wide variety of research applications. Over the last fifty years large numbers of researchers have applied themselves to the production of techniques and technologies to facilitate this feat, sequencing DNA and RNA molecules. This time-scale has witnessed tremendous changes, moving from sequencing short oligonucleotides to millions of bases, from struggling towards the deduction of the coding sequence of a single gene to rapid and widely available whole genome sequencing. This article traverses those years, iterating through the different generations of sequencing technology, highlighting some of the key discoveries, researchers, and sequences along the way. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Kikuchi, Shoshi
2009-02-01
Completion of the high-precision genome sequence analysis of rice led to the collection of about 35,000 full-length cDNA clones and the determination of their complete sequences. Mapping of these full-length cDNA sequences has given us information on (1) the number of genes expressed in the rice genome; (2) the start and end positions and exon-intron structures of rice genes; (3) alternative transcripts; (4) possible encoded proteins; (5) non-protein-coding (np) RNAs; (6) the density of gene localization on the chromosome; (7) setting the parameters of gene prediction programs; and (8) the construction of a microarray system that monitors global gene expression. Manual curation for rice gene annotation by using mapping information on full-length cDNA and EST assemblies has revealed about 32,000 expressed genes in the rice genome. Analysis of major gene families, such as those encoding membrane transport proteins (pumps, ion channels, and secondary transporters), along with the evolution from bacteria to higher animals and plants, reveals how gene numbers have increased through adaptation to circumstances. Family-based gene annotation also gives us a new way of comparing organisms. Massive amounts of data on gene expression under many kinds of physiological conditions are being accumulated in rice oligoarrays (22K and 44K) based on full-length cDNA sequences. Cluster analyses of genes that have the same promoter cis-elements, that have similar expression profiles, or that encode enzymes in the same metabolic pathways or signal transduction cascades give us clues to understanding the networks of gene expression in rice. As a tool for that purpose, we recently developed "RiCES", a tool for searching for cis-elements in the promoter regions of clustered genes.
Raymond, Frédéric; Boisvert, Sébastien; Roy, Gaétan; Ritt, Jean-François; Légaré, Danielle; Isnard, Amandine; Stanke, Mario; Olivier, Martin; Tremblay, Michel J.; Papadopoulou, Barbara; Ouellette, Marc; Corbeil, Jacques
2012-01-01
The Leishmania tarentolae Parrot-TarII strain genome sequence was resolved to a mean 16-fold coverage by next-generation DNA sequencing technologies. This is the first genome of a kinetoplastid protozoan that is non-pathogenic to humans to be described, thus providing an opportunity for comparison with the completed genomes of pathogenic Leishmania species. A high synteny was observed between all sequenced Leishmania species. A limited number of chromosomal regions diverged between L. tarentolae and L. infantum, while remaining syntenic to L. major. Globally, >90% of the L. tarentolae gene content was shared with the other Leishmania species. We identified 95 predicted coding sequences unique to L. tarentolae and 250 genes that were absent from L. tarentolae. Interestingly, many of the latter genes were expressed in the intracellular amastigote stage of pathogenic species. In addition, genes coding for products involved in antioxidant defence or participating in vesicular-mediated protein transport were underrepresented in L. tarentolae. In contrast to other Leishmania genomes, two gene families were expanded in L. tarentolae, namely the zinc metallo-peptidase surface glycoprotein GP63 and the promastigote surface antigen PSA31C. Overall, L. tarentolae's gene content appears better adapted to the promastigote insect stage than to the amastigote mammalian stage. PMID:21998295
Nedelcu, Aurora M.; Lee, Robert W.; Lemieux, Claude; Gray, Michael W.; Burger, Gertraud
2000-01-01
Two distinct mitochondrial genome types have been described among the green algal lineages investigated to date: a reduced–derived, Chlamydomonas-like type and an ancestral, Prototheca-like type. To determine if this unexpected dichotomy is real or is due to insufficient or biased sampling and to define trends in the evolution of the green algal mitochondrial genome, we sequenced and analyzed the mitochondrial DNA (mtDNA) of Scenedesmus obliquus. This genome is 42,919 bp in size and encodes 42 conserved genes (i.e., large and small subunit rRNA genes, 27 tRNA and 13 respiratory protein-coding genes), four additional free-standing open reading frames with no known homologs, and an intronic reading frame with endonuclease/maturase similarity. No 5S rRNA or ribosomal protein-coding genes have been identified in Scenedesmus mtDNA. The standard protein-coding genes feature a deviant genetic code characterized by the use of UAG (normally a stop codon) to specify leucine, and the unprecedented use of UCA (normally a serine codon) as a signal for termination of translation. The mitochondrial genome of Scenedesmus combines features of both green algal mitochondrial genome types: the presence of a more complex set of protein-coding and tRNA genes is shared with the ancestral type, whereas the lack of 5S rRNA and ribosomal protein-coding genes as well as the presence of fragmented and scrambled rRNA genes are shared with the reduced–derived type of mitochondrial genome organization. Furthermore, the gene content and the fragmentation pattern of the rRNA genes suggest that this genome represents an intermediate stage in the evolutionary process of mitochondrial genome streamlining in green algae. [The sequence data described in this paper have been submitted to the GenBank data library under accession no. AF204057.] PMID:10854413
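The effect of the deviant genetic code described above can be illustrated with a short Python sketch. It builds the standard codon table from the conventional TCAG ordering and then applies the two reassignments named in the abstract (UAG read as leucine, UCA read as a stop). The function names and the toy reading frame are hypothetical, and all other codons are assumed to keep their standard meaning.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def standard_table():
    """Map each DNA codon to its standard one-letter amino acid ('*' = stop)."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, STANDARD_AA))

def scenedesmus_like_table():
    """Standard table with the two reassignments reported in the abstract."""
    table = standard_table()
    table["TAG"] = "L"  # UAG, normally a stop codon, specifies leucine
    table["TCA"] = "*"  # UCA, normally a serine codon, terminates translation
    return table

def translate(dna, table):
    """Translate in frame 0 and stop at the first termination codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical toy reading frame: ATG TAG GCT TCA GGG
toy = "ATGTAGGCTTCAGGG"
standard_protein = translate(toy, standard_table())
deviant_protein = translate(toy, scenedesmus_like_table())
```

Under the standard code the toy frame terminates at TAG after a single residue, whereas under the modified table translation reads through TAG as leucine and terminates at TCA.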
Publishing large DNA sequence data in reduced spaces and lasting formats, in paper or PDF.
Aguiar, Alexandre Pires
2013-02-04
Scientific publications carry a practical moral duty: they must last. Along that line of thinking, some methods are proposed to allow economically and structurally viable publication of DNA sequence data of any size in printed matter and PDFs. The proposal is primarily aimed at contributing for preserving information for the future, while allowing authors to avoid information splitting and complement storage ex situ, that is, in server machines, outside the publication proper. The technique may also help to solve the impasse between the ICZN Code requirement that a new nomen be associated to diagnostic characters for the taxon vs. the phylogenetic definition of taxa, based on cladograms only: sequence data are characters, and can now be easily and comfortably included in taxonomic publications, with direct textual mention to their diagnostic sections. The compression level achieved allows the inclusion of all wanted DNA or RNA sequences in the same printed matter or PDF publications where the sequences are cited and discussed. Reduced font sizes, invisible fonts, and original 2D black & white and color barcodes are illustrated and briefly discussed. The level of data compression achieved can allow each full page of sequence data, or about 5000 characters, to be precisely coded into a color barcode as small as a square of 1.5 mm. A practical example is provided with Taeniogonalos woodorum Smith (Hymenoptera, Trigonalidae). Free software to generate publishable barcodes from txt or FASTA files is provided at www.systaxon.ufes.br/dna.
Vingron, Martin
2016-01-01
Non-methylated islands (NMIs) of DNA are genomic regions that are important for gene regulation and development. A recent study of genome-wide non-methylation data in vertebrates by Long et al. (eLife 2013;2:e00348) has shown that many experimentally identified non-methylated regions do not overlap with classically defined CpG islands which are computationally predicted using simple DNA sequence features. This is especially true in cold-blooded vertebrates such as Danio rerio (zebrafish). In order to investigate how predictive DNA sequence is of a region’s methylation status, we applied a supervised learning approach using a spectrum kernel support vector machine, to see if a more complex model and supervised learning can be used to improve non-methylated island prediction and to understand the sequence properties of these regions. We demonstrate that DNA sequence is highly predictive of methylation status, and that in contrast to existing CpG island prediction methods our method is able to provide more useful predictions of NMIs genome-wide in all vertebrate organisms that were studied. Our results also show that in cold-blooded vertebrates (Anolis carolinensis, Xenopus tropicalis and Danio rerio) where genome-wide classical CpG island predictions consist primarily of false positives, longer primarily AT-rich DNA sequence features are able to identify these regions much more accurately. PMID:27984582
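The spectrum kernel mentioned above compares two sequences by their k-mer counts. The following Python sketch shows only that kernel computation; the abstract does not give the k-mer length, the training data, or the SVM settings, so the function names and the value of k used here are illustrative assumptions.

```python
from collections import Counter

def kmer_spectrum(seq, k):
    """Count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(seq_a, seq_b, k):
    """Inner product of the two k-mer count vectors."""
    spectrum_a = kmer_spectrum(seq_a, k)
    spectrum_b = kmer_spectrum(seq_b, k)
    return sum(count * spectrum_b[kmer] for kmer, count in spectrum_a.items())

# Hypothetical toy sequences with k = 2.
shared = spectrum_kernel("ACGT", "ACGA", 2)       # the shared 2-mers are AC and CG
self_similarity = spectrum_kernel("CGCG", "CGCG", 2)
```

In a supervised setting, a matrix of such kernel values over labelled methylated and non-methylated regions could be passed to any SVM implementation that accepts a precomputed kernel.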
Molecular phylogeography of the Andean alpine plant, Gunnera magellanica
NASA Astrophysics Data System (ADS)
Shimizu, M.; Fujii, N.; Ito, M.; Asakawa, T.; Nishida, H.; Suyama, C.; Ueda, K.
2015-12-01
To clarify the evolutionary history of Gunnera magellanica (Gunneraceae), an alpine plant of the Andes mountains, we performed molecular phylogeographic analyses based on the sequences of an internal transcribed spacer (ITS) of nuclear ribosomal DNA and four non-coding regions (trnH-psbA, trnL-trnF, atpB-rbcL, rpl16 intron) of chloroplast DNA. We investigated 3, 4, 4 and 11 populations in Ecuador, Bolivia, Argentina, and Chile, respectively, and detected six ITS genotypes (Types A-F) in G. magellanica. Five genotypes (Types A-E) were observed in the northern Andes population (Ecuador and Bolivia); only one ITS genotype (Type F) was observed in the southern Andes population (Chile and Argentina). Phylogenetic analyses showed that the ITS genotypes of the northern and southern Andes populations form different clades with high bootstrap probability. Furthermore, network analysis, analysis of molecular variance, and spatial analysis of molecular variance showed that there were two major clusters (the northern and southern Andes populations) in this species. In the chloroplast DNA analysis, three major clades (northern Andes, Chillan, and southern Andes) were inferred from phylogenetic analyses using four non-coding regions, a finding that was supported by the above three types of analysis. The Chillan clade is the northernmost population in the southern Andes populations. With the exception of the Chillan clade (Chillan population), results of nuclear DNA and chloroplast DNA analyses were consistent. Both markers showed that the northern and southern Andes populations of G. magellanica were genetically different from each other. This clear phylogeographical structure was supported by PERMUT analysis according to Pons & Petit (1995, 1996). Moreover, our preliminary estimate based on the ITS sequences indicates that the northern and southern Andes clades diverged ~0.63-3 million years ago, during a period of upheaval in the Andes. 
This suggests that the populations of G. magellanica that were distributed along the Andes have been divided into the two local populations of the northern and southern Andes during the uplift of the Andes.
Kimura, Tomohiro; Nakano, Toshiki; Yamaguchi, Toshiyasu; Sato, Minoru; Ogawa, Tomohisa; Muramoto, Koji; Yokoyama, Takehiko; Kan-No, Nobuhiro; Nagahisa, Eizou; Janssen, Frank; Grieshaber, Manfred K
2004-01-01
The complete complementary DNA sequences of genes presumably coding for opine dehydrogenases from Arabella iricolor (sandworm), Haliotis discus hannai (abalone), and Patinopecten yessoensis (scallop) were determined, and partial cDNA sequences were derived for Meretrix lusoria (Japanese hard clam) and Spisula sachalinensis (Sakhalin surf clam). The primers ODH-9F and ODH-11R proved useful for amplifying the sequences for opine dehydrogenases from the 4 mollusk species investigated in this study. The sequence of the sandworm was obtained using primers constructed from the amino acid sequence of tauropine dehydrogenase, the main opine dehydrogenase in A. iricolor. The complete cDNA sequence of A. iricolor, H. discus hannai, and P. yessoensis encode 397, 400, and 405 amino acids, respectively. All sequences were aligned and compared with published databank sequences of Loligo opalescens, Loligo vulgaris (squid), Sepia officinalis (cuttlefish), and Pecten maximus (scallop). As expected, a high level of homology was observed for the cDNA from closely related species, such as for cephalopods or scallops, whereas cDNA from the other species showed lower-level homologies. A similar trend was observed when the deduced amino acid sequences were compared. Furthermore, alignment of these sequences revealed some structural motifs that are possibly related to the binding sites of the substrates. The phylogenetic trees derived from the nucleotide and amino acid sequences were consistent with the classification of species resulting from classical taxonomic analyses.
2004-12-09
We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome--composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes--provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.
Non-B-Form DNA Is Enriched at Centromeres
Henikoff, Steven
2018-01-01
Animal and plant centromeres are embedded in repetitive “satellite” DNA, but are thought to be epigenetically specified. To define genetic characteristics of centromeres, we surveyed satellite DNA from diverse eukaryotes and identified variation in <10-bp dyad symmetries predicted to adopt non-B-form conformations. Organisms lacking centromeric dyad symmetries had binding sites for sequence-specific DNA-binding proteins with DNA-bending activity. For example, human and mouse centromeres are depleted for dyad symmetries, but are enriched for non-B-form DNA and are associated with binding sites for the conserved DNA-binding protein CENP-B, which is required for artificial centromere function but is paradoxically nonessential. We also detected dyad symmetries and predicted non-B-form DNA structures at neocentromeres, which form at ectopic loci. We propose that centromeres form at non-B-form DNA because of dyad symmetries or are strengthened by sequence-specific DNA binding proteins. This may resolve the CENP-B paradox and provide a general basis for centromere specification. PMID:29365169
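The notion of a short dyad symmetry in the preceding abstract can be illustrated with a minimal sketch. It assumes a simplified definition (an arm of fixed length followed, after a short spacer, by its own reverse complement); the function names, the arm length, and the spacer limit are illustrative choices, not the authors' actual survey method, which the abstract does not describe.

```python
# Hypothetical helper names; a simplified stand-in for dyad-symmetry detection.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_dyad_symmetries(seq: str, arm: int = 4, max_spacer: int = 3):
    """List (start, arm_sequence, spacer_length) for every arm that is
    followed, after 0..max_spacer bases, by its reverse complement."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * arm + 1):
        left = seq[i:i + arm]
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            j = i + arm + spacer
            # A slice that runs off the end is shorter than `target`
            # and therefore never compares equal.
            if seq[j:j + arm] == target:
                hits.append((i, left, spacer))
    return hits


if __name__ == "__main__":
    # GACC ... GGTC is an inverted repeat separated by a 3-base spacer.
    for hit in find_dyad_symmetries("TTGACCAAAGGTCTT"):
        print(hit)
```

An arm length below 10 matches the "<10-bp" scale mentioned in the abstract; a real analysis would also need to handle ambiguous bases and score structure-forming potential, which this sketch does not attempt.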
Complete cDNA sequence and amino acid analysis of a bovine ribonuclease K6 gene.
Pietrowski, D; Förster, M
2000-01-01
The complete cDNA sequence of a ribonuclease k6 gene of Bos taurus has been determined. It codes for a protein with 154 amino acids and contains the invariant cysteine, histidine and lysine residues as well as the characteristic motifs specific to ribonuclease active sites. The deduced protein sequence is 27 residues longer than other known ribonucleases k6 and shows amino acid exchanges which could reflect a strain specificity or polymorphism within the bovine genome. Based on sequence similarity we have termed the identified gene bovine ribonuclease k6 b (brk6b).
Image Encryption Algorithm Based on Hyperchaotic Maps and Nucleotide Sequences Database
2017-01-01
Image encryption technology is one of the main means to ensure the safety of image information. Using the characteristics of chaos, such as randomness, regularity, ergodicity, and initial value sensitiveness, combined with the unique space conformation of DNA molecules and their unique information storage and processing ability, an efficient method for image encryption based on the chaos theory and a DNA sequence database is proposed. In this paper, digital image encryption employs a process of transforming the image pixel gray value by using chaotic sequence scrambling image pixel location and establishing superchaotic mapping, which maps quaternary sequences and DNA sequences, and by combining with the logic of the transformation between DNA sequences. The bases are replaced under the displaced rules by using DNA coding in a certain number of iterations that are based on the enhanced quaternary hyperchaotic sequence; the sequence is generated by Chen chaos. The cipher feedback mode and chaos iteration are employed in the encryption process to enhance the confusion and diffusion properties of the algorithm. Theoretical analysis and experimental results show that the proposed scheme not only demonstrates excellent encryption but also effectively resists chosen-plaintext attack, statistical attack, and differential attack. PMID:28392799
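The "DNA coding" step mentioned in the preceding abstract can be sketched as a mapping between 2-bit (quaternary) digits and nucleotides. The rule used here (0→A, 1→C, 2→G, 3→T) and both function names are assumptions for illustration only; the abstract does not state which coding rule, chaotic map parameters, or database lookup the authors use, and this sketch covers none of the scrambling or diffusion stages.

```python
# Assumed coding rule (not taken from the paper): quaternary digit -> base.
BASES = "ACGT"  # 0 -> A, 1 -> C, 2 -> G, 3 -> T


def pixel_to_dna(value: int) -> str:
    """Encode an 8-bit gray value as four bases, most significant pair first."""
    if not 0 <= value <= 255:
        raise ValueError("gray value must be in 0..255")
    return "".join(BASES[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


def dna_to_pixel(bases: str) -> int:
    """Decode four bases back to the 8-bit gray value."""
    value = 0
    for base in bases:
        value = (value << 2) | BASES.index(base)
    return value


if __name__ == "__main__":
    gray = 27                      # binary 00 01 10 11
    encoded = pixel_to_dna(gray)   # one base per 2-bit pair
    print(gray, encoded, dna_to_pixel(encoded))
```

Because the mapping is a bijection on each 2-bit pair, any of the other valid coding rules could be substituted by permuting `BASES` without changing the rest of the code.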
Behind the curtain of non-coding RNAs; long non-coding RNAs regulating hepatocarcinogenesis
El Khodiry, Aya; Afify, Menna; El Tayebi, Hend M
2018-01-01
Hepatocellular carcinoma (HCC) is one of the most common and aggressive cancers worldwide. HCC is the fifth most common malignancy in the world and the second leading cause of cancer death in Asia. Long non-coding RNAs (lncRNAs) are RNAs with a length greater than 200 nucleotides that do not encode proteins. lncRNAs can regulate gene expression and protein synthesis in several ways by interacting with DNA, RNA and proteins in a sequence-specific manner. They could regulate cellular and developmental processes through either gene inhibition or gene activation. Many studies have shown that dysregulation of lncRNAs is related to many human diseases such as cardiovascular diseases, genetic disorders, neurological diseases, immune mediated disorders and cancers. However, the study of lncRNAs is challenging as they are poorly conserved between species, their expression levels are lower than those of mRNAs, and they show great interpatient variation. The study of lncRNA expression in cancers has been a breakthrough, as it unveils potential biomarkers and drug targets for cancer therapy and helps understand the mechanism of pathogenesis. This review discusses many long non-coding RNAs and their contribution to HCC, their role in development, metastasis, and prognosis of HCC and how to regulate and target these lncRNAs as a therapeutic tool in HCC treatment in the future. PMID:29434445
Sun, Zichen; Stack, Colin; Šlapeta, Jan
2012-05-25
In order to investigate the genetic variation between Tritrichomonas foetus from bovine and feline origins, cysteine protease 8 (CP8) coding sequence was selected as the polymorphic DNA marker. Direct sequencing of CP8 coding sequence of T. foetus from four feline isolates and two bovine isolates with polymerase chain reaction successfully revealed conserved nucleotide polymorphisms between feline and bovine isolates. These results provide useful information for CP8-based molecular differentiation of T. foetus genotypes. Copyright © 2011 Elsevier B.V. All rights reserved.
Nylinder, Stephan; Cronholm, Bodil; de Lange, Peter J; Walsh, Neville; Anderberg, Arne A
2013-08-01
A species tree phylogeny of the Australian/New Zealand genus Centipeda (Asteraceae) is estimated based on nucleotide sequence data. We analysed sequences of nuclear ribosomal DNA (ETS, ITS) and three plastid loci (ndhF, psbA-trnH, and trnL-F) using the multi-species coalescent module in BEAST. A total of 129 individuals from all 10 recognised species of Centipeda were sampled throughout the species distribution ranges, including two subspecies. We conclude that the inferred species tree topology largely conforms to previous assumptions on species relationships. Centipeda racemosa (Snuffweed) is sister to the remaining species and is also the only consistently perennial representative in the genus. Centipeda pleiocephala (Tall Sneezeweed) and C. nidiformis (Cotton Sneezeweed) constitute a species pair, as do C. borealis and C. minima (Spreading Sneezeweed), all sharing the symplesiomorphic characters of spherical capitulum and convex receptacle with C. racemosa. Another species group comprising C. thespidioides (Desert Sneezeweed), C. cunninghamii (Old man weed, or Common sneeze-weed), and C. crateriformis is well supported but also includes the morphologically aberrant C. aotearoana, all sharing the character of having capitula that mature more slowly relative to the subtending shoot. Centipeda elatinoides takes on a weakly supported intermediate position between the two mentioned groups, and is difficult to relate to any of the former groups based on morphological characters. Copyright © 2013 Elsevier Inc. All rights reserved.
Henrich, Oliver; Gutiérrez Fosado, Yair Augusto; Curk, Tine; Ouldridge, Thomas E
2018-05-10
During the last decade coarse-grained nucleotide models have emerged that allow us to study DNA and RNA on unprecedented time and length scales. Among them is oxDNA, a coarse-grained, sequence-specific model that captures the hybridisation transition of DNA and many structural properties of single- and double-stranded DNA. oxDNA was previously only available as standalone software, but has now been implemented into the popular LAMMPS molecular dynamics code. This article describes the new implementation and analyses its parallel performance. Practical applications are presented that focus on single-stranded DNA, an area of research which has been so far under-investigated. The LAMMPS implementation of oxDNA lowers the entry barrier for using the oxDNA model significantly, facilitates future code development and interfacing with existing LAMMPS functionality as well as other coarse-grained and atomistic DNA models.
Multimodal biometric digital watermarking on immigrant visas for homeland security
NASA Astrophysics Data System (ADS)
Sasi, Sreela; Tamhane, Kirti C.; Rajappa, Mahesh B.
2004-08-01
Passengers with immigrant visas are a major concern for international airports because of the various fraud operations identified. To curb tampering with genuine visas, visas should contain human identification information. Biometric characteristics are a common and reliable way to authenticate the identity of an individual [1]. A Multimodal Biometric Human Identification System (MBHIS) that integrates iris code, DNA fingerprint, and the passport number on the visa photograph using a digital watermarking scheme is presented. Digital watermarking is well suited for any system requiring high security [2]. Ophthalmologists [3], [4], [5] suggested that an iris scan is an accurate and nonintrusive optical fingerprint. A DNA sequence can be used as a genetic barcode [6], [7]. When a visa is issued at a US consulate, the DNA sequence isolated from saliva, the iris code and the passport number shall be digitally watermarked in the visa photograph. This information is also recorded in the 'immigrant database'. A 'forward watermarking phase' combines a 2-D DWT transformed digital photograph with the personal identification information. A 'detection phase' extracts the watermarked information from the visa photograph at the port of entry, from which the iris code can be used for identification and the DNA biometric for authentication, if an anomaly arises.
McFrederick, Quinn S; Vuong, Hoang Q; Rothman, Jason A
2018-06-01
Gram-stain-positive, rod-shaped, non-spore forming bacteria have been isolated from flowers and the guts of adult wild bees in the families Megachilidae and Halictidae. Phylogenetic analysis of the 16S rRNA gene indicated that these bacteria belong to the genus Lactobacillus, and are most closely related to the honey-bee associated bacteria Lactobacillus kunkeei (97.0 % sequence similarity) and Lactobacillus apinorum (97.0 % sequence similarity). Phylogenetic analyses of 16S rRNA genes and six single-copy protein coding genes, in situ and in silico DNA-DNA hybridization, and fatty-acid profiling differentiates the newly isolated bacteria as three novel Lactobacillus species: Lactobacillus micheneri sp. nov. with the type strain Hlig3 T (=DSM 104126 T ,=NRRL B-65473 T ), Lactobacillus timberlakei with the type strain HV_12 T (=DSM 104128 T ,=NRRL B-65472 T ), and Lactobacillus quenuiae sp. nov. with the type strain HV_6 T (=DSM 104127 T ,=NRRL B-65474 T ).
Landscape of somatic mutations in 560 breast cancer whole-genome sequences
Nik-Zainal, Serena; Davies, Helen; Staaf, Johan; ...
2016-05-02
Here, we analysed whole-genome sequences of 560 breast cancers to advance understanding of the driver mutations conferring clonal advantage and the mutational processes generating somatic mutations. We found that 93 protein-coding cancer genes carried probable driver mutations. Some non-coding regions exhibited high mutation frequencies, but most have distinctive structural features probably causing elevated mutation rates and do not contain driver mutations. Mutational signature analysis was extended to genome rearrangements and revealed twelve base substitution and six rearrangement signatures. Three rearrangement signatures, characterized by tandem duplications or deletions, appear associated with defective homologous-recombination-based DNA repair: one with deficient BRCA1 function, another with deficient BRCA1 or BRCA2 function, the cause of the third is unknown. This analysis of all classes of somatic mutation across exons, introns and intergenic regions highlights the repertoire of cancer genes and mutational processes operating, and progresses towards a comprehensive account of the somatic genetic basis of breast cancer.
Landscape of somatic mutations in 560 breast cancer whole-genome sequences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nik-Zainal, Serena; Davies, Helen; Staaf, Johan
Here, we analysed whole-genome sequences of 560 breast cancers to advance understanding of the driver mutations conferring clonal advantage and the mutational processes generating somatic mutations. We found that 93 protein-coding cancer genes carried probable driver mutations. Some non-coding regions exhibited high mutation frequencies, but most have distinctive structural features probably causing elevated mutation rates and do not contain driver mutations. Mutational signature analysis was extended to genome rearrangements and revealed twelve base substitution and six rearrangement signatures. Three rearrangement signatures, characterized by tandem duplications or deletions, appear associated with defective homologous-recombination-based DNA repair: one with deficient BRCA1 function, another with deficient BRCA1 or BRCA2 function, the cause of the third is unknown. This analysis of all classes of somatic mutation across exons, introns and intergenic regions highlights the repertoire of cancer genes and mutational processes operating, and progresses towards a comprehensive account of the somatic genetic basis of breast cancer.
Landscape of somatic mutations in 560 breast cancer whole genome sequences
Nik-Zainal, Serena; Davies, Helen; Staaf, Johan; Ramakrishna, Manasa; Glodzik, Dominik; Zou, Xueqing; Martincorena, Inigo; Alexandrov, Ludmil B.; Martin, Sancha; Wedge, David C.; Van Loo, Peter; Ju, Young Seok; Smid, Marcel; Brinkman, Arie B; Morganella, Sandro; Aure, Miriam R.; Lingjærde, Ole Christian; Langerød, Anita; Ringnér, Markus; Ahn, Sung-Min; Boyault, Sandrine; Brock, Jane E.; Broeks, Annegien; Butler, Adam; Desmedt, Christine; Dirix, Luc; Dronov, Serge; Fatima, Aquila; Foekens, John A.; Gerstung, Moritz; Hooijer, Gerrit KJ; Jang, Se Jin; Jones, David R.; Kim, Hyung-Yong; King, Tari A.; Krishnamurthy, Savitri; Lee, Hee Jin; Lee, Jeong-Yeon; Li, Yilong; McLaren, Stuart; Menzies, Andrew; Mustonen, Ville; O’Meara, Sarah; Pauporté, Iris; Pivot, Xavier; Purdie, Colin A.; Raine, Keiran; Ramakrishnan, Kamna; Rodríguez-González, F. Germán; Romieu, Gilles; Sieuwerts, Anieta M.; Simpson, Peter T; Shepherd, Rebecca; Stebbings, Lucy; Stefansson, Olafur A; Teague, Jon; Tommasi, Stefania; Treilleux, Isabelle; Van den Eynden, Gert G.; Vermeulen, Peter; Vincent-Salomon, Anne; Yates, Lucy; Caldas, Carlos; van’t Veer, Laura; Tutt, Andrew; Knappskog, Stian; Tan, Benita Kiat Tee; Jonkers, Jos; Borg, Åke; Ueno, Naoto T; Sotiriou, Christos; Viari, Alain; Futreal, P. Andrew; Campbell, Peter J; Span, Paul N.; Van Laere, Steven; Lakhani, Sunil R; Eyfjord, Jorunn E.; Thompson, Alastair M.; Birney, Ewan; Stunnenberg, Hendrik G; van de Vijver, Marc J; Martens, John W.M.; Børresen-Dale, Anne-Lise; Richardson, Andrea L.; Kong, Gu; Thomas, Gilles; Stratton, Michael R.
2016-01-01
We analysed whole genome sequences of 560 breast cancers to advance understanding of the driver mutations conferring clonal advantage and the mutational processes generating somatic mutations. 93 protein-coding cancer genes carried likely driver mutations. Some non-coding regions exhibited high mutation frequencies but most have distinctive structural features probably causing elevated mutation rates and do not harbour driver mutations. Mutational signature analysis was extended to genome rearrangements and revealed 12 base substitution and six rearrangement signatures. Three rearrangement signatures, characterised by tandem duplications or deletions, appear associated with defective homologous recombination based DNA repair: one with deficient BRCA1 function; another with deficient BRCA1 or BRCA2 function; the cause of the third is unknown. This analysis of all classes of somatic mutation across exons, introns and intergenic regions highlights the repertoire of cancer genes and mutational processes operative, and progresses towards a comprehensive account of the somatic genetic basis of breast cancer. PMID:27135926
Ivancic-Jelecki, Jelena; Slovic, Anamarija; Šantak, Maja; Tešović, Goran; Forcic, Dubravko
2016-07-29
The canonical genome organization of measles virus (MV) is characterized by total size of 15 894 nucleotides (nts) and defined length of every genomic region, both coding and non-coding. Only rarely have reports of strains possessing non-canonical genomic properties (possessing indels, with or without the change of total genome length) been published. The observed mutations are mutually compensatory in a sense that the total genome length remains polyhexameric. Although programmed and highly precise pseudo-templated nucleotide additions during transcription are inherent to polymerases of all viruses belonging to family Paramyxoviridae, a similar mechanism that would serve to non-randomly correct genome length, if an indel has occurred during replication, has so far not been described in the context of a complete virus genome. We compiled all complete MV genomic sequences (64 in total) available in open access sequence databases. Multiple sequence comparisons and phylogenetic analyses were performed with the aim of exploring whether non-recombinant measles strains that are not evolutionarily linked and that show deviations from canonical genome organization possess a common genetic characteristic. In 11 MV sequences we detected deviations from canonical genome organization due to short indels located within homopolymeric stretches or next to them. In nine out of 11 identified non-canonical MV sequences, a common feature was observed: one mutation, either an insertion or a deletion, was located in a 28 nts long region in F gene 5' untranslated region (positions 5051-5078 in genomic cDNA of canonical strains). This segment is composed of five tandemly linked homopolymeric stretches; its consensus sequence is G(6-7)C(7-8)A(6-7)G(1-3)C(5-6), where the numbers in parentheses give the length range of each stretch. Although none of the mononucleotide repeats within this segment has fixed length, the total number of nts in canonical strains is always 28.
These nine non-canonical strains, as well as the tenth (not mutated in 5051-5078 segment), can be grouped in three clusters, based on their passage histories/epidemiological data/genetic similarities. There are no indications that the 3 clusters are evolutionarily linked, other than the fact that they all belong to clade D. A common narrow genomic region was found to be mutated in different, non-related, wild type strains suggesting that this region might have a function in non-random genome length corrections occurring during MV replication.
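The two length constraints described in the measles virus abstract (a polyhexameric genome and a 28-nt homopolymeric segment) can be expressed as a short check. This is a sketch under stated assumptions: the consensus is read as five homopolymer runs with the length ranges G 6-7, C 7-8, A 6-7, G 1-3, C 5-6, and the function and constant names are hypothetical rather than taken from the study.

```python
import re

# Consensus of the F gene 5' UTR segment, read as five homopolymer runs
# with variable lengths (an interpretation of the notation in the abstract).
SEGMENT_PATTERN = re.compile(r"G{6,7}C{7,8}A{6,7}G{1,3}C{5,6}")
CANONICAL_SEGMENT_LENGTH = 28
CANONICAL_GENOME_LENGTH = 15894


def is_polyhexameric(genome_length: int) -> bool:
    """True when the genome length is a multiple of six."""
    return genome_length % 6 == 0


def segment_is_canonical(segment: str) -> bool:
    """True when the segment fits the consensus and is exactly 28 nt long."""
    fits_consensus = SEGMENT_PATTERN.fullmatch(segment.upper()) is not None
    return fits_consensus and len(segment) == CANONICAL_SEGMENT_LENGTH


if __name__ == "__main__":
    canonical = "G" * 6 + "C" * 8 + "A" * 6 + "G" * 3 + "C" * 5   # 28 nt
    with_insertion = "G" * 7 + "C" * 8 + "A" * 6 + "G" * 3 + "C" * 5  # 29 nt
    print(is_polyhexameric(CANONICAL_GENOME_LENGTH))
    print(segment_is_canonical(canonical), segment_is_canonical(with_insertion))
```

The second example segment still fits the run-length ranges but has 29 nt, which is the kind of single-nucleotide indel in a homopolymeric stretch that the abstract describes; a compensatory indel elsewhere would be needed to keep the whole genome polyhexameric.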
Microbial metatranscriptomics in a permanent marine oxygen minimum zone.
Stewart, Frank J; Ulloa, Osvaldo; DeLong, Edward F
2012-01-01
Simultaneous characterization of taxonomic composition, metabolic gene content and gene expression in marine oxygen minimum zones (OMZs) has potential to broaden perspectives on the microbial and biogeochemical dynamics in these environments. Here, we present a metatranscriptomic survey of microbial community metabolism in the Eastern Tropical South Pacific OMZ off northern Chile. Community RNA was sampled in late austral autumn from four depths (50, 85, 110, 200 m) extending across the oxycline and into the upper OMZ. Shotgun pyrosequencing of cDNA yielded 180,000 to 550,000 transcript sequences per depth. Based on functional gene representation, transcriptome samples clustered apart from corresponding metagenome samples from the same depth, highlighting the discrepancies between metabolic potential and actual transcription. BLAST-based characterizations of non-ribosomal RNA sequences revealed a dominance of genes involved with both oxidative (nitrification) and reductive (anammox, denitrification) components of the marine nitrogen cycle. Using annotations of protein-coding genes as proxies for taxonomic affiliation, we observed depth-specific changes in gene expression by key functional taxonomic groups. Notably, transcripts most closely matching the genome of the ammonia-oxidizing archaeon Nitrosopumilus maritimus dominated the transcriptome in the upper three depths, representing one in five protein-coding transcripts at 85 m. In contrast, transcripts matching the anammox bacterium Kuenenia stuttgartiensis dominated at the core of the OMZ (200 m; 1 in 12 protein-coding transcripts). The distribution of N. maritimus-like transcripts paralleled that of transcripts matching ammonia monooxygenase genes, which, despite being represented by both bacterial and archaeal sequences in the community DNA, were dominated (> 99%) by archaeal sequences in the RNA, suggesting a substantial role for archaeal nitrification in the upper OMZ. 
These data, as well as those describing other key OMZ metabolic processes (e.g. sulfur oxidation), highlight gene-specific expression patterns in the context of the entire community transcriptome, as well as identify key functional groups for taxon-specific genomic profiling. © 2011 Society for Applied Microbiology and Blackwell Publishing Ltd.
Rudder, Steven; Doohan, Fiona; Creevey, Christopher J; Wendt, Toni; Mullins, Ewen
2014-04-07
Recently it has been shown that Ensifer adhaerens can be used as a plant transformation technology, transferring genes into several plant genomes when equipped with a Ti plasmid. For this study, we have sequenced the genome of Ensifer adhaerens OV14 (OV14) and compared it with those of Agrobacterium tumefaciens C58 (C58) and Sinorhizobium meliloti 1021 (1021); the latter of which has also demonstrated a capacity to genetically transform crop genomes, albeit at significantly reduced frequencies. The 7.7 Mb OV14 genome comprises two chromosomes and two plasmids. All protein coding regions in the OV14 genome were functionally grouped based on an eggNOG database. No genes homologous to the A. tumefaciens Ti plasmid vir genes appeared to be present in the OV14 genome. Unexpectedly, OV14 and 1021 were found to possess homologs to chromosomal based genes cited as essential to A. tumefaciens T-DNA transfer. Of significance, genes that are non-essential but exert a positive influence on virulence and the ability to genetically transform host genomes were identified in OV14 but were absent from the 1021 genome. This study reveals the presence of homologs to chromosomally based Agrobacterium genes that support T-DNA transfer within the genome of OV14 and other alphaproteobacteria. The sequencing and analysis of the OV14 genome increases our understanding of T-DNA transfer by non-Agrobacterium species and creates a platform for the continued improvement of Ensifer-mediated transformation (EMT).
2014-01-01
Background: Recently it has been shown that Ensifer adhaerens can be used as a plant transformation technology, transferring genes into several plant genomes when equipped with a Ti plasmid. For this study, we have sequenced the genome of Ensifer adhaerens OV14 (OV14) and compared it with those of Agrobacterium tumefaciens C58 (C58) and Sinorhizobium meliloti 1021 (1021); the latter of which has also demonstrated a capacity to genetically transform crop genomes, albeit at significantly reduced frequencies. Results: The 7.7 Mb OV14 genome comprises two chromosomes and two plasmids. All protein coding regions in the OV14 genome were functionally grouped based on an eggNOG database. No genes homologous to the A. tumefaciens Ti plasmid vir genes appeared to be present in the OV14 genome. Unexpectedly, OV14 and 1021 were found to possess homologs to chromosomal based genes cited as essential to A. tumefaciens T-DNA transfer. Of significance, genes that are non-essential but exert a positive influence on virulence and the ability to genetically transform host genomes were identified in OV14 but were absent from the 1021 genome. Conclusions: This study reveals the presence of homologs to chromosomally based Agrobacterium genes that support T-DNA transfer within the genome of OV14 and other alphaproteobacteria. The sequencing and analysis of the OV14 genome increases our understanding of T-DNA transfer by non-Agrobacterium species and creates a platform for the continued improvement of Ensifer-mediated transformation (EMT). PMID:24708309
Cheng, Rubin; Zheng, Xiaodong; Ma, Yuanyuan; Li, Qi
2013-01-01
In the present study, we determined the complete mitochondrial DNA (mtDNA) sequences of two species of Cistopus, namely C. chinensis and C. taiwanicus, and conducted a comparative mt genome analysis across the class Cephalopoda. The mtDNA lengths of C. chinensis and C. taiwanicus are 15706 and 15793 nucleotides, with AT contents of 76.21% and 76.5%, respectively. The sequence identity of mtDNA between C. chinensis and C. taiwanicus was 88%, suggesting a close relationship. Compared with C. taiwanicus and other octopods, C. chinensis encoded two additional tRNA genes, showing a novel gene arrangement. In addition, an unusual 23 poly (A) signal structure is found in the ATP8 coding region of C. chinensis. The entire genome and each protein coding gene of the two Cistopus species displayed notable levels of AT and GC skews. Based on sliding window analysis among Octopodiformes, ND1 and ND5 were considered to be more reliable molecular beacons. Phylogenetic analyses based on the 13 protein-coding genes revealed that C. chinensis and C. taiwanicus form a monophyletic group with high statistical support, consistent with previous studies based on morphological characteristics. Our results also indicated that the phylogenetic position of the genus Cistopus is closer to Octopus than to Amphioctopus and Callistoctopus. The complete mtDNA sequences of C. chinensis and C. taiwanicus represent the first whole mt genomes in the genus Cistopus. These novel mtDNA data will be important in refining the phylogenetic relationships within Octopodiformes and enriching the resource of markers for systematic, population genetic and evolutionary biological studies of Cephalopoda. PMID:24358345
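The AT content and the AT and GC skews reported in the preceding abstract can be computed from base counts. The sketch below assumes the commonly used definitions AT skew = (A − T)/(A + T) and GC skew = (G − C)/(G + C); the abstract does not state which formulas or software the authors used, and the function name and the toy sequence are illustrative only.

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Return AT content (percent) and AT/GC skews for a DNA string.

    Skews use the common definitions (A - T)/(A + T) and (G - C)/(G + C);
    a skew is reported as 0.0 when its denominator would be zero.
    """
    counts = Counter(seq.upper())
    a, c, g, t = (counts[base] for base in "ACGT")
    total = a + c + g + t
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {
        "at_content": 100.0 * (a + t) / total,
        "at_skew": (a - t) / (a + t) if a + t else 0.0,
        "gc_skew": (g - c) / (g + c) if g + c else 0.0,
    }


if __name__ == "__main__":
    # Toy sequence, not real Cistopus data.
    print(base_composition("AAATGC"))
```

A sliding-window analysis of the kind mentioned in the abstract could call the same function on successive fixed-length slices of a genome.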
Evidence for a Complex Class of Nonadenylated mRNA in Drosophila
Zimmerman, J. Lynn; Fouts, David L.; Manning, Jerry E.
1980-01-01
The amount, by mass, of poly(A+) mRNA present in the polyribosomes of third-instar larvae of Drosophila melanogaster, and the relative contribution of the poly(A+) mRNA to the sequence complexity of total polysomal RNA, has been determined. Selective removal of poly(A+) mRNA from total polysomal RNA by use of either oligo-dT-cellulose, or poly(U)-sepharose affinity chromatography, revealed that only 0.15% of the mass of the polysomal RNA was present as poly(A+) mRNA. The present study shows that this RNA hybridized at saturation with 3.3% of the single-copy DNA in the Drosophila genome. After correction for asymmetric transcription and reactability of the DNA, 7.4% of the single-copy DNA in the Drosophila genome is represented in larval poly(A+) mRNA. This corresponds to 6.73 × 10^6 nucleotides of mRNA coding sequences, or approximately 5,384 diverse RNA sequences of average size 1,250 nucleotides. However, total polysomal RNA hybridizes at saturation to 10.9% of the single-copy DNA sequences. After correcting this value for asymmetric transcription and tracer DNA reactability, 24% of the single-copy DNA in Drosophila is represented in total polysomal RNA. This corresponds to 2.18 × 10^7 nucleotides of RNA coding sequences or 17,440 diverse RNA molecules of size 1,250 nucleotides. This value is 3.2 times greater than that observed for poly(A+) mRNA, and indicates that ≃69% of the polysomal RNA sequence complexity is contributed by nonadenylated RNA. Furthermore, if the number of different structural genes represented in total polysomal RNA is ≃1.7 × 10^4, then the number of genes expressed in third-instar larvae exceeds the number of chromomeres in Drosophila by about a factor of three. This numerology indicates that the number of chromomeres observed in polytene chromosomes does not reflect the number of structural gene sequences in the Drosophila genome. PMID:6777246
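The arithmetic in the preceding abstract can be retraced in a few lines. The sketch uses only figures quoted in the abstract (6.73 × 10^6 and 2.18 × 10^7 nucleotides of complexity, average size 1,250 nucleotides); the variable and function names are illustrative.

```python
# Figures quoted in the abstract.
POLY_A_COMPLEXITY_NT = 6.73e6      # poly(A+) mRNA complexity, nucleotides
TOTAL_COMPLEXITY_NT = 2.18e7       # total polysomal RNA complexity, nucleotides
AVERAGE_RNA_SIZE_NT = 1250         # assumed average RNA length


def diverse_sequences(complexity_nt: float, average_size_nt: float) -> float:
    """Number of distinct RNA sequences implied by a given complexity."""
    return complexity_nt / average_size_nt


poly_a_sequences = diverse_sequences(POLY_A_COMPLEXITY_NT, AVERAGE_RNA_SIZE_NT)
total_sequences = diverse_sequences(TOTAL_COMPLEXITY_NT, AVERAGE_RNA_SIZE_NT)

# How many times more complex total polysomal RNA is than poly(A+) mRNA.
fold_difference = total_sequences / poly_a_sequences

# Share of the complexity not accounted for by poly(A+) mRNA.
nonadenylated_fraction = 1.0 - poly_a_sequences / total_sequences

if __name__ == "__main__":
    print(f"poly(A+) sequences:      {poly_a_sequences:,.0f}")
    print(f"total polysomal:         {total_sequences:,.0f}")
    print(f"fold difference:         {fold_difference:.1f}")
    print(f"nonadenylated fraction:  {nonadenylated_fraction:.0%}")
```

The results reproduce the abstract's 5,384 and 17,440 sequences, the 3.2-fold difference, and the roughly 69% share attributed to nonadenylated RNA.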
Electron holes appear to trigger cancer-implicated mutations
NASA Astrophysics Data System (ADS)
Miller, John; Villagran, Martha
Malignant tumors are caused by mutations, which also affect their subsequent growth and evolution. We use a novel approach, computational DNA hole spectroscopy [M.Y. Suarez-Villagran & J.H. Miller, Sci. Rep. 5, 13571 (2015)], to compute spectra of enhanced hole probability based on actual sequence data. A hole is a mobile site of positive charge created when an electron is removed, for example by radiation or contact with a mutagenic agent. Peaks in the hole spectrum depict sites where holes tend to localize and potentially trigger a base pair mismatch during replication. Our studies reveal a correlation between hole spectrum peaks and spikes in human mutation frequencies. Importantly, we also find that hole peak positions that do not coincide with large variant frequencies often coincide with cancer-implicated mutations and/or (for coding DNA) encoded conserved amino acids. This enables combining hole spectra with variant data to identify critical base pairs and potential cancer 'driver' mutations. Such integration of DNA hole and variance spectra could also prove invaluable for pinpointing critical regions, and sites of driver mutations, in the vast non-protein-coding genome. Supported by the State of Texas through the Texas Ctr. for Superconductivity.
Arcot Sadagopan, Karthikeyan; Battista, Robert; Keep, Rosanne B; Capasso, Jenina E; Levin, Alex V
2015-06-01
Leber congenital amaurosis (LCA) is most often an autosomal recessive disorder. We report a father and son with autosomal dominant LCA due to a mutation in the CRX gene. DNA screening using an allele specific assay of 90 of the most common LCA-causing variations in the coding sequences of AIPL1, CEP290, CRB1, CRX, GUCY2D, RDH12 and RPE65 was performed on the father. Automated DNA sequencing of his son examining exon 3 of the CRX gene was subsequently performed. Both father and son have a heterozygous single base pair deletion of an adenine at codon 153 in the coding sequence of the CRX gene resulting in a frameshift mutation. Mutations involving the CRX gene may demonstrate an autosomal dominant inheritance pattern for LCA.
Chung, Jonathan H.; Cai, Jinlu; Suskin, Barrie G.; Zhang, Zhengdong; Coleman, Karlene
2015-01-01
The 22q11.2 deletion syndrome (22q11DS) affects 1:4000 live births and presents with highly variable phenotype expressivity. In this study, we developed an analytical approach utilizing whole genome sequencing and integrative analysis to discover genetic modifiers. Our pipeline combined available tools in order to prioritize rare, predicted deleterious, coding and non-coding single nucleotide variants (SNVs) and insertion/deletions (INDELs) from whole genome sequencing (WGS). We sequenced two unrelated probands with 22q11DS, with contrasting clinical findings, and their unaffected parents. Proband P1 had cognitive impairment, psychotic episodes, anxiety, and tetralogy of Fallot (TOF); while proband P2 had juvenile rheumatoid arthritis but no other major clinical findings. In P1, we identified common variants in COMT and PRODH on 22q11.2 as well as rare potentially deleterious DNA variants in other behavioral/neurocognitive genes. We also identified a de novo SNV in ADNP2 (NM_014913.3:c.2243G>C), encoding a neuroprotective protein that may be involved in behavioral disorders. In P2, we identified a novel non-synonymous SNV in ZFPM2 (NM_012082.3:c.1576C>T), a known causative gene for TOF, which may act as a protective variant downstream of TBX1, haploinsufficiency of which is responsible for congenital heart disease in individuals with 22q11DS. PMID:25981510
Li, Xingang; Lu, Hongming; Fan, Guilian; He, Miao; Sun, Yu; Xu, Kai; Shi, Fengjun
2017-11-01
Osteosarcoma (OS) is one of the most prevalent primary malignant bone tumors in adolescents. HOTAIR is highly expressed and associated with epigenetic modifications, especially DNA methylation, in cancer. However, the regulatory mechanism linking HOTAIR and DNA methylation, and their biological effects in the pathogenesis of osteosarcoma, remain elusive. Through RNA-sequencing and computational analysis, followed by a variety of experimental validations, we report a novel interplay between HOTAIR, miR-126, and DNA methylation in OS. We found that HOTAIR is highly expressed in OS cells and that the knockdown of HOTAIR leads to the down-regulation of DNMT1, as well as a decrease in the global DNA methylation level. RNA-sequencing analysis of HOTAIR-regulated genes shows that CDKN2A is significantly repressed by HOTAIR. A series of experiments shows that HOTAIR represses the expression of CDKN2A by inhibiting the promoter activity of CDKN2A through DNA hypermethylation. Further evidence shows that HOTAIR activates the expression of DNMT1 by repressing miR-126, which is a negative regulator of DNMT1. Functionally, HOTAIR depletion increases the sensitivity of OS cells to a DNMT1 inhibitor by regulating the viability and apoptosis of OS cells via the HOTAIR-miR126-DNMT1-CDKN2A axis. These results not only enrich our understanding of the regulatory relationship between non-coding RNA, DNA methylation, and gene expression, but also provide a novel direction for developing more sophisticated therapeutic strategies for OS patients.
NASA Technical Reports Server (NTRS)
Reddy, A. S.; Czernik, A. J.; An, G.; Poovaiah, B. W.
1992-01-01
We cloned and sequenced a plant cDNA that encodes U1 small nuclear ribonucleoprotein (snRNP) 70K protein. The plant U1 snRNP 70K protein cDNA is not full length and lacks the coding region for 68 amino acids in the amino-terminal region as compared to human U1 snRNP 70K protein. Comparison of the deduced amino acid sequence of the plant U1 snRNP 70K protein with the amino acid sequences of animal and yeast U1 snRNP 70K proteins showed a high degree of homology. The plant U1 snRNP 70K protein is more closely related to the human counterpart than to the yeast 70K protein. The carboxy-terminal half is less well conserved but, like the vertebrate 70K proteins, is rich in charged amino acids. Northern analysis with the RNA isolated from different parts of the plant indicates that the snRNP 70K gene is expressed in all of the parts tested. Southern blotting of genomic DNA using the cDNA indicates that the U1 snRNP 70K protein is coded by a single gene.
Detection of Merkel Cell Polyomavirus DNA in Serum Samples of Healthy Blood Donors
Mazzoni, Elisa; Rotondo, John C.; Marracino, Luisa; Selvatici, Rita; Bononi, Ilaria; Torreggiani, Elena; Touzé, Antoine; Martini, Fernanda; Tognon, Mauro G.
2017-01-01
Merkel cell polyomavirus (MCPyV) has been detected in 80% of Merkel cell carcinomas (MCC). In the host, the MCPyV reservoir remains elusive. MCPyV DNA sequences were revealed in blood donor buffy coats. In this study, MCPyV DNA sequences were investigated in the sera (n = 190) of healthy blood donors. Two MCPyV DNA sequences, coding for the viral oncoprotein large T antigen (LT), were investigated using polymerase chain reaction (PCR) methods and DNA sequencing. Circulating MCPyV sequences were detected in sera with a prevalence of 2.6% (5/190), at a low viral DNA load, in the range of 1–4 and 1–5 copies/μl by real-time PCR and droplet digital PCR, respectively. DNA sequencing carried out in the five MCPyV-positive samples indicated that the two MCPyV LT sequences which were analyzed belong to the MKL-1 strain. Circulating MCPyV LT sequences are present in blood donor sera. MCPyV-positive samples from blood donors could represent a potential vehicle for MCPyV infection in recipients, whereas an increase in viral load may occur with multiple blood transfusions. In certain patient conditions, such as immune-depression/suppression, additional disease or old age, transfusion of MCPyV-positive samples could be an additional risk factor for MCC onset. PMID:29238698
Bhattacharya, D; Steinkötter, J; Melkonian, M
1993-12-01
Centrin (= caltractin) is a ubiquitous, cytoskeletal protein which is a member of the EF-hand superfamily of calcium-binding proteins. A centrin-coding cDNA was isolated and characterized from the prasinophyte green alga Scherffelia dubia. Centrin PCR amplification primers were used to isolate partial, homologous cDNA sequences from the green algae Tetraselmis striata and Spermatozopsis similis. Annealing analyses suggested that centrin is a single-copy-coding region in T. striata and S. similis and other green algae studied. Centrin-coding regions from S. dubia, S. similis and T. striata encode four colinear EF-hand domains which putatively bind calcium. Phylogenetic analyses, including homologous sequences from Chlamydomonas reinhardtii and the land plant Atriplex nummularia, demonstrate that the domains of centrins are congruent and arose from the two-fold duplication of an ancestral EF hand with Domains 1+3 and Domains 2+4 clustering. The domains of centrins are also congruent with those of calmodulins demonstrating that, like calmodulin, centrin is an ancient protein which arose within the ancestor of all eukaryotes via gene duplication. Phylogenetic relationships inferred from centrin-coding region comparisons mirror results of small subunit ribosomal RNA sequence analyses suggesting that centrin-coding regions are useful evolutionary markers within the green algae.
RNA processing in Neurospora crassa mitochondria: use of transfer RNA sequences as signals.
Breitenberger, C A; Browning, K S; Alzner-DeWeerd, B; RajBhandary, U L
1985-01-01
We have used RNA gel transfer hybridization, S1 nuclease mapping and primer extension to analyze transcripts derived from several genes in Neurospora crassa mitochondria. The transcripts studied include those for cytochrome oxidase subunit III, 17S rRNA and an unidentified open reading frame. In all three cases, initial transcripts are long, include tRNA sequences, and are subsequently processed to generate the mature RNAs. We find that endpoints of the most abundant transcripts generally coincide with those of tRNA sequences. We therefore conclude that tRNA sequences in long transcripts act as primary signals for RNA processing in N. crassa mitochondria. The situation is somewhat analogous to that observed in mammalian mitochondrial systems. The difference, however, is that in mammalian mitochondria, noncoding spacers between tRNA, rRNA and protein genes are very short and in many cases non-existent, allowing no room for intergenic RNA processing signals whereas, in N. crassa mtDNA, intergenic non-coding sequences are usually several hundred nucleotides long and contain highly conserved GC-rich palindromic sequences. Since these GC-rich palindromic sequences are retained in the processed mature RNAs, we conclude that they do not serve as signals for RNA processing. PMID:2990893
Kaplan, Oktay I; Berber, Burak; Hekim, Nezih; Doluca, Osman
2016-11-02
Many studies show that short non-coding sequences are widely conserved among regulatory elements. More and more conserved sequences are being discovered since the development of next generation sequencing technology. A common approach to identify conserved sequences with regulatory roles relies on topological changes such as hairpin formation at the DNA or RNA level. G-quadruplexes, non-canonical nucleic acid topologies with little established biological roles, are increasingly considered for conserved regulatory element discovery. Since the tertiary structure of G-quadruplexes is strongly dependent on the loop sequence which is disregarded by the generally accepted algorithm, we hypothesized that G-quadruplexes with similar topology and, indirectly, similar interaction patterns, can be determined using phylogenetic clustering based on differences in the loop sequences. Phylogenetic analysis of 52 G-quadruplex forming sequences in the Escherichia coli genome revealed two conserved G-quadruplex motifs with a potential regulatory role. Further analysis revealed that both motifs tend to form hairpins and G quadruplexes, as supported by circular dichroism studies. The phylogenetic analysis as described in this work can greatly improve the discovery of functional G-quadruplex structures and may explain unknown regulatory patterns. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
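The abstract above does not spell out the "generally accepted algorithm" it refers to. A minimal sketch, assuming it means the widely used motif of four runs of at least three guanines separated by loops of 1 to 7 nucleotides, shows how the loop sequences (the part that such a motif search ignores) could be pulled out for later clustering. The function name `find_g4_loops` and the loop-length bounds are illustrative assumptions, not taken from the paper.

```python
import re

# Assumed motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+ (loops matched lazily).
G4_PATTERN = re.compile(
    r"(G{3,})([ACGT]{1,7}?)(G{3,})([ACGT]{1,7}?)(G{3,})([ACGT]{1,7}?)(G{3,})"
)

def find_g4_loops(seq):
    """Return (start, matched_motif, (loop1, loop2, loop3)) for each putative G-quadruplex."""
    seq = seq.upper()
    hits = []
    for m in G4_PATTERN.finditer(seq):
        loops = (m.group(2), m.group(4), m.group(6))
        hits.append((m.start(), m.group(0), loops))
    return hits

if __name__ == "__main__":
    for start, motif, loops in find_g4_loops("TTGGGAGGGTGGGCGGGTT"):
        print(start, motif, loops)
```

The loop tuples returned here are the kind of input a loop-based distance or clustering step would need; only one strand is scanned, so a real search would also have to cover the reverse complement.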
Sensitive Periods in Epigenetics: bringing us closer to complex behavioral phenotypes
Nagy, Corina; Turecki, Gustavo
2017-01-01
Genetic studies have attempted to elucidate causal mechanisms for the development of complex disease but genome-wide associations have been largely unsuccessful in establishing these links. As an alternative link between genes and disease, recent efforts have focused on mechanisms that alter the function of genes without altering the underlying DNA sequence. Known as epigenetic mechanisms, these include: DNA methylation, chromatin conformational changes through histone modifications, non-coding RNAs, and most recently, 5-hydroxymethylcytosine. Though DNA methylation is involved in normal development, aging and gene regulation, altered methylation patterns have been associated with disease. It is generally believed that early life constitutes a period during which there is increased sensitivity to the regulatory effects of epigenetic mechanisms. The purpose of this review is to outline the contribution of epigenetic mechanisms to genomic function, particularly in the development of complex behavioral phenotypes, focusing on the sensitive periods. PMID:22920183
Osipiuk, J; Joachimiak, A
1997-09-12
We propose that the dnaK operon of Thermus thermophilus HB8 is composed of three functionally linked genes: dnaK, grpE, and dnaJ. The dnaK and dnaJ gene products are most closely related to their cyanobacterial homologs. The DnaK protein sequence places T. thermophilus in the plastid Hsp70 subfamily. In contrast, the grpE translated sequence is most similar to GrpE from Clostridium acetobutylicum, a Gram-positive anaerobic bacterium. A single promoter region, with homology to the Escherichia coli consensus promoter sequences recognized by the sigma70 and sigma32 transcription factors, precedes the postulated operon. This promoter is heat-shock inducible. The dnaK mRNA level increased more than 30 times upon 10 min of heat shock (from 70 degrees C to 85 degrees C). A strong transcription terminating sequence was found between the dnaK and grpE genes. The individual genes were cloned into pET expression vectors and the thermophilic proteins were overproduced at high levels in E. coli and purified to homogeneity. The recombinant T. thermophilus DnaK protein was shown to have a weak ATP-hydrolytic activity, with an optimum at 90 degrees C. The ATPase was stimulated by the presence of GrpE and DnaJ. Another open reading frame, coding for ClpB heat-shock protein, was found downstream of the dnaK operon.
Short segment search method for phylogenetic analysis using nested sliding windows
NASA Astrophysics Data System (ADS)
Iskandar, A. A.; Bustamam, A.; Trimarsanto, H.
2017-10-01
Phylogenetic analysis in bioinformatics is most accurate when it is based on the complete coding DNA sequence (CDS). However, analysis of the full CDS is costly in time and money, so a short segment that is representative of the CDS, such as the envelope protein segment or the non-structural 3 (NS3) segment, is needed. Applying a sliding window reveals a short segment that performs better than either the envelope protein segment or NS3. This paper discusses a mathematical method that analyzes sequences using nested sliding windows to find a short segment that is representative of the whole genome. The results show that our method can find a short segment that is, in terms of tree topology, about 6.57% more representative of the CDS segment than an envelope segment or NS3 segment.
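The abstract above does not give the details of the nested sliding window search. A minimal sketch, assuming "nested" means an outer loop over window sizes and an inner loop over start positions, is shown below. The scoring function is a toy stand-in (fraction of variable alignment columns); the paper instead compares tree topologies against the CDS tree. All function names here are hypothetical.

```python
def nested_sliding_windows(length, min_size, max_size, size_step=1, pos_step=1):
    """Yield (start, end) for every window size (outer loop) and start position (inner loop)."""
    for size in range(min_size, max_size + 1, size_step):
        for start in range(0, length - size + 1, pos_step):
            yield start, start + size

def best_segment(alignment, score, min_size, max_size, size_step=1, pos_step=1):
    """Return (best_score, start, end) over all nested windows of an aligned sequence set."""
    length = len(alignment[0])
    best = None
    for start, end in nested_sliding_windows(length, min_size, max_size, size_step, pos_step):
        window = [s[start:end] for s in alignment]
        value = score(window)
        # Strict comparison keeps the first window found among ties.
        if best is None or value > best[0]:
            best = (value, start, end)
    return best

def variable_site_fraction(window):
    """Toy score (an assumption, not the paper's criterion): fraction of columns that vary."""
    cols = list(zip(*window))
    return sum(len(set(c)) > 1 for c in cols) / len(cols)

if __name__ == "__main__":
    aln = ["AAAACGTA", "AAAATGCA", "AAAAGGAA"]
    print(best_segment(aln, variable_site_fraction, 2, 3))
```

Passing the score as a callable keeps the window enumeration separate from the (much more expensive) criterion, so a topology-distance function could be swapped in without changing the search loop.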
Cryptic splice site in the complementary DNA of glucocerebrosidase causes inefficient expression.
Bukovac, Scott W; Bagshaw, Richard D; Rigat, Brigitte A; Callahan, John W; Clarke, Joe T R; Mahuran, Don J
2008-10-15
The low levels of human lysosomal glucocerebrosidase activity expressed in transiently transfected Chinese hamster ovary (CHO) cells were investigated. Reverse transcription PCR (RT-PCR) demonstrated that a significant portion of the transcribed RNA was misspliced owing to the presence of a cryptic splice site in the complementary DNA (cDNA). Missplicing results in the deletion of 179 bp of coding sequence and a premature stop codon. A repaired cDNA was constructed abolishing the splice site without changing the amino acid sequence. The level of glucocerebrosidase expression was increased sixfold. These data demonstrate that for maximum expression of any cDNA construct, the transcription products should be examined.
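The connection between the 179 bp deletion and the premature stop codon reported above can be made explicit with a small check: 179 is not a multiple of 3, so the deletion shifts the reading frame downstream of the splice junction. The helper below is a hypothetical illustration of that arithmetic only, not part of the study.

```python
def deletion_effect(deleted_bp):
    """Classify a coding-sequence deletion by its length modulo the codon size (3)."""
    remainder = deleted_bp % 3
    if remainder == 0:
        return ("in-frame", 0)
    return ("frameshift", remainder)

if __name__ == "__main__":
    print(deletion_effect(179))
```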
Wavelet analysis of frequency chaos game signal: a time-frequency signature of the C. elegans DNA.
Messaoudi, Imen; Oueslati, Afef Elloumi; Lachiri, Zied
2014-12-01
Challenging tasks are encountered in the field of bioinformatics. The choice of the genomic sequence's mapping technique is one of the most fastidious tasks. A judicious choice helps in examining the distribution of periodic patterns that accord with the underlying structure of genomes. Despite that, the search for a coding technique that can highlight all the information contained in the DNA has not yet attracted the attention it deserves. In this paper, we propose a new mapping technique based on the chaos game theory that we call the frequency chaos game signal (FCGS). The particularity of the FCGS coding resides in exploiting the statistical properties of the genomic sequence itself. This may reflect important structural and organizational features of DNA. To prove the usefulness of the FCGS approach in the detection of different local periodic patterns, we use wavelet analysis because it provides access to information that can be obscured by other time-frequency methods such as Fourier analysis. Thus, we apply the continuous wavelet transform (CWT) with the complex Morlet wavelet as the mother wavelet function. Scalograms for the organism Caenorhabditis elegans (C. elegans) exhibit a multitude of periodic organizations of specific DNA sequences.
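The abstract above does not define the FCGS mapping in detail. As a rough sketch under stated assumptions, the code below shows (1) the classic chaos game representation, with an assumed corner assignment for A, C, G, T, and (2) a simple frequency-based 1-D signal in which each position receives the relative frequency of the k-mer starting there. The second function is only one plausible reading of "exploiting the statistical properties of the genomic sequence itself"; both function names are hypothetical.

```python
from collections import Counter

# Assumed corner convention for the unit square; other assignments are in use.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def chaos_game_points(seq):
    """Classic chaos game: move halfway from the current point to the base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]  # raises KeyError on non-ACGT symbols
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

def kmer_frequency_signal(seq, k=2):
    """Map each position to the relative frequency of the k-mer starting there."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return [counts[km] / total for km in kmers]

if __name__ == "__main__":
    print(chaos_game_points("AG"))
    print(kmer_frequency_signal("ATATAT", k=2))
```

A 1-D signal of this kind is what a continuous wavelet transform would take as input; that step is left out here because it needs a third-party wavelet library.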
Galián, José A; Rosato, Marcela; Rosselló, Josep A
2014-03-01
Multigene families have provided opportunities for evolutionary biologists to assess molecular evolution processes and phylogenetic reconstructions at deep and shallow systematic levels. However, the use of these markers is not free of technical and analytical challenges. Many evolutionary studies that used the nuclear 5S rDNA gene family rarely used contiguous 5S coding sequences due to the routine use of head-to-tail polymerase chain reaction primers that are anchored to the coding region. Moreover, the 5S coding sequences have been concatenated with independent, adjacent gene units in many studies, creating simulated chimeric genes as the raw data for evolutionary analysis. This practice is based on the tacitly assumed, but rarely tested, hypothesis that strict intra-locus concerted evolution processes are operating in 5S rDNA genes, without any empirical evidence as to whether it holds for the recovered data. The potential pitfalls of analysing the patterns of molecular evolution and reconstructing phylogenies based on these chimeric genes have not been assessed to date. Here, we compared the sequence integrity and phylogenetic behavior of entire versus concatenated 5S coding regions from a real data set obtained from closely related plant species (Medicago, Fabaceae). Our results suggest that within arrays sequence homogenization is partially operating in the 5S coding region, which is traditionally assumed to be highly conserved. Consequently, concatenating 5S genes increases haplotype diversity, generating novel chimeric genotypes that most likely do not exist within the genome. In addition, the patterns of gene evolution are distorted, leading to incorrect haplotype relationships in some evolutionary reconstructions.
Making the Bend: DNA Tertiary Structure and Protein-DNA Interactions
Harteis, Sabrina; Schneider, Sabine
2014-01-01
DNA structure functions as an overlapping code to the DNA sequence. Rapid progress in understanding the role of DNA structure in gene regulation, DNA damage recognition and genome stability has been made. The three dimensional structure of both proteins and DNA plays a crucial role for their specific interaction, and proteins can recognise the chemical signature of DNA sequence (“base readout”) as well as the intrinsic DNA structure (“shape recognition”). These recognition mechanisms do not exist in isolation but, depending on the individual interaction partners, are combined to various extents. The driving force for the interaction between protein and DNA remains the unique thermodynamics of each individual DNA-protein pair. In this review we focus on the structures and conformations adopted by DNA, both influenced by and influencing the specific interaction with the corresponding protein binding partner, as well as their underlying thermodynamics. PMID:25026169
Sastre-Garau, X; Favre, M; Couturier, J; Orth, G
2000-08-01
We previously described two genital carcinomas (IC2, IC4) containing human papillomavirus type 16 (HPV-16)- or HPV-18-related sequences integrated in chromosomal bands containing the c-myc (8q24) or N-myc (2p24) gene, respectively. The c-myc gene was rearranged and amplified in IC2 cells without evidence of overexpression. The N-myc gene was amplified and highly transcribed in IC4 cells. Here, the sequence of an 8039 bp IC4 DNA fragment containing the integrated viral sequences and the cellular junctions is reported. A 3948 bp segment of the genome of HPV-45 encompassing the upstream regulatory region and the E6 and E7 ORFs was integrated into the untranslated part of N-myc exon 3, upstream of the N-myc polyadenylation signal. Both N-myc and HPV-45 sequences were amplified 10- to 20-fold. The 3' ends of the major N-myc transcript were mapped upstream of the 5' junction. A minor N-myc/HPV-45 fusion transcript was also identified, as well as two abundant transcripts from the HPV-45 E6-E7 region. Large amounts of N-myc protein were detected in IC4 cells. A major alteration of c-myc sequences in IC2 cells involved the insertion of a non-coding sequence into the second intron and their co-amplification with the third exon, without any evidence for the integration of HPV-16 sequences within or close to the gene. Different patterns of myc gene alterations may thus be associated with integration of HPV DNA in genital tumours, including the activation of the protooncogene via a mechanism of insertional mutagenesis and/or gene amplification.
Lo, Y M Dennis
2013-12-01
The discovery of cell-free fetal DNA in maternal plasma in 1997 has stimulated a rapid development of non-invasive prenatal testing. The recent advent of massively parallel sequencing has allowed the analysis of circulating cell-free fetal DNA to be performed with unprecedented sensitivity and precision. Fetal trisomies 21, 18 and 13 are now robustly detectable in maternal plasma and such analyses have been available clinically since 2011. Fetal genome-wide molecular karyotyping and whole-genome sequencing have now been demonstrated in a number of proof-of-concept studies. Genome-wide and targeted sequencing of maternal plasma has been shown to allow the non-invasive prenatal testing of β-thalassaemia and can potentially be generalized to other monogenic diseases. It is thus expected that plasma DNA-based non-invasive prenatal testing will play an increasingly important role in future obstetric care. It is thus timely and important that the ethical, social and legal issues of non-invasive prenatal testing be discussed actively by all parties involved in prenatal care. Copyright © 2013 Reproductive Healthcare Ltd. Published by Elsevier Ltd. All rights reserved.
Stefanska, B; Karlic, H; Varga, F; Fabianowska-Majewska, K; Haslberger, AG
2012-01-01
The hallmarks of carcinogenesis are aberrations in gene expression and protein function caused by both genetic and epigenetic modifications. Epigenetics refers to the changes in gene expression programming that alter the phenotype in the absence of a change in DNA sequence. Epigenetic modifications, which include amongst others DNA methylation, covalent modifications of histone tails and regulation by non-coding RNAs, play a significant role in normal development and genome stability. The changes are dynamic and serve as an adaptation mechanism to a wide variety of environmental and social factors including diet. A number of studies have provided evidence that some natural bioactive compounds found in food and herbs can modulate gene expression by targeting different elements of the epigenetic machinery. Nutrients that are components of one-carbon metabolism, such as folate, riboflavin, pyridoxine, cobalamin, choline, betaine and methionine, affect DNA methylation by regulating the levels of S-adenosyl-L-methionine, a methyl group donor, and S-adenosyl-L-homocysteine, which is an inhibitor of enzymes catalyzing the DNA methylation reaction. Other natural compounds target histone modifications and levels of non-coding RNAs such as vitamin D, which recruits histone acetylases, or resveratrol, which activates the deacetylase sirtuin and regulates oncogenic and tumour suppressor micro-RNAs. As epigenetic abnormalities have been shown to be both causative and contributing factors in different health conditions including cancer, natural compounds that are direct or indirect regulators of the epigenome constitute an excellent approach in cancer prevention and potentially in anti-cancer therapy. PMID:22536923
Superimposed Code Theorectic Analysis of DNA Codes and DNA Computing
2010-03-01
because only certain collections (partitioned by font type) of sequences are allowed to be in each position (e.g., Arial = position 0, Comic ...rigidity of short oligos and the shape of the polar charge. Oligo movement was modeled by a Brownian motion 3 dimensional random walk. The one...temperature, kB is Boltz he viscosity of the medium. The random walk motion is modeled by assuming the oligo is on a three dimensional lattice and may
The agents of natural genome editing.
Witzany, Guenther
2011-06-01
The DNA serves as a stable information storage medium and every protein which is needed by the cell is produced from this blueprint via an RNA intermediate code. More recently it was found that an abundance of various RNA elements cooperate in a variety of steps and substeps as regulatory and catalytic units with multiple competencies to act on RNA transcripts. Natural genome editing on one side is the competent agent-driven generation and integration of meaningful DNA nucleotide sequences into pre-existing genomic content arrangements, and the ability to (re-)combine and (re-)regulate them according to context-dependent (i.e. adaptational) purposes of the host organism. Natural genome editing on the other side designates the integration of all RNA activities acting on RNA transcripts without altering DNA-encoded genes. If we take the genetic code seriously as a natural code, there must be agents that are competent to act on this code because no natural code codes itself as no natural language speaks itself. As code editing agents, viral and subviral agents have been suggested because there are several indicators that demonstrate viruses competent in both RNA and DNA natural genome editing.
Bacolla, Albino; Tainer, John A; Vasquez, Karen M; Cooper, David N
2016-07-08
Gross chromosomal rearrangements (including translocations, deletions, insertions and duplications) are a hallmark of cancer genomes and often create oncogenic fusion genes. An obligate step in the generation of such gross rearrangements is the formation of DNA double-strand breaks (DSBs). Since the genomic distribution of rearrangement breakpoints is non-random, intrinsic cellular factors may predispose certain genomic regions to breakage. Notably, certain DNA sequences with the potential to fold into secondary structures [potential non-B DNA structures (PONDS); e.g. triplexes, quadruplexes, hairpin/cruciforms, Z-DNA and single-stranded looped-out structures with implications in DNA replication and transcription] can stimulate the formation of DNA DSBs. Here, we tested the postulate that these DNA sequences might be found at, or in close proximity to, rearrangement breakpoints. By analyzing the distribution of PONDS-forming sequences within ±500 bases of 19 947 translocation and 46 365 sequence-characterized deletion breakpoints in cancer genomes, we find significant association between PONDS-forming repeats and cancer breakpoints. Specifically, (AT)n, (GAA)n and (GAAA)n constitute the most frequent repeats at translocation breakpoints, whereas A-tracts occur preferentially at deletion breakpoints. Translocation breakpoints near PONDS-forming repeats also recur in different individuals and patient tumor samples. Hence, PONDS-forming sequences represent an intrinsic risk factor for genomic rearrangements in cancer genomes. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
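The windowed search described above (repeats within ±500 bases of a breakpoint) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the minimum repeat counts in the patterns are assumptions chosen for the example, and `repeats_near_breakpoint` is a hypothetical name.

```python
import re

# Repeat classes named in the abstract; minimum copy numbers are assumed thresholds.
REPEAT_PATTERNS = {
    "(AT)n": re.compile(r"(?:AT){4,}"),
    "(GAA)n": re.compile(r"(?:GAA){3,}"),
    "(GAAA)n": re.compile(r"(?:GAAA){3,}"),
    "A-tract": re.compile(r"A{6,}"),
}

def repeats_near_breakpoint(seq, breakpoint, flank=500):
    """Return {repeat_class: [(position, matched_repeat), ...]} within +/- flank of a breakpoint."""
    start = max(0, breakpoint - flank)
    end = min(len(seq), breakpoint + flank)
    window = seq[start:end].upper()
    found = {}
    for name, pattern in REPEAT_PATTERNS.items():
        # Positions are reported in the coordinates of the full sequence.
        hits = [(start + m.start(), m.group(0)) for m in pattern.finditer(window)]
        if hits:
            found[name] = hits
    return found

if __name__ == "__main__":
    seq = "C" * 20 + "ATATATATAT" + "C" * 20 + "AAAAAAA" + "C" * 20
    print(repeats_near_breakpoint(seq, breakpoint=30, flank=15))
```

Testing for a significant association, as the study does, would additionally require comparing these counts against windows around control positions.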
Non-coding RNA generated following lariat-debranching mediates targeting of AID to DNA
Zheng, Simin; Vuong, Bao Q.; Vaidyanathan, Bharat; Lin, Jia-Yu; Huang, Feng-Ting; Chaudhuri, Jayanta
2015-01-01
SUMMARY Transcription through immunoglobulin switch (S) regions is essential for class switch recombination (CSR) but no molecular function of the transcripts has been described. Likewise, recruitment of activation-induced cytidine deaminase (AID) to S regions is critical for CSR; however, the underlying mechanism has not been fully elucidated. Here, we demonstrate that intronic switch RNA acts in trans to target AID to S region DNA. AID binds directly to switch RNA through G-quadruplexes formed by the RNA molecules. Disruption of this interaction by mutation of a key residue in the putative RNA-binding domain of AID impairs recruitment of AID to S region DNA, thereby abolishing CSR. Additionally, inhibition of RNA lariat processing leads to loss of AID localization to S regions and compromises CSR; both defects can be rescued by exogenous expression of switch transcripts in a sequence-specific manner. These studies uncover an RNA-mediated mechanism of targeting AID to DNA. PMID:25957684
Cloning and expression of cDNA coding for bouganin.
den Hartog, Marcel T; Lubelli, Chiara; Boon, Louis; Heerkens, Sijmie; Ortiz Buijsse, Antonio P; de Boer, Mark; Stirpe, Fiorenzo
2002-03-01
Bouganin is a ribosome-inactivating protein that recently was isolated from Bougainvillea spectabilis Willd. In this work, the cloning and expression of the cDNA encoding for bouganin is described. From the cDNA, the amino-acid sequence was deduced, which correlated with the primary sequence data obtained by amino-acid sequencing on the native protein. Bouganin is synthesized as a pro-peptide consisting of 305 amino acids, the first 26 of which act as a leader signal while the 29 C-terminal amino acids are cleaved during processing of the molecule. The mature protein consists of 250 amino acids. Using the cDNA sequence encoding the mature protein of 250 amino acids, a recombinant protein was expressed, purified and characterized. The recombinant molecule had similar activity in a cell-free protein synthesis assay and had comparable toxicity on living cells as compared to the isolated native bouganin.
Mitochondrial DNA haplogroup phylogeny of the dog: Proposal for a cladistic nomenclature.
Fregel, Rosa; Suárez, Nicolás M; Betancor, Eva; González, Ana M; Cabrera, Vicente M; Pestano, José
2015-05-01
Canis lupus familiaris mitochondrial DNA analysis has increased in recent years, not only for the purpose of deciphering dog domestication but also for forensic genetic studies or breed characterization. The resultant accumulation of data has increased the need for a normalized and phylogenetic-based nomenclature like those provided for human maternal lineages. Although a standardized classification has been proposed, haplotype names within clades have been assigned gradually without considering the evolutionary history of dog mtDNA. Moreover, this classification is based only on the D-loop region, proven to be insufficient for phylogenetic purposes due to its high number of recurrent mutations and the lack of relevant information present in the coding region. In this study, we design 1) a refined mtDNA cladistic nomenclature from a phylogenetic tree based on complete sequences, classifying dog maternal lineages into haplogroups defined by specific diagnostic mutations, and 2) a coding region SNP analysis that allows a more accurate classification into haplogroups when combined with D-loop sequencing, thus improving the phylogenetic information obtained in dog mitochondrial DNA studies. Copyright © 2015 Elsevier B.V. All rights reserved.
Yong, Hoi-Sen; Song, Sze-Looi; Lim, Phaik-Eem; Chan, Kok-Gan; Chow, Wan-Loo; Eamsobhana, Praphathip
2015-01-01
The whole mitochondrial genome of the pest fruit fly Bactrocera arecae was obtained from next-generation sequencing of genomic DNA. It had a total length of 15,900 bp, consisting of 13 protein-coding genes, 2 rRNA genes, 22 tRNA genes and a non-coding region (A + T-rich control region). The control region (952 bp) was flanked by rrnS and trnI genes. The start codons included 6 ATG, 3 ATT and 1 each of ATA, ATC, GTG and TCG. Eight TAA, two TAG, one incomplete TA and two incomplete T stop codons were represented in the protein-coding genes. The cloverleaf structure for trnS1 lacked the D-loop, and that of trnN and trnF lacked the TΨC-loop. Molecular phylogeny based on 13 protein-coding genes was concordant with 37 mitochondrial genes, with B. arecae having closest genetic affinity to B. tryoni. The subgenus Bactrocera of Dacini tribe and the Dacinae subfamily (Dacini and Ceratitidini tribes) were monophyletic. The whole mitogenome of B. arecae will serve as a useful dataset for studying the genetics, systematics and phylogenetic relationships of the many species of Bactrocera genus in particular, and tephritid fruit flies in general. PMID:26472633
Isoform Sequencing and State-of-Art Applications for Unravelling Complexity of Plant Transcriptomes
An, Dong; Li, Changsheng; Humbeck, Klaus
2018-01-01
Single-molecule real-time (SMRT) sequencing developed by PacBio, also called third-generation sequencing (TGS), offers longer reads than the second-generation sequencing (SGS). Given its ability to obtain full-length transcripts without assembly, isoform sequencing (Iso-Seq) of transcriptomes by PacBio is advantageous for genome annotation, identification of novel genes and isoforms, as well as the discovery of long non-coding RNA (lncRNA). In addition, Iso-Seq gives access to the direct detection of alternative splicing, alternative polyadenylation (APA), gene fusion, and DNA modifications. Such applications of Iso-Seq facilitate the understanding of gene structure, post-transcriptional regulatory networks, and subsequently proteomic diversity. In this review, we summarize its applications in plant transcriptome study, specifically pointing out challenges associated with each step in the experimental design and highlight the development of bioinformatic pipelines. We aim to provide the community with an integrative overview and a comprehensive guidance to Iso-Seq, and thus to promote its applications in plant research. PMID:29346292
Reddy, Sushma; Kimball, Rebecca T; Pandey, Akanksha; Hosner, Peter A; Braun, Michael J; Hackett, Shannon J; Han, Kin-Lan; Harshman, John; Huddleston, Christopher J; Kingston, Sarah; Marks, Ben D; Miglia, Kathleen J; Moore, William S; Sheldon, Frederick H; Witt, Christopher C; Yuri, Tamaki; Braun, Edward L
2017-09-01
Phylogenomics, the use of large-scale data matrices in phylogenetic analyses, has been viewed as the ultimate solution to the problem of resolving difficult nodes in the tree of life. However, it has become clear that analyses of these large genomic data sets can also result in conflicting estimates of phylogeny. Here, we use the early divergences in Neoaves, the largest clade of extant birds, as a "model system" to understand the basis for incongruence among phylogenomic trees. We were motivated by the observation that trees from two recent avian phylogenomic studies exhibit conflicts. Those studies used different strategies: 1) collecting many characters [~42 megabase pairs (Mbp) of sequence data] from 48 birds, sometimes including only one taxon for each major clade; and 2) collecting fewer characters (~0.4 Mbp) from 198 birds, selected to subdivide long branches. However, the studies also used different data types: the taxon-poor data matrix comprised 68% non-coding sequences whereas coding exons dominated the taxon-rich data matrix. This difference raises the question of whether the primary reason for incongruence is the number of sites, the number of taxa, or the data type. To test among these alternative hypotheses we assembled a novel, large-scale data matrix comprising 90% non-coding sequences from 235 bird species. Although increased taxon sampling appeared to have a positive impact on phylogenetic analyses, the most important variable was data type. Indeed, by analyzing different subsets of the taxa in our data matrix we found that increased taxon sampling actually resulted in increased congruence with the tree from the previous taxon-poor study (which had a majority of non-coding data) instead of the taxon-rich study (which largely used coding data).
We suggest that the observed differences in the estimates of topology for these studies reflect data-type effects due to violations of the models used in phylogenetic analyses, some of which may be difficult to detect. If incongruence among trees estimated using phylogenomic methods largely reflects problems with model fit, developing more "biologically-realistic" models is likely to be critical for efforts to reconstruct the tree of life. [Birds; coding exons; GTR model; model fit; Neoaves; non-coding DNA; phylogenomics; taxon sampling.] © The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved.
Walworth, Nathan G.; Pfreundt, Ulrike; Nelson, William C.; ...
2015-04-07
Understanding the evolution of the free-living, cyanobacterial, diazotroph Trichodesmium is of great importance due to its critical role in oceanic biogeochemistry and primary production. Unlike the other >150 available genomes of free-living cyanobacteria, only 63.8% of the Trichodesmium erythraeum (strain IMS101) genome is predicted to encode protein, which is 20-25% less than the average for other cyanobacteria and non-pathogenic, free-living bacteria. We use distinctive isolates and metagenomic data to show that low coding density observed in IMS101 is a common feature of the Trichodesmium genus both in culture and in situ. Transcriptome analysis indicates that 86% of the non-coding space is expressed, although the function of these transcripts is unclear. The density of noncoding, possible regulatory elements predicted in Trichodesmium, when normalized per intergenic kilobase, was comparable to and twofold higher than that found in the gene-dense genomes of the sympatric cyanobacterial genera Synechococcus and Prochlorococcus, respectively. Conserved Trichodesmium ncRNA secondary structures were predicted between most culture and metagenomic sequences, lending support to the structural conservation. Conservation of these intergenic regions in spatiotemporally separated Trichodesmium populations suggests possible genus-wide selection for their maintenance. These large intergenic spacers may have developed during intervals of strong genetic drift caused by periodic blooms of a subset of genotypes, which may have reduced effective population size. Our data suggest that transposition of selfish DNA, low effective population size, and high fidelity replication allowed the unusual ‘inflation’ of noncoding sequence observed in Trichodesmium despite its oligotrophic lifestyle.
Funk, Helena T; Berg, Sabine; Krupinska, Karin; Maier, Uwe G; Krause, Kirsten
2007-08-22
The holoparasitic plant genus Cuscuta comprises species with photosynthetic capacity and functional chloroplasts as well as achlorophyllous and intermediate forms with restricted photosynthetic activity and degenerated chloroplasts. Previous data indicated significant differences with respect to the plastid genome coding capacity in different Cuscuta species that could correlate with their photosynthetic activity. In order to shed light on the molecular changes accompanying the parasitic lifestyle, we sequenced the plastid chromosomes of the two species Cuscuta reflexa and Cuscuta gronovii. Both species are capable of performing photosynthesis, albeit with varying efficiencies. Together with the plastid genome of Epifagus virginiana, an achlorophyllous parasitic plant whose plastid genome has been sequenced, these species represent a series of progression towards total dependency on the host plant, ranging from reduced levels of photosynthesis in C. reflexa to a restricted photosynthetic activity and degenerated chloroplasts in C. gronovii to an achlorophyllous state in E. virginiana. The newly sequenced plastid genomes of C. reflexa and C. gronovii reveal that the chromosome structures are generally very similar to that of non-parasitic plants, although a number of species-specific insertions, deletions (indels) and sequence inversions were identified. However, we observed a gradual adaptation of the plastid genome to the different degrees of parasitism. The changes are particularly evident in C. gronovii and include (a) the parallel losses of genes for the subunits of the plastid-encoded RNA polymerase and the corresponding promoters from the plastid genome, (b) the first documented loss of the gene for a putative splicing factor, MatK, from the plastid genome and (c) a significant reduction of RNA editing. 
Overall, the comparative genomic analysis of plastid DNA from parasitic plants indicates a bias towards a simplification of the plastid gene expression machinery as a consequence of an increasing dependency on the host plant. A tentative assignment of the successive events in the adaptation of the plastid genomes to parasitism can be inferred from the current data set. This includes (1) a loss of non-coding regions in photosynthetic Cuscuta species that has resulted in a condensation of the plastid genome, (2) the simplification of plastid gene expression in species with largely impaired photosynthetic capacity and (3) the deletion of a significant part of the genetic information, including the information for the photosynthetic apparatus, in non-photosynthetic parasitic plants. PMID:17714582
Toward rules relating zinc finger protein sequences and DNA binding site preferences.
Desjarlais, J R; Berg, J M
1992-08-15
Zinc finger proteins of the Cys2-His2 type consist of tandem arrays of domains, where each domain appears to contact three adjacent base pairs of DNA through three key residues. We have designed and prepared a series of variants of the central zinc finger within the DNA binding domain of Sp1 by using information from an analysis of a large data base of zinc finger protein sequences. Through systematic variations at two of the three contact positions (underlined), relatively specific recognition of sequences of the form 5'-GGGGN(G or T)GGG-3' has been achieved. These results provide the basis for rules that may develop into a code that will allow the design of zinc finger proteins with preselected DNA site specificity.
Specific DNA binding of the two chicken Deformed family homeodomain proteins, Chox-1.4 and Chox-a.
Sasaki, H; Yokoyama, E; Kuroiwa, A
1990-01-01
The cDNA clones encoding two chicken Deformed (Dfd) family homeobox containing genes Chox-1.4 and Chox-a were isolated. Comparison of their amino acid sequences with another chicken Dfd family homeodomain protein and with those of mouse homologues revealed that strong homologies are located in the amino terminal regions and around the homeodomains. Although homologies in other regions were relatively low, some short conserved sequences were also identified. E. coli-made full length proteins were purified and used for the production of specific antibodies and for DNA binding studies. The binding profiles of these proteins to the 5'-leader and 5'-upstream sequences of Chox-1.4 and Chox-a coding regions were analyzed by immunoprecipitation and DNase I footprint assays. These two Chox proteins bound to the same sites in the 5'-flanking sequences of their coding regions with various affinities and their binding affinities to each site were nearly the same. The consensus sequences of the high and low affinity binding sites were TAATGA(C/G) and CTAATTTT, respectively. A clustered binding site was identified in the 5'-upstream of the Chox-a gene, suggesting that this clustered binding site works as a cis-regulatory element for auto- and/or cross-regulation of Chox-a gene expression. PMID:1970866
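The two consensus sites reported above can be turned into a simple sequence scan. The following is a minimal sketch, not the footprinting procedure used in the study: it assumes the consensus strings can be read as plain patterns, with (C/G) meaning either base, and it searches both strands. The function names and the test sequences are hypothetical.

```python
import re

# Consensus sites quoted in the abstract, written as regular expressions.
HIGH_AFFINITY = "TAATGA[CG]"   # TAATGA(C/G)
LOW_AFFINITY = "CTAATTTT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(seq, pattern):
    """Return sorted (start, strand) pairs for every match on either strand.

    `start` is the 0-based coordinate on the forward strand of the leftmost
    base covered by the match. A lookahead is used so that overlapping
    matches are also reported.
    """
    seq = seq.upper()
    regex = re.compile("(?=(%s))" % pattern)
    hits = [(m.start(), "+") for m in regex.finditer(seq)]
    for m in regex.finditer(reverse_complement(seq)):
        length = len(m.group(1))
        hits.append((len(seq) - m.start() - length, "-"))
    return sorted(hits)
```

A scan of this kind only flags candidate sites; it says nothing about binding affinity, which the abstract determined experimentally.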
SGP-1: Prediction and Validation of Homologous Genes Based on Sequence Alignments
Wiehe, Thomas; Gebauer-Jung, Steffi; Mitchell-Olds, Thomas; Guigó, Roderic
2001-01-01
Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors. PMID:11544202
Decoding the non-coding genome: elucidating genetic risk outside the coding genome.
Barr, C L; Misener, V L
2016-01-01
Current evidence emerging from genome-wide association studies indicates that the genetic underpinnings of complex traits are likely attributable to genetic variation that changes gene expression, rather than (or in combination with) variation that changes protein-coding sequences. This is particularly compelling with respect to psychiatric disorders, as genetic changes in regulatory regions may result in differential transcriptional responses to developmental cues and environmental/psychosocial stressors. Until recently, however, the link between transcriptional regulation and psychiatric genetic risk has been understudied. Multiple obstacles have contributed to the paucity of research in this area, including challenges in identifying the positions of remote (distal from the promoter) regulatory elements (e.g. enhancers) and their target genes and the underrepresentation of neural cell types and brain tissues in epigenome projects - the availability of high-quality brain tissues for epigenetic and transcriptome profiling, particularly for the adolescent and developing brain, has been limited. Further challenges have arisen in the prediction and testing of the functional impact of DNA variation with respect to multiple aspects of transcriptional control, including regulatory-element interaction (e.g. between enhancers and promoters), transcription factor binding and DNA methylation. Further, the brain has uncommon DNA-methylation marks with unique genomic distributions not found in other tissues - current evidence suggests the involvement of non-CG methylation and 5-hydroxymethylation in neurodevelopmental processes but much remains unknown. We review here knowledge gaps as well as both technological and resource obstacles that will need to be overcome in order to elucidate the involvement of brain-relevant gene-regulatory variants in genetic risk for psychiatric disorders. © 2015 John Wiley & Sons Ltd and International Behavioural and Neural Genetics Society.
Glenn, Travis C; Lance, Stacey L; McKee, Anna M; Webster, Bonnie L; Emery, Aidan M; Zerlotini, Adhemar; Oliveira, Guilherme; Rollinson, David; Faircloth, Brant C
2013-10-17
Urogenital schistosomiasis caused by Schistosoma haematobium is widely distributed across Africa and is increasingly being targeted for control. Genome sequences and population genetic parameters can give insight into the potential for population- or species-level drug resistance. Microsatellite DNA loci are genetic markers in wide use by Schistosoma researchers, but there are few primers available for S. haematobium. We sequenced 1,058,114 random DNA fragments from clonal cercariae collected from a snail infected with a single Schistosoma haematobium miracidium. We assembled and aligned the S. haematobium sequences to the genomes of S. mansoni and S. japonicum, identifying microsatellite DNA loci across all three species and designing primers to amplify the loci in S. haematobium. To validate our primers, we screened 32 randomly selected primer pairs with population samples of S. haematobium. We designed >13,790 primer pairs to amplify unique microsatellite loci in S. haematobium (available at http://www.cebio.org/projetos/schistosoma-haematobium-genome). The three Schistosoma genomes contained similar overall frequencies of microsatellites, but the frequency and length distributions of specific motifs differed among species. We identified 15 primer pairs that amplified consistently and were easily scored. We genotyped these 15 loci in S. haematobium individuals from six locations: Zanzibar had the highest levels of diversity; Malawi, Mauritius, Nigeria, and Senegal were nearly as diverse; but the sample from South Africa was much less diverse. About half of the primers in the database of Schistosoma haematobium microsatellite DNA loci should yield amplifiable and easily scored polymorphic markers, thus providing thousands of potential markers. Sequence conservation among S. haematobium, S. japonicum, and S. mansoni is relatively high, thus it should now be possible to identify markers that are universal among Schistosoma species (i.e., using DNA sequences conserved among species), as well as other markers that are specific to species or species-groups (i.e., using DNA sequences that differ among species). Full genome-sequencing of additional species and specimens of S. haematobium, S. japonicum, and S. mansoni is desirable to better characterize differences within and among these species, to develop additional genetic markers, and to examine genes as well as conserved non-coding elements associated with drug resistance.
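The abstract does not say which software was used to identify microsatellite loci. As an illustration of the general idea only, the sketch below finds perfect tandem repeats with a regular expression. The thresholds (unit length 2 to 6 bp, at least 5 repeats) and the function name are assumptions chosen for the example, not values taken from the study.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    The repeat unit is matched lazily, so the shortest unit that satisfies
    the repeat threshold is reported. Homopolymer runs are skipped.
    """
    seq = seq.upper()
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # e.g. "AA" repeated is a mononucleotide run
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits
```

Real pipelines usually also tolerate imperfect repeats and record flanking sequence for primer design; this sketch does neither.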
Roux-Rouquie, Magali; Marilley, Monique
2000-01-01
We have modeled local DNA sequence parameters to search for DNA architectural motifs involved in transcription regulation and promotion within the Xenopus laevis ribosomal gene promoter and the intergenic spacer (IGS) sequences. The IGS was found to be shaped into distinct topological domains. First, intrinsic bends split the IGS into domains of common but different helical features. Local parameters at inter-domain junctions exhibit a high variability with respect to intrinsic curvature, bendability and thermal stability. Secondly, the repeated sequence blocks of the IGS exhibit right-handed supercoiled structures which could be related to their enhancer properties. Thirdly, the gene promoter presents both inherent curvature and minor groove narrowing which may be viewed as motifs of a structural code for protein recognition and binding. Such pre-existing deformations could simply be remodeled during the binding of the transcription complex. Alternatively, these deformations could pre-shape the promoter in such a way that further remodeling is facilitated. Mutations shown to abolish promoter curvature as well as intrinsic minor groove narrowing, in a variant which maintained full transcriptional activity, bring circumstantial evidence for structurally-preorganized motifs in relation to transcription regulation and promotion. Using well documented X. laevis rDNA regulatory sequences we showed that computer modeling may be of invaluable assistance in assessing encrypted architectural motifs. The evidence of these DNA topological motifs with respect to the concept of structural code is discussed. PMID:10982860
Specific and non-specific interactions of ParB with DNA: implications for chromosome segregation
Taylor, James A.; Pastrana, Cesar L.; Butterer, Annika; Pernstich, Christian; Gwynn, Emma J.; Sobott, Frank; Moreno-Herrero, Fernando; Dillingham, Mark S.
2015-01-01
The segregation of many bacterial chromosomes is dependent on the interactions of ParB proteins with centromere-like DNA sequences called parS that are located close to the origin of replication. In this work, we have investigated the binding of Bacillus subtilis ParB to DNA in vitro using a variety of biochemical and biophysical techniques. We observe tight and specific binding of a ParB homodimer to the parS sequence. Binding of ParB to non-specific DNA is more complex and displays apparent positive co-operativity that is associated with the formation of larger, poorly defined, nucleoprotein complexes. Experiments with magnetic tweezers demonstrate that non-specific binding leads to DNA condensation that is reversible by protein unbinding or force. The condensed DNA structure is not well ordered and we infer that it is formed by many looping interactions between neighbouring DNA segments. Consistent with this view, ParB is also able to stabilize writhe in single supercoiled DNA molecules and to bridge segments from two different DNA molecules in trans. The experiments provide no evidence for the promotion of non-specific DNA binding and/or condensation events by the presence of parS sequences. The implications of these observations for chromosome segregation are discussed. PMID:25572315
Turmel, Monique; Otis, Christian; Lemieux, Claude
2016-09-19
To probe organelle genome evolution in the Ulvales/Ulotrichales clade, the newly sequenced chloroplast and mitochondrial genomes of Gloeotilopsis planctonica and Gloeotilopsis sarcinoidea (Ulotrichales) were compared with those of Pseudendoclonium akinetum (Ulotrichales) and of the few other green algae previously sampled in the Ulvophyceae. At 105,236 bp, the G. planctonica mitochondrial DNA (mtDNA) is the largest mitochondrial genome reported so far among chlorophytes, whereas the 221,431-bp G. planctonica and 262,888-bp G. sarcinoidea chloroplast DNAs (cpDNAs) are the largest chloroplast genomes analyzed among the Ulvophyceae. Gains of non-coding sequences largely account for the expansion of these genomes. Both Gloeotilopsis cpDNAs lack the inverted repeat (IR) typically found in green plants, indicating that two independent IR losses occurred in the Ulvales/Ulotrichales. Our comparison of the Pseudendoclonium and Gloeotilopsis cpDNAs offered clues regarding the mechanism of IR loss in the Ulotrichales, suggesting that internal sequences from the rDNA operon were differentially lost from the two original IR copies during this process. Our analyses also unveiled a number of genetic novelties. Short mtDNA fragments were discovered in two distinct regions of the G. sarcinoidea cpDNA, providing the first evidence for intracellular inter-organelle gene migration in green algae. We identified for the first time in green algal organelles, group II introns with LAGLIDADG ORFs as well as group II introns inserted into untranslated gene regions. We discovered many group II introns occupying sites not previously documented for the chloroplast genome and demonstrated that a number of them arose by intragenomic proliferation, most likely through retrohoming. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. PMID:27503298
Intra-specific variation in genome size in maize: cytological and phenotypic correlates
Realini, María Florencia; Poggio, Lidia; Cámara-Hernández, Julián; González, Graciela Esther
2016-01-01
Genome size variation accompanies the diversification and evolution of many plant species. Relationships between DNA amount and phenotypic and cytological characteristics form the basis of most hypotheses that ascribe a biological role to genome size. The goal of the present research was to investigate the intra-specific variation in the DNA content in maize populations from Northeastern Argentina and further explore the relationship between genome size and the phenotypic traits seed weight and length of the vegetative cycle. Moreover, cytological parameters such as the percentage of heterochromatin as well as the number, position and sequence composition of knobs were analysed and their relationships with 2C DNA values were explored. The populations analysed presented significant differences in 2C DNA amount, from 4.62 to 6.29 pg, representing 36.15% of the inter-populational variation. Moreover, intra-populational genome size variation was found, varying from 1.08 to 1.63-fold. Variation in the percentage of knob heterochromatin as well as in the number, chromosome position and sequence composition of the knobs was detected among and within the populations. Although a positive relationship between genome size and the percentage of heterochromatin was observed, a significant correlation was not found. This confirms that other non-coding repetitive DNA sequences are contributing to the genome size variation. Although a positive relationship between DNA amount and seed weight has been reported in a large number of species, this relationship was not found in the populations studied here. The length of the vegetative cycle showed a positive correlation with the percentage of heterochromatin. This result allowed attributing an adaptive effect to heterochromatin since the length of this cycle would be optimized via selection for an appropriate percentage of heterochromatin. PMID:26644343
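The 36.15% figure quoted above appears to be the difference between the largest and smallest 2C values expressed relative to the smallest. That reading is an inference from the numbers, not something the abstract states. The short check below reproduces it; the function names are hypothetical.

```python
def relative_range_percent(lowest, highest):
    """Difference between the extremes as a percentage of the lowest value."""
    return (highest - lowest) / lowest * 100.0


def fold_difference(lowest, highest):
    """Ratio of the highest value to the lowest value."""
    return highest / lowest


# 2C DNA amounts (pg) for the extreme populations quoted in the abstract.
print(round(relative_range_percent(4.62, 6.29), 2))  # 36.15
print(round(fold_difference(4.62, 6.29), 2))         # 1.36
```

Expressed the same way as the intra-populational figures, the inter-populational extremes therefore differ by about 1.36-fold.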
Mechanisms of radiation-induced gene responses
DOE Office of Scientific and Technical Information (OSTI.GOV)
Woloschak, G.E.; Paunesku, T.
1996-10-01
In the process of identifying genes differentially expressed in cells exposed to ultraviolet radiation, we have identified a transcript having a 26-bp region that is highly conserved in a variety of species including Bacillus circulans, yeast, pumpkin, Drosophila, mouse, and man. When in the 5' region (flanking region or UTR) of a gene, the sequence is predominantly in +/+ orientation with respect to the coding DNA strand, while in the coding region and the 3' region (UTR), the sequence is most frequently in the +/- orientation with respect to the coding DNA strand. In two genes, the element is split into two parts; however, in most cases, it is found only once but with a minimum of 11 consecutive nucleotides precisely depicting the original sequence. The element is found in a large number of different genes with diverse functions (from human ras p21 to B. circulans chitinase). Gel shift assays demonstrated the presence of a protein in HeLa cell extracts that binds to the sense and antisense single-stranded consensus oligomers, as well as to the double-stranded oligonucleotide. When double-stranded oligomer was used, the size shift demonstrated an additional protein-oligomer complex larger than the one bound to either sense or antisense single-stranded consensus oligomers alone. It is speculated either that this element binds to protein(s) important in maintaining DNA in a single-stranded orientation for transcription or, alternatively, that this element is important in the transcription-coupled DNA repair process.
BASiNET-BiologicAl Sequences NETwork: a case study on coding and non-coding RNAs identification.
Ito, Eric Augusto; Katahira, Isaque; Vicente, Fábio Fernandes da Rocha; Pereira, Luiz Filipe Protasio; Lopes, Fabrício Martins
2018-06-05
With the emergence of Next Generation Sequencing (NGS) technologies, a large volume of sequence data, in particular from de novo sequencing, has been rapidly produced at relatively low cost. In this context, computational tools are increasingly important to assist in the identification of relevant information to understand the functioning of organisms. This work introduces BASiNET, an alignment-free tool for classifying biological sequences based on feature extraction from complex network measurements. The method initially transforms the sequences and represents them as complex networks. Then it extracts topological measures and constructs a feature vector that is used to classify the sequences. The method was evaluated in the classification of coding and non-coding RNAs of 13 species and compared to the CNCI, PLEK and CPC2 methods. BASiNET outperformed all compared methods in all adopted organisms and datasets. BASiNET classified sequences in all organisms with high accuracy and low standard deviation, showing that the method is robust and not biased by the organism. The proposed methodology is implemented as open source in the R language and is freely available for download at https://cran.r-project.org/package=BASiNET.
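The sequence-to-network-to-feature-vector pipeline described above can be illustrated with a toy example. This is a rough sketch of the general idea, written in Python instead of R, and it is not the BASiNET implementation: the word size, the step, the edge-weight threshold and the four measures computed here are all assumptions made for illustration.

```python
from collections import Counter


def kmer_network(seq, word=3, step=1):
    """Map a sequence to an undirected weighted network of adjacent words.

    Nodes are words of length `word`; an edge links two words that occur
    next to each other in the sequence, weighted by how often that happens.
    """
    seq = seq.upper()
    words = [seq[i:i + word] for i in range(0, len(seq) - word + 1, step)]
    edges = Counter()
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops
            edges[frozenset((a, b))] += 1
    return edges


def topological_features(edges, threshold=0):
    """Return [nodes, edges, average degree, maximum degree].

    Only edges whose weight exceeds `threshold` are kept.
    """
    kept = [edge for edge, weight in edges.items() if weight > threshold]
    degree = Counter()
    for edge in kept:
        for node in edge:
            degree[node] += 1
    n_nodes = len(degree)
    n_edges = len(kept)
    avg_degree = 2 * n_edges / n_nodes if n_nodes else 0.0
    max_degree = max(degree.values(), default=0)
    return [n_nodes, n_edges, avg_degree, max_degree]
```

Feature vectors computed at several thresholds could then be concatenated and passed to any standard classifier; the choice of classifier is outside the scope of this sketch.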
Ahmad, Muneer; Jung, Low Tan; Bhuiyan, Al-Amin
2017-10-01
Digital signal processing techniques commonly employ fixed length window filters to process the signal contents. DNA signals differ in characteristics from common digital signals since they carry nucleotides as contents. The nucleotides have genetic code context and fuzzy behaviors due to their special structure and order in the DNA strand. Employing conventional fixed length window filters for DNA signal processing produces spectral leakage and hence results in signal noise. A biological context aware adaptive window filter is required to process the DNA signals. This paper introduces a biologically inspired fuzzy adaptive window median filter (FAWMF) which computes the fuzzy membership strength of nucleotides in each slide of the window and filters nucleotides based on median filtering with a combination of s-shaped and z-shaped filters. Since coding regions cause 3-base periodicity by an unbalanced nucleotide distribution producing a relatively high bias in nucleotide usage, this fundamental characteristic of nucleotides has been exploited in FAWMF to suppress the signal noise. Along with the adaptive response of FAWMF, a strong correlation between median nucleotides and the Π shaped filter was observed, which produced enhanced discrimination between coding and non-coding regions compared with fixed length conventional window filters. The proposed FAWMF attains a significant enhancement in coding region identification, i.e. 40% to 125% as compared to other conventional window filters, tested over more than 250 benchmarked and randomly taken DNA datasets of different organisms. This study shows that conventional fixed length window filters applied to DNA signals do not achieve significant results since the nucleotides carry genetic code context. The proposed FAWMF algorithm is adaptive and significantly outperforms them in processing DNA signal contents.
The algorithm, applied to a variety of DNA datasets, produced noteworthy discrimination between coding and non-coding regions compared with fixed window length conventional filters. Copyright © 2017 Elsevier B.V. All rights reserved.
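The 3-base periodicity that the abstract exploits can be measured with a plain Fourier sum over binary indicator sequences. The sketch below shows that conventional fixed-window measure, the kind of baseline the abstract argues against. It is not the FAWMF algorithm, and the default window length and step are arbitrary assumptions.

```python
import cmath


def period3_power(seq):
    """Spectral power at period 3, summed over the four bases.

    For each base, the indicator sequence (1 where the base occurs, else 0)
    is projected onto the frequency 1/3; the squared magnitudes are summed
    and divided by the sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        component = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, ch in enumerate(seq)
            if ch == base
        )
        total += abs(component) ** 2
    return total / n


def sliding_period3(seq, window=351, step=3):
    """Return (start, power) pairs for fixed-length windows along `seq`."""
    last_start = max(len(seq) - window, 0)
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]
```

Windows with high power are candidates for coding regions. With a fixed window the boundaries are blurred by roughly one window length, which is the limitation an adaptive window is meant to address.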
A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes.
Mehmood, Tahir; Bohlin, Jon; Snipen, Lars
2015-01-01
The upstream region of coding genes is important for several reasons, for instance for locating transcription factor binding sites and start site initiation in genomic DNA. Motivated by a recently conducted study, where a multivariate approach was successfully applied to coding sequence modeling, we have introduced a partial least squares (PLS) based procedure for the classification of true upstream prokaryotic sequence from background upstream sequence. The upstream sequences of conserved coding genes over genomes were considered in the analysis, where conserved coding genes were found by using the pan-genomics concept for each considered prokaryotic species. PLS uses a position specific scoring matrix (PSSM) to study the characteristics of the upstream region. Results obtained by the PLS based method were compared with Gini importance of random forest (RF) and support vector machine (SVM), which is a much-used method for sequence classification. The upstream sequence classification performance was evaluated by using cross validation, and the suggested approach identifies the prokaryotic upstream region significantly better than RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method also produced results that concurred with known biological characteristics of the upstream region.
Smurf2 Regulates DNA Repair and Packaging to Prevent Tumors | Center for Cancer Research
The blueprint for all of a cell’s functions is written in the genetic code of DNA sequences as well as in the landscape of DNA and histone modifications. DNA is wrapped around histones to package it into chromatin, which is stored in the nucleus. It is important to maintain the integrity of the chromatin structure to ensure that the cell continues to behave appropriately.
Silicon nanowire sensor for DNA detection and sequencing: an ab initio simulation
NASA Astrophysics Data System (ADS)
Lu, Wenchang; Li, Yan; Hodak, Miroslav; Xiao, Zhongcan; Bernholc, Jerry
Electrical sensors able to detect DNA replication and determine its sequence would enable fast and relatively cheap diagnosis of gene-related vulnerabilities and cancers. At present, it is already possible to electrically monitor DNA replication events using a Klenow fragment of polymerase I attached to a carbon nanotube. Since devices based on Si nanowires would be much easier to produce in quantity, we examine theoretically the sensitivity of a Si nanowire/Klenow fragment for electrical detection of nucleotide addition. A highly parallel real-space multigrid code is used for DFT-based non-equilibrium Green's function calculations involving up to 16,000 atoms, employing highly accurate variationally optimized localized orbitals. We find that the open and closed Klenow fragment configurations, prior to and during nucleotide addition, respectively, screen the Si nanowire differently and result in a detectable current difference. The sensitivity is the largest in the subthreshold regime while the absolute current difference is maximized in the turn-on state. The sensitivity decreases with an increase of the nanowire size, as expected, but the current difference between different enzymatic states is nearly independent of the nanowire size up to 800 Å² cross section.
Sargsyan, Ori
2012-05-25
Hitchhiking and severe bottleneck effects have an impact on the dynamics of genetic diversity of a population by inducing homogenization at a single locus and at the genome-wide scale, respectively. As a result, identification and differentiation of the signatures of such events from DNA sequence data at a single locus is challenging. This study develops an analytical framework for identifying and differentiating recent homogenization events at multiple neutral loci in low recombination regions. The dynamics of genetic diversity at a locus after a recent homogenization event is modeled according to the infinite-sites mutation model and the Wright-Fisher model of reproduction with constant population size. In this setting, I derive analytical expressions for the distribution, mean, and variance of the number of polymorphic sites in a random sample of DNA sequences from a locus affected by a recent homogenization event. Based on this framework, three likelihood-ratio based tests are presented for identifying and differentiating recent homogenization events at multiple loci. Lastly, I apply the framework to two data sets. First, I consider human DNA sequences from four non-coding loci on different chromosomes for inferring evolutionary history of modern human populations. The results suggest, in particular, that recent homogenization events at the loci are identifiable when the effective human population size is 50000 or greater in contrast to 10000, and the estimates of the recent homogenization events agree with the “Out of Africa” hypothesis. Second, I use HIV DNA sequences from HIV-1-infected patients to infer the times of HIV seroconversions. The estimates are contrasted with other estimates derived as the mid-time point between the last HIV-negative and first HIV-positive screening tests. Finally, the results show that significant discrepancies can exist between the estimates.
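For orientation, the equilibrium baseline under the same two models (infinite-sites mutation, constant-size Wright-Fisher reproduction, no recombination) has a well-known closed form for the mean and variance of the number of polymorphic sites in a sample of n sequences. The sketch below computes that standard neutral result only. It is not the post-homogenization distribution derived in the study, which the abstract does not state. The function name and the parameter `theta` (the scaled mutation rate for the locus) are assumptions.

```python
def segregating_sites_moments(n, theta):
    """Equilibrium mean and variance of the number of polymorphic sites.

    Standard neutral result for a sample of n sequences:
      mean = theta * a_n,  variance = theta * a_n + theta**2 * b_n,
    with a_n = sum(1/i) and b_n = sum(1/i**2) for i = 1..n-1.
    """
    a_n = sum(1.0 / i for i in range(1, n))
    b_n = sum(1.0 / i ** 2 for i in range(1, n))
    return theta * a_n, theta * a_n + theta ** 2 * b_n
```

A locus that was recently homogenized would be expected to show fewer polymorphic sites than this baseline, which is the kind of departure a likelihood-ratio test can target.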
Sequence and Structure Dependent DNA-DNA Interactions
NASA Astrophysics Data System (ADS)
Kopchick, Benjamin; Qiu, Xiangyun
Molecular forces between dsDNA strands are largely dominated by electrostatics and have been extensively studied. Quantitative knowledge has been accumulated on how DNA-DNA interactions are modulated by varied biological constituents such as ions, cationic ligands, and proteins. Despite its central role in biology, the sequence of DNA has not received substantial attention and "random" DNA sequences are typically used in biophysical studies. However, ~50% of the human genome is composed of non-random-sequence DNAs, particularly repetitive sequences. Furthermore, covalent modifications of DNA such as methylation play key roles in gene functions. Such DNAs with specific sequences or modifications often take on structures other than the canonical B-form. Here we present a series of quantitative measurements of the DNA-DNA forces with the osmotic stress method on different DNA sequences, from short repeats to the most frequent sequences in the genome, and to modifications such as bromination and methylation. We observe peculiar behaviors that appear to be strongly correlated with the incurred structural changes. We speculate on the causalities in terms of the differences in hydration shell and DNA surface structures.
Long interspersed repeated DNA (LINE) causes polymorphism at the rat insulin 1 locus.
Lakshmikumaran, M S; D'Ambrosio, E; Laimins, L A; Lin, D T; Furano, A V
1985-09-01
The insulin 1, but not the insulin 2, locus is polymorphic (i.e., exhibits allelic variation) in rats. Restriction enzyme analysis and hybridization studies showed that the polymorphic region is 2.2 kilobases upstream of the insulin 1 coding region and is due to the presence or absence of an approximately 2.7-kilobase repeated DNA element. DNA sequence determination showed that this DNA element is a member of a long interspersed repeated DNA family (LINE) that is highly repeated (greater than 50,000 copies) and highly transcribed in the rat. Although the presence or absence of LINE sequences at the insulin 1 locus occurs in both the homozygous and heterozygous states, LINE-containing insulin 1 alleles are more prevalent in the rat population than are alleles without LINEs. Restriction enzyme analysis of the LINE-containing alleles indicated that at least two versions of the LINE sequence may be present at the insulin 1 locus in different rats. Either repeated transposition of LINE sequences or gene conversion between the resident insulin 1 LINE and other sequences in the genome are possible explanations for this.
Portable and Error-Free DNA-Based Data Storage.
Yazdi, S M Hossein Tabatabaei; Gabrys, Ryan; Milenkovic, Olgica
2017-07-10
DNA-based data storage is an emerging nonvolatile memory technology of potentially unprecedented density, durability, and replication efficiency. The basic system implementation steps include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders. Here we show for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. The novelty of our approach is to design an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. Our work represents the only known random access DNA-based data storage system that uses error-prone nanopore sequencers, while still producing error-free readouts with the highest reported information rate/density. As such, it represents a crucial step towards practical employment of DNA molecules as storage media.
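The basic idea of writing user information into DNA strings can be sketched with the simplest possible mapping, two bits per nucleotide. This is an illustration only and not the encoding used in the paper, which the abstract says also avoids error-prone sequences, adds addresses for random access, and applies deletion error-correcting codes. The base order `ACGT` and the function names are assumptions.

```python
BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bits first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(strand: str) -> bytes:
    """Invert bytes_to_dna; the strand length must be a multiple of four."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)
```

With this mapping a single substitution corrupts one byte and a single deletion shifts every later byte, which is why practical systems layer constrained coding and error correction on top of a raw mapping like this one.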
The complete DNA sequence of lymphocystis disease virus.
Tidona, C A; Darai, G
1997-04-14
Lymphocystis disease virus (LCDV) is the causative agent of lymphocystis disease, which has been reported to occur in over 100 different fish species worldwide. LCDV is a member of the family Iridoviridae and the type species of the genus Lymphocystivirus. The virions contain a single linear double-stranded DNA molecule, which is circularly permuted, terminally redundant, and heavily methylated at cytosines in CpG sequences. The complete nucleotide sequence of LCDV-1 (flounder isolate) was determined by automated cycle sequencing and primer walking. The genome of LCDV-1 is 102,653 bp in length and contains 195 open reading frames with coding capacities ranging from 40 to 1199 amino acids. Computer-assisted analyses of the deduced amino acid sequences led to the identification of several putative gene products with significant homologies to entries in protein data banks, such as the two major subunits of the viral DNA-dependent RNA polymerase, DNA polymerase, several protein kinases, two subunits of the ribonucleoside diphosphate reductase, DNA methyltransferase, the viral major capsid protein, insulin-like growth factor, and tumor necrosis factor receptor homolog.
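Open reading frames of the kind counted in this abstract can be located with a simple scan. The sketch below is a generic illustration, not the authors' analysis pipeline. It searches only the forward strand, assumes the standard stop codons, and treats an ORF as running from an ATG to the next in-frame stop. The function name `find_orfs` is hypothetical, and the default `min_codons=40` merely echoes the smallest coding capacity reported above.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed genetic code)


def find_orfs(seq, min_codons=40):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # Codons from ATG up to, but excluding, the stop codon.
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

A full genome annotation would also scan the reverse complement and handle a circularly permuted molecule; those steps are left out here to keep the sketch short.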
Soares, René Arderius; Passaglia, Luciane Maria Pereira
2010-10-01
Bradyrhizobium elkanii is successfully used in the formulation of commercial inoculants and, together with B. japonicum, it fully supplies the plant nitrogen demands. Despite the similarity between the B. japonicum and B. elkanii species, several works have demonstrated genetic and physiological differences between them. In this work Representational Difference Analysis (RDA) was used for genomic comparison between B. elkanii SEMIA 587, a crop inoculant strain, and B. japonicum USDA 110, a reference strain. Two hundred sequences were obtained. Of these, 46 sequences belonged exclusively to the genome of the B. elkanii strain, and 154 showed similarity to sequences from the B. japonicum genome. Of the 46 sequences with no similarity to sequences from B. japonicum, 39 showed no similarity to sequences in public databases and seven showed similarity to sequences of genes coding for known proteins. These seven sequences were divided into three groups: similar to sequences from other Bradyrhizobium strains, similar to sequences from other nitrogen-fixing bacteria, and similar to sequences from non-nitrogen-fixing bacteria. These new sequences could be used as DNA markers in order to investigate the rates of genetic material gain and loss in natural Bradyrhizobium strains.
Le Chevanton, L; Leblon, G
1989-04-15
We cloned the ura5 gene coding for the orotate phosphoribosyl transferase from the ascomycete Sordaria macrospora by heterologous probing of a Sordaria genomic DNA library with the corresponding Podospora anserina sequence. The Sordaria gene was expressed in an Escherichia coli pyrE mutant strain defective for the same enzyme, and expression was shown to be promoted by plasmid sequences. The nucleotide sequence of the 1246-bp DNA fragment encompassing the region of homology with the Podospora gene has been determined. This sequence contains an open reading frame of 699 nucleotides. The deduced amino acid sequence shows 72% similarity with the corresponding Podospora protein.